
How to Use AI Models in Cloudflare Through HTTP Requests

Amit Hariyale

Full Stack Web Developer, Gigawave

8 min read · June 14, 2025

How to use AI models in Cloudflare through HTTP requests matters in real projects: weak implementation choices create hard-to-debug failures and an inconsistent user experience.

This guide uses focused, production-oriented steps and code examples grounded in official references.

Key Concepts Covered

  • Cloudflare Workers AI REST API
  • Bearer token authentication
  • Model inference envelope pattern
  • Server-Sent Events (SSE) streaming
  • Token usage tracking
  • Edge deployment patterns

Context Setup

We start with minimal setup, then move to implementation patterns and validation checkpoints for using AI models in Cloudflare through HTTP requests.

Problem Breakdown

  • Unclear setup path for calling AI models in Cloudflare through HTTP requests
  • Inconsistent implementation patterns
  • Missing validation for edge cases

Solution Overview

Apply a step-by-step architecture: setup, core implementation, validation, and performance checks.

Step 1: Define prerequisites and expected behavior for calling AI models in Cloudflare through HTTP requests.


Step 2: Implement a minimal working baseline.


Step 3: Add robust handling for non-happy paths.


Additional Implementation Notes

  • Step 4: Improve structure for reuse and readability.
  • Step 5: Validate with realistic usage scenarios.

Best Practices

  • Keep implementation modular and testable
  • Use one clear source of truth for configuration
  • Validate behavior before optimization

Pro Tips

  • Prefer concise code snippets with clear intent
  • Document edge cases and trade-offs
  • Use official docs for API-level decisions

Final Thoughts

Treat calling AI models in Cloudflare through HTTP requests as an iterative build: baseline first, then reliability and performance hardening.

Full Generated Content (Unabridged)


Blog Identity

  • title: How to Use AI Models in Cloudflare Through HTTP Requests
  • slug: cloudflare-ai-http-requests
  • primary topic keyword: Cloudflare AI Workers
  • target stack: Cloudflare Workers, JavaScript/TypeScript, REST API

SEO Metadata

  • seoTitle: Call Cloudflare AI Models via HTTP: A Practical Guide
  • metaDescription: Learn to invoke Cloudflare's AI models through direct HTTP requests. Step-by-step setup, authentication, and working code examples for production use.
  • suggestedTags: cloudflare workers, ai inference, http api, serverless, machine learning, rest api, edge computing
  • suggestedReadTime: 8 min

Hero Hook

You need AI inference without managing GPUs, containers, or cold starts. Cloudflare's AI platform promises exactly that—models running at the edge, milliseconds from your users. But the docs scatter the details across Workers bindings, REST API, and model catalogs. You just want a clean HTTP call that returns embeddings or completions.

This guide cuts through the fragmentation. You'll set up authentication, construct proper requests, handle responses, and deploy with confidence. No Workers bindings required—pure HTTP for maximum flexibility.

Context Setup

Cloudflare AI provides serverless inference for popular open models (Llama, Mistral, BERT, Whisper, SDXL) through two interfaces: Workers AI bindings (JavaScript-native) and direct REST API calls. This guide focuses on the REST API—ideal when your stack isn't Cloudflare-native, when you need to call from external services, or when you want explicit control over request/response handling.

Prerequisites:

  • Cloudflare account (free tier works)
  • API token with Workers AI permissions (the Workers AI template, or a custom token with Account:Read + Workers AI:Edit)
  • Account ID from your Cloudflare dashboard
  • curl or HTTP client of choice (we'll use fetch examples)

Problem Breakdown

Key failure points when calling Cloudflare AI via HTTP:

Symptom → Root Cause

  • 401 Unauthorized → token lacks the Workers AI scope, or wrong account ID
  • 404 Not Found → model name typo, or model unavailable in your region
  • 400 Bad Request → missing required fields (prompt, messages) or wrong content-type
  • Empty/truncated responses → missing stream: false, or mishandled chunked encoding
  • Unexpected latency → running a full-precision model where a quantized variant (e.g. @cf/meta/llama-2-7b-chat-int8) would serve

Real project symptoms: Your edge function works locally but fails in production because environment variables aren't injected. Your streaming parser chokes on SSE format. Your retry logic hammers the API and triggers rate limits.
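The failure modes above reduce to a small set of recovery decisions. As a sketch (the helper and its action names are this guide's own convention, not anything from Cloudflare's API):

```javascript
// classify-error.js — map an HTTP status from the Cloudflare AI endpoint to a
// recovery action, mirroring the symptom table above. Names are illustrative.
function classifyAiError(status) {
  if (status === 401 || status === 403) return 'check-token';   // scope or account ID problem
  if (status === 404) return 'check-model-name';                // typo or unavailable model
  if (status === 400) return 'fix-request';                     // missing fields, wrong content-type
  if (status === 429 || status >= 500) return 'retry-backoff';  // transient: back off and retry
  return 'inspect';                                             // anything else: log and investigate
}
```

Routing on this result keeps retry logic from hammering the API on errors that will never succeed.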

Solution Overview

Chosen approach: Direct REST API with explicit authentication

Why this over Workers bindings:

  • Portability: Same code runs in Deno, Node, Bun, or external orchestrators
  • Debugging: Full visibility into request/response headers
  • Composition: Easy to wrap with caching, retries, or fallbacks

Trade-off: You manage serialization and error handling manually. Workers bindings handle some of this automatically.

Alternative considered: Workers AI bindings—use when 100% Cloudflare-native and you want automatic request batching.

Implementation Steps

Step 1: Locate Your Credentials

Navigate to Cloudflare Dashboard → Workers & Pages → AI → Get started. Copy your Account ID from the right sidebar.

Create an API token: My Profile → API Tokens → Create Token → Use "Workers AI (Beta)" template, or custom token with:

  • Account: Cloudflare AI:Edit
  • Account: Account:Read

Store these securely—never commit to version control.
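One way to load and sanity-check these credentials at startup, so a missing variable fails loudly instead of surfacing later as a confusing 401 (the CF_ACCOUNT_ID and CF_API_TOKEN names are this guide's convention):

```javascript
// config.js — fail fast if credentials are missing instead of getting a
// confusing 401/404 at request time.
function loadCloudflareConfig(env = process.env) {
  const accountId = env.CF_ACCOUNT_ID;
  const apiToken = env.CF_API_TOKEN;
  if (!accountId || !apiToken) {
    throw new Error('Set CF_ACCOUNT_ID and CF_API_TOKEN in the environment');
  }
  return { accountId, apiToken };
}
```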

Step 2: Verify Model Availability

List available models to confirm your target exists and get the exact model name:

implementation-steps-1.sh
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/models/search \
  -H "Authorization: Bearer {API_TOKEN}"

Look for name field (e.g., @cf/meta/llama-3-8b-instruct, @cf/baai/bge-base-en-v1.5).
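The search endpoint uses the same success/result envelope as inference. A small sketch for pulling out just the model names (the result array shape is assumed from the general envelope pattern; verify it against the live response):

```javascript
// list-models.js — given a parsed /ai/models/search response body, return the
// model names, optionally filtered by a substring.
function extractModelNames(body, filter = '') {
  if (!body.success) {
    throw new Error(`Model search failed: ${JSON.stringify(body.errors)}`);
  }
  return body.result
    .map(m => m.name)                       // each entry carries a `name` field
    .filter(name => name.includes(filter)); // e.g. filter = 'llama'
}
```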

Step 3: Construct the Inference Request

Base URL pattern:

implementation-steps-2.text
POST https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL_NAME}

Required headers:

  • Authorization: Bearer {TOKEN}
  • Content-Type: application/json

Step 4: Implement Text Generation

For chat/completion models, structure your payload with messages array:

implementation-steps-3.js
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Explain edge computing in one sentence.' }
      ],
      stream: false, // Set true for SSE streaming
      max_tokens: 256
    })
  }
);

Step 5: Handle the Response

Parse the JSON envelope. Cloudflare wraps model output in a standard response structure:

implementation-steps-4.js
const data = await response.json();

if (!data.success) {
  // Check data.errors for details
  throw new Error(data.errors?.[0]?.message || 'Inference failed');
}

const generatedText = data.result.response; // For chat models
// OR for embeddings: data.result.data[0]

Step 6: Add Production Hardening

Wrap with timeout, retry, and circuit breaker logic appropriate to your runtime.
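A minimal sketch of that hardening layer: a timeout via AbortController plus bounded retries with exponential backoff, retrying only transient failures. The retry count and backoff constants are illustrative, not recommendations from Cloudflare:

```javascript
// hardened-fetch.js — wrap fetch with a per-attempt timeout and retries.
// Only 429 and 5xx are retried; other 4xx will not improve on retry.
async function fetchWithRetry(url, init = {}, { retries = 3, timeoutMs = 10_000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, { ...init, signal: controller.signal });
      if (response.ok || (response.status < 500 && response.status !== 429)) {
        return response; // success, or a non-retryable client error
      }
      if (attempt === retries) return response; // out of retries: surface it
    } catch (err) {
      if (attempt === retries) throw err; // timeout or network error on final attempt
    } finally {
      clearTimeout(timer);
    }
    await new Promise(r => setTimeout(r, 2 ** attempt * 250)); // 250ms, 500ms, 1s...
  }
}
```

Drop this in wherever the snippets call fetch directly; a circuit breaker would sit one layer above, tripping after repeated final failures.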

Code Snippets

Snippet 1: Complete Text Generation Client

code-snippet-1.js
// ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function generateText(prompt, options = {}) {
  const model = options.model || '@cf/meta/llama-3-8b-instruct';
  const maxTokens = options.maxTokens || 512;

  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'user', content: prompt }
      ],
      max_tokens: maxTokens,
      temperature: options.temperature ?? 0.7,
      stream: false
    })
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`HTTP ${response.status}: ${error}`);
  }

  const data = await response.json();

  if (!data.success) {
    throw new Error(`API error: ${data.errors?.map(e => e.message).join(', ')}`);
  }

  return {
    text: data.result.response,
    usage: data.result.usage // { prompt_tokens, completion_tokens, total_tokens }
  };
}

Snippet 2: Embedding Generation

code-snippet-2.js
// embeddings.js
// Same environment credentials as ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function createEmbedding(text, model = '@cf/baai/bge-base-en-v1.5') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ text })
  });

  const data = await response.json();

  if (!data.success) {
    throw new Error(`Embedding failed: ${JSON.stringify(data.errors)}`);
  }

  return {
    embedding: data.result.data[0], // Float32 array
    shape: data.result.shape // [1, 768] for bge-base
  };
}

Snippet 3: Streaming Response Handler

code-snippet-3.js
// streaming.js
// Same environment credentials as ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function* streamText(prompt, model = '@cf/meta/llama-3-8b-instruct') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      stream: true // Enable Server-Sent Events
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n').filter(line => line.trim());

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const json = line.slice(6);
        if (json === '[DONE]') return;

        const parsed = JSON.parse(json);
        yield parsed.response; // Incremental text token
      }
    }
  }
}

Code Explanation

Critical lines decoded:

  • stream: false → forces a complete response instead of SSE chunks; omitting it causes parser errors if you expect plain JSON
  • data.result.response → extracts generated text from the envelope; model changes may nest this differently, so always verify the shape
  • data.result.data[0] → embedding vector for the first (only) input; batching multiple texts returns an array of arrays
  • line.startsWith('data: ') → SSE protocol parsing; some proxies buffer SSE, so disable buffering or use fetch directly

What can go wrong: The streaming parser assumes well-formed SSE. If Cloudflare returns an error mid-stream (rate limit, model overload), you'll get a non-JSON line that JSON.parse throws on. Wrap the parse in try/catch and validate parsed.response exists before yielding.
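A defensive version of that parse step, tolerating malformed lines instead of letting one bad chunk kill the generator (the event shape with a top-level response field follows the streaming snippet above; treat it as an assumption to verify against the live stream):

```javascript
// sse-parse.js — parse one SSE line into a token, a done marker, or null.
function parseSseLine(line) {
  if (!line.startsWith('data: ')) return null;      // comments, blank keep-alives
  const payload = line.slice(6);
  if (payload === '[DONE]') return { done: true };  // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload);
    if (typeof parsed.response !== 'string') return null; // unexpected shape
    return { token: parsed.response };
  } catch {
    return null; // non-JSON line (e.g. mid-stream error text): skip, optionally log
  }
}
```

In the generator loop, yield only when parseSseLine returns a token and return when it reports done.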

Validation Checklist

  • [ ] curl list models returns 200 with model array
  • [ ] Text generation returns success: true and non-empty result.response
  • [ ] Token usage field present in response (result.usage.total_tokens)
  • [ ] Embedding vector length matches model spec (768 for bge-base, 1024 for large)
  • [ ] Streaming endpoint yields incremental tokens, terminates on [DONE]
  • [ ] 401/403 errors resolve with token scope check in dashboard
  • [ ] Response time under 2s for simple prompts (indicates edge routing works)
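The middle checks in this list can be automated with a small assertion helper (field names follow the envelope shown earlier; this helper is a sketch, not part of any SDK):

```javascript
// validate-envelope.js — structural checks on a chat-completion response body
// before trusting it downstream. Returns a list of problems; empty means pass.
function validateChatEnvelope(body) {
  const problems = [];
  if (body.success !== true) problems.push('success !== true');
  if (typeof body.result?.response !== 'string' || body.result.response.length === 0) {
    problems.push('missing or empty result.response');
  }
  if (typeof body.result?.usage?.total_tokens !== 'number') {
    problems.push('missing result.usage.total_tokens');
  }
  return problems;
}
```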

Edge Cases

  • Model temporarily unavailable → 503 with errors.code: 1001; mitigate with exponential backoff and a fallback model
  • Prompt exceeds context window → 400 with "context length exceeded"; pre-tokenize, truncate with a tiktoken estimate
  • Empty or whitespace prompt → 400 bad request; validate prompt.trim().length > 0 before the call
  • Concurrent high-volume calls → 429 rate limit; implement a token bucket, queue, or request batching
  • Binary/image input for vision models → requires base64 encoding with a data URI; use the image array field, not a raw binary body
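The empty-prompt and context-window mitigations above fit in one pre-flight check. The 4-characters-per-token heuristic here is a rough estimate, not a tokenizer, and the default limit is illustrative:

```javascript
// validate-prompt.js — reject obviously bad prompts before spending an API call.
function validatePrompt(prompt, maxContextTokens = 4096) {
  if (typeof prompt !== 'string' || prompt.trim().length === 0) {
    return { ok: false, reason: 'empty prompt' };
  }
  const estimatedTokens = Math.ceil(prompt.length / 4); // ~4 chars/token heuristic
  if (estimatedTokens > maxContextTokens) {
    return { ok: false, reason: `~${estimatedTokens} tokens exceeds ${maxContextTokens}` };
  }
  return { ok: true, estimatedTokens };
}
```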

Best Practices

  • Do store credentials in environment variables, never in source
  • Do validate response envelope before accessing result fields
  • Do implement request timeouts (fetch has no built-in timeout; use an AbortController)
  • Do cache embeddings for identical inputs—computation is expensive
  • Don't retry 4xx errors blindly—fix the request first
  • Don't assume all models use messages format—embeddings use text, image gen uses prompt
  • Don't ignore usage fields—monitor costs and optimize context windows
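For the embedding-caching point above, a minimal in-memory sketch; a real deployment might back this with Cloudflare KV and a TTL instead of a plain Map, and embedFn stands in for something like the createEmbedding function from Snippet 2:

```javascript
// embedding-cache.js — memoize embeddings by input text so identical inputs
// hit the API only once.
function makeCachedEmbedder(embedFn) {
  const cache = new Map();
  return async function cachedEmbed(text) {
    if (cache.has(text)) return cache.get(text); // cache hit: no API call
    const result = await embedFn(text);
    cache.set(text, result);
    return result;
  };
}
```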

Pro Tips

  • Batch embeddings: Send text: ["doc1", "doc2", "doc3"] to reduce overhead—returns array of vectors
  • Model routing: Maintain a priority list (e.g., llama-3-70b → llama-3-8b → mistral-7b) for graceful degradation
  • Response caching: Cache 200 responses in Cloudflare KV or R2 with TTL=3600 for identical prompts
  • Structured output: Add response_format: { type: "json_object" } (where supported) to force valid JSON without prompt engineering
  • Debug headers: Log CF-Ray ID from response headers for Cloudflare support tickets
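The model-routing tip can be sketched as a simple fallback chain. The runModel argument stands in for an HTTP client like generateText above, and the default model list is illustrative; confirm names against the catalog:

```javascript
// model-router.js — try models in priority order, falling through on failure.
// Useful when a large model is overloaded and a smaller one is acceptable.
async function generateWithFallback(prompt, runModel, models = [
  '@cf/meta/llama-3-8b-instruct',
  '@cf/mistral/mistral-7b-instruct-v0.1',
]) {
  let lastError;
  for (const model of models) {
    try {
      return { model, ...(await runModel(model, prompt)) }; // record which model answered
    } catch (err) {
      lastError = err; // e.g. 503 model overloaded; try the next one
    }
  }
  throw lastError ?? new Error('no models configured');
}
```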

Resources

Official Sources:

  • Cloudflare AI Documentation (https://developers.cloudflare.com/workers-ai/)
  • Cloudflare AI REST API Reference (https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)
  • Cloudflare AI Models Catalog (https://developers.cloudflare.com/workers-ai/models/)
  • Cloudflare API Token Management (https://developers.cloudflare.com/fundamentals/api/get-started/create-token/)

High-Signal Community References:

  • Cloudflare Workers Examples - AI (https://github.com/cloudflare/workers-sdk/tree/main/templates) (official templates)
  • Hugging Face Inference API patterns (https://huggingface.co/docs/api-inference/index) (conceptually similar REST patterns)

Final Thoughts

Direct HTTP access to Cloudflare AI removes infrastructure friction while keeping full control. The envelope pattern (success, result, errors) is consistent across Cloudflare's API surface—learn it once, apply everywhere.

Next step: Deploy a minimal endpoint that accepts user prompts, calls Cloudflare AI with your wrapper, and returns sanitized responses. Add rate limiting and logging. Then expand to embeddings for RAG or image generation for dynamic assets.

Preview Card Data

  • previewTitle: Call Cloudflare AI Models via HTTP
  • previewDescription: Production-ready patterns for invoking Llama, Mistral, and embedding models through direct REST API calls with authentication, error handling, and streaming.
  • previewDateText: Published now
  • previewReadTime: 8 min read
  • previewTags: cloudflare, ai, rest api, serverless, edge computing

Image Plan

  • hero image idea: Abstract network diagram showing HTTP request flowing from client through Cloudflare edge to AI model inference, with response path highlighted; dark mode compatible with cyan/orange accent colors
  • inline visual 1: Code snippet screenshot of the fetch request with headers highlighted, showing Authorization bearer token pattern
  • inline visual 2: Response envelope JSON structure diagram—nested objects showing success, result, usage fields
  • inline visual 3: Architecture flowchart: User → Your API → Cloudflare AI → Model inference → Response, with retry/fallback branch
  • alt text intent: Technical diagrams emphasizing HTTP flow, authentication headers, and API response structure for accessibility
