
Amit Hariyale
Full Stack Web Developer, Gigawave

Calling AI models on Cloudflare through HTTP requests matters in real projects because weak implementation choices create hard-to-debug failures and an inconsistent user experience.
This guide uses focused, production-oriented steps and code examples grounded in official references.
We start with minimal setup, then move to implementation patterns and validation checkpoints.
The structure is a step-by-step architecture: setup, core implementation, validation, and performance checks.
Treat calling AI models on Cloudflare through HTTP requests as an iterative build: baseline first, then reliability and performance hardening.
You need AI inference without managing GPUs, containers, or cold starts. Cloudflare's AI platform promises exactly that—models running at the edge, milliseconds from your users. But the docs scatter the details across Workers bindings, REST API, and model catalogs. You just want a clean HTTP call that returns embeddings or completions.
This guide cuts through the fragmentation. You'll set up authentication, construct proper requests, handle responses, and deploy with confidence. No Workers bindings required—pure HTTP for maximum flexibility.
Cloudflare AI provides serverless inference for popular open models (Llama, Mistral, BERT, Whisper, SDXL) through two interfaces: Workers AI bindings (JavaScript-native) and direct REST API calls. This guide focuses on the REST API—ideal when your stack isn't Cloudflare-native, when you need to call from external services, or when you want explicit control over request/response handling.
Prerequisites:
- A Cloudflare account with Workers AI enabled
- An API token with Workers AI permissions (creation steps below)
- A runtime with fetch support (Node.js 18+, Deno, Bun, or a browser), or curl for quick tests
Key failure points when calling Cloudflare AI via HTTP:
| Symptom | Root Cause |
|---|---|
| 401 Unauthorized | Token lacks Workers AI scope or wrong account ID |
| 404 Not Found | Model name typo or unavailable in your region |
| 400 Bad Request | Missing required fields (prompt, messages) or wrong content-type |
| Empty/truncated responses | Missing stream: false or mishandling chunked encoding |
| Unexpected latency | Calling a full-precision model when a quantized variant such as @cf/meta/llama-2-7b-chat-int8 would be faster |
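Those status codes can be turned into actionable hints at the call site. A hedged sketch (the diagnose helper and its messages are illustrative, not part of any Cloudflare SDK; the hints mirror the table above):

```javascript
// diagnose.js — map HTTP status codes to likely root causes from the table
const HINTS = {
  401: 'Check that the token has the Workers AI scope and the account ID is correct',
  404: 'Check the model name for typos and confirm availability via the models/search endpoint',
  400: 'Check required fields (prompt/messages) and the Content-Type header',
  429: 'Rate limited—back off before retrying',
};

export function diagnose(status) {
  // Fall back to a generic message for statuses the table does not cover
  return HINTS[status] || `Unexpected status ${status}; inspect the response body`;
}
```

Logging `diagnose(response.status)` alongside the raw error body shortens the debugging loop considerably.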
Real project symptoms: Your edge function works locally but fails in production because environment variables aren't injected. Your streaming parser chokes on SSE format. Your retry logic hammers the API and triggers rate limits.
Chosen approach: Direct REST API with explicit authentication
Why this over Workers bindings:
- Works from any runtime or language that can make HTTPS calls, not just Workers
- No Cloudflare-native deployment required—call from an existing backend, cron job, or CI pipeline
- Explicit control over request construction, response parsing, and error handling
Trade-off: You manage serialization and error handling manually. Workers bindings handle some of this automatically.
Alternative considered: Workers AI bindings—use when 100% Cloudflare-native and you want automatic request batching.
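For contrast, a minimal sketch of the bindings route, assuming a Worker with an AI binding named AI configured in wrangler.toml; if your stack fits this shape, it is the lower-friction option:

```javascript
// worker.js — sketch of the Workers-bindings alternative, for contrast with raw HTTP.
// Assumes an AI binding declared in wrangler.toml: [ai] binding = "AI"
const worker = {
  async fetch(request, env) {
    // env.AI.run handles auth, routing, and serialization—no token or URL needed
    const result = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      messages: [{ role: 'user', content: 'Explain edge computing in one sentence.' }],
    });
    return Response.json(result);
  },
};

export default worker;
```

Everything else in this guide sticks to the REST API, which needs no Worker at all.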
Navigate to Cloudflare Dashboard → Workers & Pages → AI → Get started. Copy your Account ID from the right sidebar.
Create an API token: My Profile → API Tokens → Create Token → Use the "Workers AI (Beta)" template, or a custom token with:
- Account → Workers AI → Read permission, scoped to the target account
Store these securely—never commit to version control.
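A minimal sketch of failing fast when credentials are missing, assuming Node-style process.env and the CF_ACCOUNT_ID/CF_API_TOKEN names used in the snippets below (loadCloudflareConfig is an illustrative helper, not an official API):

```javascript
// config.js — fail fast at startup instead of getting a 401/404 at request time
export function loadCloudflareConfig(env = process.env) {
  const accountId = env.CF_ACCOUNT_ID;
  const apiToken = env.CF_API_TOKEN;

  const missing = [];
  if (!accountId) missing.push('CF_ACCOUNT_ID');
  if (!apiToken) missing.push('CF_API_TOKEN');
  if (missing.length > 0) {
    // Surfaces the "works locally, fails in production" class of bugs immediately
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { accountId, apiToken };
}
```

Call this once at boot so a misconfigured deployment fails loudly rather than at the first user request.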
List available models to confirm your target exists and get the exact model name:
```shell
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/models/search \
  -H "Authorization: Bearer {API_TOKEN}"
```

Look for the name field (e.g., @cf/meta/llama-3-8b-instruct, @cf/baai/bge-base-en-v1.5).
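If you would rather do this check in code, a sketch assuming the standard envelope with result as an array of model objects each carrying a name field (listModels is an illustrative helper):

```javascript
// list-models.js — search the catalog and return model names
export async function listModels(accountId, apiToken) {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/models/search`;
  const response = await fetch(url, {
    headers: { 'Authorization': `Bearer ${apiToken}` },
  });
  const data = await response.json();
  if (!data.success) {
    throw new Error(`Model search failed: ${JSON.stringify(data.errors)}`);
  }
  // Each entry carries a name like '@cf/meta/llama-3-8b-instruct'
  return data.result.map(m => m.name);
}
```

Validating the exact model name up front eliminates the 404 class of failures from the table above.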
Base URL pattern:

```
POST https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL_NAME}
```

Required headers:
- Authorization: Bearer {API_TOKEN}
- Content-Type: application/json
For chat/completion models, structure your payload with a messages array:
```javascript
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Explain edge computing in one sentence.' }
      ],
      stream: false, // Set true for SSE streaming
      max_tokens: 256
    })
  }
);
```

Parse the JSON envelope. Cloudflare wraps model output in a standard response structure:
```javascript
const data = await response.json();

if (!data.success) {
  // Check data.errors for details
  throw new Error(data.errors?.[0]?.message || 'Inference failed');
}

const generatedText = data.result.response; // For chat models
// OR for embeddings: data.result.data[0]
```

Wrap with timeout, retry, and circuit breaker logic appropriate to your runtime.
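A sketch of the timeout part using AbortController (fetchWithTimeout is an illustrative helper; the 30-second default is an assumption, not a Cloudflare limit):

```javascript
// timeout.js — abort slow inference calls instead of hanging
export async function fetchWithTimeout(url, options = {}, timeoutMs = 30_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Passing the signal makes fetch reject with an AbortError on timeout
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer); // Always clear so the process can exit cleanly
  }
}
```

Swap this in wherever the snippets below call fetch directly if your runtime has no ambient request timeout.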
Snippet 1: Complete Text Generation Client
```javascript
// ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function generateText(prompt, options = {}) {
  const model = options.model || '@cf/meta/llama-3-8b-instruct';
  const maxTokens = options.maxTokens || 512;

  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'user', content: prompt }
      ],
      max_tokens: maxTokens,
      temperature: options.temperature ?? 0.7,
      stream: false
    })
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`HTTP ${response.status}: ${error}`);
  }

  const data = await response.json();

  if (!data.success) {
    throw new Error(`API error: ${data.errors?.map(e => e.message).join(', ')}`);
  }

  return {
    text: data.result.response,
    usage: data.result.usage // { prompt_tokens, completion_tokens, total_tokens }
  };
}
```

Snippet 2: Embedding Generation
```javascript
// embeddings.js
export async function createEmbedding(text, model = '@cf/baai/bge-base-en-v1.5') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ text })
  });

  const data = await response.json();

  if (!data.success) {
    throw new Error(`Embedding failed: ${JSON.stringify(data.errors)}`);
  }

  return {
    embedding: data.result.data[0], // Float32 array
    shape: data.result.shape // [1, 768] for bge-base
  };
}
```

Snippet 3: Streaming Response Handler
```javascript
// streaming.js
export async function* streamText(prompt, model = '@cf/meta/llama-3-8b-instruct') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      stream: true // Enable Server-Sent Events
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n').filter(line => line.trim());

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const json = line.slice(6);
        if (json === '[DONE]') return;

        const parsed = JSON.parse(json);
        yield parsed.response; // Incremental text token
      }
    }
  }
}
```

Critical lines decoded:
| Line | Purpose | Failure Mode |
|---|---|---|
| stream: false | Forces complete response vs. SSE chunks | Omitting causes parser errors if you expect JSON |
| data.result.response | Extracts generated text from envelope | Model changes may nest this differently; always verify shape |
| data.result.data[0] | Embedding vector for first (only) input | Batching multiple texts returns array of arrays |
| line.startsWith('data: ') | SSE protocol parsing | Some proxies buffer SSE; disable buffering or use fetch directly |
What can go wrong: The streaming parser assumes well-formed SSE. If Cloudflare returns an error mid-stream (rate limit, model overload), you'll get a non-JSON line that JSON.parse throws on. Wrap the parse in try/catch and validate parsed.response exists before yielding.
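Applying that advice, one way to make the per-line handling tolerant (parseSSELine is an illustrative helper you would call from the streaming loop in place of the bare JSON.parse):

```javascript
// safe-parse.js — tolerant SSE line handling for the streaming loop
export function parseSSELine(line) {
  if (!line.startsWith('data: ')) return null;
  const json = line.slice(6);
  if (json === '[DONE]') return { done: true };

  try {
    const parsed = JSON.parse(json);
    // Only surface well-formed incremental tokens
    if (typeof parsed.response === 'string') {
      return { token: parsed.token ?? parsed.response };
    }
  } catch {
    // Non-JSON line mid-stream (e.g. an error payload); skip rather than crash
  }
  return null;
}
```

In the generator, yield only when the result has a token, and return when it has done; malformed lines are silently skipped instead of aborting the whole stream.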
| Scenario | Behavior | Mitigation |
|---|---|---|
| Model temporarily unavailable | 503 with errors.code: 1001 | Exponential backoff, fallback model |
| Prompt exceeds context window | 400 with context length exceeded | Pre-tokenize and truncate using a token-count estimate for the model's tokenizer |
| Empty or whitespace prompt | 400 bad request | Validate prompt.trim().length > 0 before call |
| Concurrent high-volume calls | 429 rate limit | Implement token bucket, queue, or request batching |
| Binary/image input for vision models | Requires base64 encoding with data URI | Use image array field, not raw binary body |
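For the 429 and 503 rows, a backoff sketch that keys off the HTTP ${status} error messages thrown by the generateText wrapper above (withRetry is an illustrative helper; tune attempts and delays to your traffic):

```javascript
// retry.js — exponential backoff with jitter for transient 429/503 responses
export async function withRetry(fn, { attempts = 4, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only retry transient statuses; rethrow everything else immediately
      if (!/HTTP (429|503)/.test(String(err.message))) throw err;
      const delay = baseDelayMs * 2 ** i + Math.random() * 100; // jitter avoids thundering herds
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Usage: await withRetry(() => generateText('Summarize this ticket')). Cap attempts low; unbounded retries are exactly how the "retry logic hammers the API" failure happens.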
Direct HTTP access to Cloudflare AI removes infrastructure friction while keeping full control. The envelope pattern (success, result, errors) is consistent across Cloudflare's API surface—learn it once, apply everywhere.
Next step: Deploy a minimal endpoint that accepts user prompts, calls Cloudflare AI with your wrapper, and returns sanitized responses. Add rate limiting and logging. Then expand to embeddings for RAG or image generation for dynamic assets.
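That endpoint can start as small as this sketch around Node's built-in http module; the /generate route, status codes, and injection of the generateText wrapper from Snippet 1 are assumptions for illustration:

```javascript
// server.js — minimal prompt endpoint; pass in the generateText wrapper from Snippet 1
import http from 'node:http';

export function createAIServer(generateText) {
  return http.createServer(async (req, res) => {
    if (req.method !== 'POST' || req.url !== '/generate') {
      res.writeHead(404).end();
      return;
    }
    let body = '';
    for await (const chunk of req) body += chunk;

    try {
      const { prompt } = JSON.parse(body);
      if (!prompt || !prompt.trim()) {
        res.writeHead(400, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ error: 'prompt is required' }));
        return;
      }
      const { text, usage } = await generateText(prompt);
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ text, usage }));
    } catch {
      // Don't leak upstream error details to clients
      res.writeHead(502, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'inference failed' }));
    }
  });
}
```

Start it with createAIServer(generateText).listen(3000), then put rate limiting and request logging in front before exposing it publicly.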