
How to Use AI Models in Cloudflare Through HTTP Requests

Amit Hariyale

Full Stack Web Developer, Gigawave

8 min read · June 14, 2025

How to use AI models in Cloudflare through HTTP requests matters in real projects: weak implementation choices create hard-to-debug failures and an inconsistent user experience.

This guide uses focused, production-oriented steps and code examples grounded in official references.

Key Concepts Covered

  • Cloudflare Workers AI REST API
  • Bearer token authentication
  • Model inference envelope pattern
  • Server-Sent Events (SSE) streaming
  • Token usage tracking
  • Edge deployment patterns

Context Setup

We start with minimal setup, then move to implementation patterns and validation checkpoints for using AI models in Cloudflare through HTTP requests.

Problem Breakdown

  • Unclear setup path for calling AI models in Cloudflare through HTTP requests
  • Inconsistent implementation patterns
  • Missing validation for edge cases

Solution Overview

Apply a step-by-step architecture: setup, core implementation, validation, and performance checks.

Step 1: Define prerequisites and expected behavior for calling AI models in Cloudflare through HTTP requests.


Step 2: Implement a minimal working baseline.


Step 3: Add robust handling for non-happy paths.


Additional Implementation Notes

  • Step 4: Improve structure for reuse and readability.
  • Step 5: Validate with realistic usage scenarios.

Best Practices

  • Keep implementation modular and testable
  • Use one clear source of truth for configuration
  • Validate behavior before optimization

Pro Tips

  • Prefer concise code snippets with clear intent
  • Document edge cases and trade-offs
  • Use official docs for API-level decisions

Final Thoughts

Treat calling AI models in Cloudflare through HTTP requests as an iterative build: baseline first, then reliability and performance hardening.

Full Generated Content (Unabridged)


Blog Identity

  • title: How to Use AI Models in Cloudflare Through HTTP Requests
  • slug: cloudflare-ai-http-requests
  • primary topic keyword: Cloudflare AI Workers
  • target stack: Cloudflare Workers, JavaScript/TypeScript, REST API

SEO Metadata

  • seoTitle: Call Cloudflare AI Models via HTTP: A Practical Guide
  • metaDescription: Learn to invoke Cloudflare's AI models through direct HTTP requests. Step-by-step setup, authentication, and working code examples for production use.
  • suggestedTags: cloudflare workers, ai inference, http api, serverless, machine learning, rest api, edge computing
  • suggestedReadTime: 8 min

Hero Hook

You need AI inference without managing GPUs, containers, or cold starts. Cloudflare's AI platform promises exactly that—models running at the edge, milliseconds from your users. But the docs scatter the details across Workers bindings, REST API, and model catalogs. You just want a clean HTTP call that returns embeddings or completions.

This guide cuts through the fragmentation. You'll set up authentication, construct proper requests, handle responses, and deploy with confidence. No Workers bindings required—pure HTTP for maximum flexibility.

Context Setup

Cloudflare AI provides serverless inference for popular open models (Llama, Mistral, BERT, Whisper, SDXL) through two interfaces: Workers AI bindings (JavaScript-native) and direct REST API calls. This guide focuses on the REST API—ideal when your stack isn't Cloudflare-native, when you need to call from external services, or when you want explicit control over request/response handling.

Prerequisites:

  • Cloudflare account (free tier works)
  • API token with Workers AI permissions (the Workers AI template, or a custom token with Account:Read + Workers AI:Edit)
  • Account ID from your Cloudflare dashboard
  • curl or HTTP client of choice (we'll use fetch examples)

Problem Breakdown

Key failure points when calling Cloudflare AI via HTTP:

Symptom → Root Cause

  • 401 Unauthorized → token lacks the Workers AI scope, or wrong account ID
  • 404 Not Found → model name typo, or model unavailable in your region
  • 400 Bad Request → missing required fields (prompt, messages) or wrong content-type
  • Empty/truncated responses → missing stream: false, or mishandled chunked encoding
  • Unexpected latency → running a full-precision model where a quantized variant (e.g. @cf/meta/llama-2-7b-chat-int8) would serve

Real project symptoms: Your edge function works locally but fails in production because environment variables aren't injected. Your streaming parser chokes on SSE format. Your retry logic hammers the API and triggers rate limits.
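The failure modes above reduce to a small set of recovery decisions. As a sketch (the helper and its action names are this guide's own convention, not anything from Cloudflare's API):

```javascript
// classify-error.js — map an HTTP status from the Cloudflare AI endpoint to a
// recovery action, mirroring the symptom table above. Names are illustrative.
function classifyAiError(status) {
  if (status === 401 || status === 403) return 'check-token';   // scope or account ID problem
  if (status === 404) return 'check-model-name';                // typo or unavailable model
  if (status === 400) return 'fix-request';                     // missing fields, wrong content-type
  if (status === 429 || status >= 500) return 'retry-backoff';  // transient: back off and retry
  return 'inspect';                                             // anything else: log and investigate
}
```

Routing on this result keeps retry logic from hammering the API on errors that will never succeed.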

Solution Overview

Chosen approach: Direct REST API with explicit authentication

Why this over Workers bindings:

  • Portability: Same code runs in Deno, Node, Bun, or external orchestrators
  • Debugging: Full visibility into request/response headers
  • Composition: Easy to wrap with caching, retries, or fallbacks

Trade-off: You manage serialization and error handling manually. Workers bindings handle some of this automatically.

Alternative considered: Workers AI bindings—use when 100% Cloudflare-native and you want automatic request batching.

Implementation Steps

Step 1: Locate Your Credentials

Navigate to Cloudflare Dashboard → Workers & Pages → AI → Get started. Copy your Account ID from the right sidebar.

Create an API token: My Profile → API Tokens → Create Token → Use "Workers AI (Beta)" template, or custom token with:

  • Account: Cloudflare AI:Edit
  • Account: Account:Read

Store these securely—never commit to version control.
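One way to load and sanity-check these credentials at startup, so a missing variable fails loudly instead of surfacing later as a confusing 401 (the CF_ACCOUNT_ID and CF_API_TOKEN names are this guide's convention):

```javascript
// config.js — fail fast if credentials are missing instead of getting a
// confusing 401/404 at request time.
function loadCloudflareConfig(env = process.env) {
  const accountId = env.CF_ACCOUNT_ID;
  const apiToken = env.CF_API_TOKEN;
  if (!accountId || !apiToken) {
    throw new Error('Set CF_ACCOUNT_ID and CF_API_TOKEN in the environment');
  }
  return { accountId, apiToken };
}
```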

Step 2: Verify Model Availability

List available models to confirm your target exists and get the exact model name:

implementation-steps-1.sh
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/models/search \
  -H "Authorization: Bearer {API_TOKEN}"

Look for name field (e.g., @cf/meta/llama-3-8b-instruct, @cf/baai/bge-base-en-v1.5).
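The search endpoint uses the same success/result envelope as inference. A small sketch for pulling out just the model names (the result array shape is assumed from the general envelope pattern; verify it against the live response):

```javascript
// list-models.js — given a parsed /ai/models/search response body, return the
// model names, optionally filtered by a substring.
function extractModelNames(body, filter = '') {
  if (!body.success) {
    throw new Error(`Model search failed: ${JSON.stringify(body.errors)}`);
  }
  return body.result
    .map(m => m.name)                       // each entry carries a `name` field
    .filter(name => name.includes(filter)); // e.g. filter = 'llama'
}
```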

Step 3: Construct the Inference Request

Base URL pattern:

implementation-steps-2.text
POST https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL_NAME}

Required headers:

  • Authorization: Bearer {TOKEN}
  • Content-Type: application/json

Step 4: Implement Text Generation

For chat/completion models, structure your payload with messages array:

implementation-steps-3.js
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Explain edge computing in one sentence.' }
      ],
      stream: false, // Set true for SSE streaming
      max_tokens: 256
    })
  }
);

Step 5: Handle the Response

Parse the JSON envelope. Cloudflare wraps model output in a standard response structure:

implementation-steps-4.js
const data = await response.json();

if (!data.success) {
  // Check data.errors for details
  throw new Error(data.errors?.[0]?.message || 'Inference failed');
}

const generatedText = data.result.response; // For chat models
// OR for embeddings: data.result.data[0]

Step 6: Add Production Hardening

Wrap with timeout, retry, and circuit breaker logic appropriate to your runtime.
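A minimal sketch of that hardening layer: a timeout via AbortController plus bounded retries with exponential backoff, retrying only transient failures. The retry count and backoff constants are illustrative, not recommendations from Cloudflare:

```javascript
// hardened-fetch.js — wrap fetch with a per-attempt timeout and retries.
// Only 429 and 5xx are retried; other 4xx will not improve on retry.
async function fetchWithRetry(url, init = {}, { retries = 3, timeoutMs = 10_000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, { ...init, signal: controller.signal });
      if (response.ok || (response.status < 500 && response.status !== 429)) {
        return response; // success, or a non-retryable client error
      }
      if (attempt === retries) return response; // out of retries: surface it
    } catch (err) {
      if (attempt === retries) throw err; // timeout or network error on final attempt
    } finally {
      clearTimeout(timer);
    }
    await new Promise(r => setTimeout(r, 2 ** attempt * 250)); // 250ms, 500ms, 1s...
  }
}
```

Drop this in wherever the snippets call fetch directly; a circuit breaker would sit one layer above, tripping after repeated final failures.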

Code Snippets

Snippet 1: Complete Text Generation Client

code-snippet-1.js
// ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function generateText(prompt, options = {}) {
  const model = options.model || '@cf/meta/llama-3-8b-instruct';
  const maxTokens = options.maxTokens || 512;

  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'user', content: prompt }
      ],
      max_tokens: maxTokens,
      temperature: options.temperature ?? 0.7,
      stream: false
    })
  });

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`HTTP ${response.status}: ${error}`);
  }

  const data = await response.json();

  if (!data.success) {
    throw new Error(`API error: ${data.errors?.map(e => e.message).join(', ')}`);
  }

  return {
    text: data.result.response,
    usage: data.result.usage // { prompt_tokens, completion_tokens, total_tokens }
  };
}

Snippet 2: Embedding Generation

code-snippet-2.js
// embeddings.js
// Same environment credentials as ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function createEmbedding(text, model = '@cf/baai/bge-base-en-v1.5') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ text })
  });

  const data = await response.json();

  if (!data.success) {
    throw new Error(`Embedding failed: ${JSON.stringify(data.errors)}`);
  }

  return {
    embedding: data.result.data[0], // Float32 array
    shape: data.result.shape // [1, 768] for bge-base
  };
}

Snippet 3: Streaming Response Handler

code-snippet-3.js
// streaming.js
// Same environment credentials as ai-client.js
const CF_ACCOUNT_ID = process.env.CF_ACCOUNT_ID;
const CF_API_TOKEN = process.env.CF_API_TOKEN;

export async function* streamText(prompt, model = '@cf/meta/llama-3-8b-instruct') {
  const url = `https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/ai/run/${model}`;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      stream: true // Enable Server-Sent Events
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n').filter(line => line.trim());

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const json = line.slice(6);
        if (json === '[DONE]') return;

        const parsed = JSON.parse(json);
        yield parsed.response; // Incremental text token
      }
    }
  }
}

Code Explanation

Critical lines decoded:

  • stream: false → forces a complete response instead of SSE chunks; omitting it causes parser errors if you expect plain JSON
  • data.result.response → extracts generated text from the envelope; model changes may nest this differently, so always verify the shape
  • data.result.data[0] → embedding vector for the first (only) input; batching multiple texts returns an array of arrays
  • line.startsWith('data: ') → SSE protocol parsing; some proxies buffer SSE, so disable buffering or use fetch directly

What can go wrong: The streaming parser assumes well-formed SSE. If Cloudflare returns an error mid-stream (rate limit, model overload), you'll get a non-JSON line that JSON.parse throws on. Wrap the parse in try/catch and validate parsed.response exists before yielding.
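A defensive version of that parse step, tolerating malformed lines instead of letting one bad chunk kill the generator (the event shape with a top-level response field follows the streaming snippet above; treat it as an assumption to verify against the live stream):

```javascript
// sse-parse.js — parse one SSE line into a token, a done marker, or null.
function parseSseLine(line) {
  if (!line.startsWith('data: ')) return null;      // comments, blank keep-alives
  const payload = line.slice(6);
  if (payload === '[DONE]') return { done: true };  // end-of-stream sentinel
  try {
    const parsed = JSON.parse(payload);
    if (typeof parsed.response !== 'string') return null; // unexpected shape
    return { token: parsed.response };
  } catch {
    return null; // non-JSON line (e.g. mid-stream error text): skip, optionally log
  }
}
```

In the generator loop, yield only when parseSseLine returns a token and return when it reports done.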

Validation Checklist

  • [ ] curl list models returns 200 with model array
  • [ ] Text generation returns success: true and non-empty result.response
  • [ ] Token usage field present in response (result.usage.total_tokens)
  • [ ] Embedding vector length matches model spec (768 for bge-base, 1024 for large)
  • [ ] Streaming endpoint yields incremental tokens, terminates on [DONE]
  • [ ] 401/403 errors resolve with token scope check in dashboard
  • [ ] Response time under 2s for simple prompts (indicates edge routing works)
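The middle checks in this list can be automated with a small assertion helper (field names follow the envelope shown earlier; this helper is a sketch, not part of any SDK):

```javascript
// validate-envelope.js — structural checks on a chat-completion response body
// before trusting it downstream. Returns a list of problems; empty means pass.
function validateChatEnvelope(body) {
  const problems = [];
  if (body.success !== true) problems.push('success !== true');
  if (typeof body.result?.response !== 'string' || body.result.response.length === 0) {
    problems.push('missing or empty result.response');
  }
  if (typeof body.result?.usage?.total_tokens !== 'number') {
    problems.push('missing result.usage.total_tokens');
  }
  return problems;
}
```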

Edge Cases

  • Model temporarily unavailable → 503 with errors.code: 1001; mitigate with exponential backoff and a fallback model
  • Prompt exceeds context window → 400 with "context length exceeded"; pre-tokenize, truncate with a tiktoken estimate
  • Empty or whitespace prompt → 400 bad request; validate prompt.trim().length > 0 before the call
  • Concurrent high-volume calls → 429 rate limit; implement a token bucket, queue, or request batching
  • Binary/image input for vision models → requires base64 encoding with a data URI; use the image array field, not a raw binary body
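The empty-prompt and context-window mitigations above fit in one pre-flight check. The 4-characters-per-token heuristic here is a rough estimate, not a tokenizer, and the default limit is illustrative:

```javascript
// validate-prompt.js — reject obviously bad prompts before spending an API call.
function validatePrompt(prompt, maxContextTokens = 4096) {
  if (typeof prompt !== 'string' || prompt.trim().length === 0) {
    return { ok: false, reason: 'empty prompt' };
  }
  const estimatedTokens = Math.ceil(prompt.length / 4); // ~4 chars/token heuristic
  if (estimatedTokens > maxContextTokens) {
    return { ok: false, reason: `~${estimatedTokens} tokens exceeds ${maxContextTokens}` };
  }
  return { ok: true, estimatedTokens };
}
```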

Best Practices

  • Do store credentials in environment variables, never in source
  • Do validate response envelope before accessing result fields
  • Do implement request timeouts (fetch has no built-in timeout; use an AbortController)
  • Do cache embeddings for identical inputs—computation is expensive
  • Don't retry 4xx errors blindly—fix the request first
  • Don't assume all models use messages format—embeddings use text, image gen uses prompt
  • Don't ignore usage fields—monitor costs and optimize context windows
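For the embedding-caching point above, a minimal in-memory sketch; a real deployment might back this with Cloudflare KV and a TTL instead of a plain Map, and embedFn stands in for something like the createEmbedding function from Snippet 2:

```javascript
// embedding-cache.js — memoize embeddings by input text so identical inputs
// hit the API only once.
function makeCachedEmbedder(embedFn) {
  const cache = new Map();
  return async function cachedEmbed(text) {
    if (cache.has(text)) return cache.get(text); // cache hit: no API call
    const result = await embedFn(text);
    cache.set(text, result);
    return result;
  };
}
```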

Pro Tips

  • Batch embeddings: Send text: ["doc1", "doc2", "doc3"] to reduce overhead—returns array of vectors
  • Model routing: Maintain a priority list (e.g., llama-3-70b → llama-3-8b → mistral-7b) for graceful degradation
  • Response caching: Cache 200 responses in Cloudflare KV or R2 with TTL=3600 for identical prompts
  • Structured output: Add response_format: { type: "json_object" } (where supported) to force valid JSON without prompt engineering
  • Debug headers: Log CF-Ray ID from response headers for Cloudflare support tickets
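The model-routing tip can be sketched as a simple fallback chain. The runModel argument stands in for an HTTP client like generateText above, and the default model list is illustrative; confirm names against the catalog:

```javascript
// model-router.js — try models in priority order, falling through on failure.
// Useful when a large model is overloaded and a smaller one is acceptable.
async function generateWithFallback(prompt, runModel, models = [
  '@cf/meta/llama-3-8b-instruct',
  '@cf/mistral/mistral-7b-instruct-v0.1',
]) {
  let lastError;
  for (const model of models) {
    try {
      return { model, ...(await runModel(model, prompt)) }; // record which model answered
    } catch (err) {
      lastError = err; // e.g. 503 model overloaded; try the next one
    }
  }
  throw lastError ?? new Error('no models configured');
}
```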

Resources

Official Sources:

  • Cloudflare AI Documentation (https://developers.cloudflare.com/workers-ai/)
  • Cloudflare AI REST API Reference (https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)
  • Cloudflare AI Models Catalog (https://developers.cloudflare.com/workers-ai/models/)
  • Cloudflare API Token Management (https://developers.cloudflare.com/fundamentals/api/get-started/create-token/)

High-Signal Community References:

  • Cloudflare Workers Examples - AI (https://github.com/cloudflare/workers-sdk/tree/main/templates) (official templates)
  • Hugging Face Inference API patterns (https://huggingface.co/docs/api-inference/index) (conceptually similar REST patterns)

Final Thoughts

Direct HTTP access to Cloudflare AI removes infrastructure friction while keeping full control. The envelope pattern (success, result, errors) is consistent across Cloudflare's API surface—learn it once, apply everywhere.

Next step: Deploy a minimal endpoint that accepts user prompts, calls Cloudflare AI with your wrapper, and returns sanitized responses. Add rate limiting and logging. Then expand to embeddings for RAG or image generation for dynamic assets.

Preview Card Data

  • previewTitle: Call Cloudflare AI Models via HTTP
  • previewDescription: Production-ready patterns for invoking Llama, Mistral, and embedding models through direct REST API calls with authentication, error handling, and streaming.
  • previewDateText: Published now
  • previewReadTime: 8 min read
  • previewTags: cloudflare, ai, rest api, serverless, edge computing

Image Plan

  • hero image idea: Abstract network diagram showing HTTP request flowing from client through Cloudflare edge to AI model inference, with response path highlighted; dark mode compatible with cyan/orange accent colors
  • inline visual 1: Code snippet screenshot of the fetch request with headers highlighted, showing Authorization bearer token pattern
  • inline visual 2: Response envelope JSON structure diagram—nested objects showing success, result, usage fields
  • inline visual 3: Architecture flowchart: User → Your API → Cloudflare AI → Model inference → Response, with retry/fallback branch
  • alt text intent: Technical diagrams emphasizing HTTP flow, authentication headers, and API response structure for accessibility
