
Running AI Agents in Cloudflare Workers

Amit Hariyale

Full Stack Web Developer, Gigawave

8 min read · April 16, 2026

Running AI agents in Cloudflare Workers matters in real projects because weak implementation choices create hard-to-debug failures and an inconsistent user experience.

This guide uses focused, production-oriented steps and code examples grounded in official references.

Key Concepts Covered

  • Cloudflare Workers AI: Serverless GPU inference service running on Cloudflare's edge network
  • Durable Objects: Cloudflare's strongly consistent stateful compute primitive for Workers
  • AI binding: Runtime-injected client for Workers AI, providing authenticated access without API keys
  • Tool calling: Pattern where LLMs output structured function calls for external execution
  • Context window: Maximum token limit for model input

Context Setup

We start with minimal setup, then move to implementation patterns and validation checkpoints for running AI agents in Cloudflare Workers.

Problem Breakdown

  • Unclear setup path for running AI agents in Cloudflare Workers
  • Inconsistent implementation patterns
  • Missing validation for edge cases

Solution Overview

Apply a step-by-step architecture: setup, core implementation, validation, and performance checks.

Step 1: Define prerequisites and expected behavior.


Step 2: Implement a minimal working baseline.


Step 3: Add robust handling for non-happy paths.


Additional Implementation Notes

  • Step 4: Improve structure for reuse and readability.
  • Step 5: Validate with realistic usage scenarios.

Best Practices

  • Keep implementation modular and testable
  • Use one clear source of truth for configuration
  • Validate behavior before optimization

Pro Tips

  • Prefer concise code snippets with clear intent
  • Document edge cases and trade-offs
  • Use official docs for API-level decisions

Final Thoughts

Treat running AI agents in Cloudflare Workers as an iterative build: baseline first, then reliability and performance hardening.


Blog Identity

  • title: Running AI Agents in Cloudflare Workers
  • slug: run-ai-agents-cloudflare-workers
  • primary topic keyword: Cloudflare AI agents
  • target stack: Cloudflare Workers, AI/ML, TypeScript/JavaScript

SEO Metadata

  • seoTitle: Run AI Agents in Cloudflare Workers: A Practical Guide
  • metaDescription: Learn how to deploy and run AI agents directly on Cloudflare's edge network using Workers AI, with step-by-step implementation and real-world patterns.
  • suggestedTags: ["Cloudflare Workers", "AI agents", "edge computing", "Workers AI", "serverless AI", "LLM deployment"]
  • suggestedReadTime: 8 minutes

Hero Hook

You built a prototype AI agent that works perfectly on your laptop. Then you try to deploy it. Suddenly you're managing GPU instances, wrestling with cold starts, and watching your infrastructure bill spiral. The edge was supposed to solve this, but most platforms still force you to choose between latency and complexity.

Cloudflare Workers AI changes the calculation. You can run inference directly on their edge network—no containers to manage, no clusters to provision, and response times measured in milliseconds for users worldwide. This isn't a future promise; it's production-ready now with models from Meta, Mistral, and others available via API.

Context Setup

Cloudflare Workers AI provides serverless GPU inference at the edge. Instead of shipping model weights to your infrastructure, you call models hosted on Cloudflare's network. Your agent logic runs in a Worker (V8 isolate), which can chain multiple inference calls, maintain state via Durable Objects or KV, and respond to requests globally with minimal latency.

Prerequisites:

  • Cloudflare account with Workers AI enabled
  • Wrangler CLI installed (npm install -g wrangler)
  • Basic familiarity with TypeScript and fetch APIs
  • Understanding of LLM prompting patterns

Problem Breakdown

The deployment trap: Local AI development often uses Ollama, LM Studio, or direct API calls to OpenAI. These patterns break when you need global scale. Self-hosting requires GPU machines, model serving infrastructure, and operational expertise most teams lack.

Edge-specific failure points:

  • Cold starts on serverless GPU: Some platforms take 10-30 seconds to spin up inference containers
  • State management: Agents need memory across turns; naive implementations lose context
  • Model availability: Not all models run at the edge; you need to verify compatibility
  • Cost unpredictability: Per-token pricing varies dramatically; unbounded agent loops become expensive fast

Symptoms you'll recognize: Agents that time out, inconsistent response times across regions, ballooning infrastructure costs, or architectural complexity that slows iteration.
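
The cost-unpredictability symptom is worth guarding against explicitly. As a minimal sketch (the makeLoopGuard helper, its defaults, and its error messages are hypothetical, not part of any Cloudflare API), an agent loop can fail fast once a step or token budget is exhausted instead of burning budget silently:

```typescript
// Hypothetical sketch: hard cap on agent iterations and token spend so a
// runaway tool-calling loop fails fast instead of running up the bill.
export function makeLoopGuard(maxSteps = 5, maxTokens = 4000) {
  let steps = 0;
  let tokens = 0;
  // Call once per inference step with the tokens consumed by that step
  return (tokensThisStep: number): void => {
    steps += 1;
    tokens += tokensThisStep;
    if (steps > maxSteps) throw new Error(`Agent exceeded ${maxSteps} steps`);
    if (tokens > maxTokens) throw new Error(`Agent exceeded ${maxTokens} tokens`);
  };
}
```

Calling the guard at the top of each agent iteration turns an unbounded loop into a bounded, observable failure.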

Solution Overview

We'll build a stateful AI agent using Cloudflare Workers AI for inference, Durable Objects for conversation memory, and a simple tool-calling pattern. This approach keeps compute at the edge, maintains sub-100ms response times for cached regions, and scales to zero when idle.

Why this over alternatives:

  • vs. self-hosted: No infrastructure management, automatic global distribution
  • vs. centralized APIs (OpenAI, Anthropic): Lower latency for distributed users, no egress costs within Cloudflare's network
  • vs. other edge platforms: Native integration with Workers ecosystem (KV, R2, D1, Durable Objects)

Implementation Steps

Step 1: Initialize Project and Configure Workers AI

Create a new Worker project and enable AI binding.

implementation-steps-1.sh
mkdir cf-ai-agent && cd cf-ai-agent
wrangler init --yes

Add the AI binding to wrangler.toml:

implementation-steps-2.toml
name = "ai-agent"
main = "src/index.ts"
compatibility_date = "2024-06-14"

[ai]
binding = "AI"

Install dependencies:

implementation-steps-3.sh
npm install @cloudflare/ai

Step 2: Create Basic Inference Handler

Implement a Worker that calls Workers AI with the Llama 3.1 model.

implementation-steps-4.ts
// src/index.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    const messages = [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Explain edge computing in one sentence.' }
    ];

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages,
      max_tokens: 100,
    });

    return Response.json(response);
  },
};

Deploy and test:

implementation-steps-5.sh
wrangler deploy
curl https://ai-agent.YOUR_SUBDOMAIN.workers.dev

Step 3: Add Stateful Conversation with Durable Objects

Create a Durable Object to maintain conversation history across requests.

implementation-steps-6.ts
// src/AgentSession.ts
import { Ai } from '@cloudflare/ai';
import type { Env } from './index';

export class AgentSession {
  private state: DurableObjectState;
  private env: Env;
  private messages: Array<{ role: string; content: string }> = [];

  // Durable Objects receive env in the constructor, not in fetch()
  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
    // storage.get() is async; block other requests until history has loaded
    this.state.blockConcurrencyWhile(async () => {
      this.messages = (await this.state.storage.get('messages')) || [];
    });
  }

  async fetch(request: Request): Promise<Response> {
    const ai = new Ai(this.env.AI);
    const { message } = await request.json();

    this.messages.push({ role: 'user', content: message });

    // Keep context window manageable
    const contextWindow = this.messages.slice(-10);

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant with tool access.' },
        ...contextWindow,
      ],
      max_tokens: 500,
    });

    this.messages.push({ role: 'assistant', content: response.response });
    await this.state.storage.put('messages', this.messages);

    return Response.json({ reply: response.response });
  }
}

Update wrangler.toml:

implementation-steps-7.toml
[[durable_objects.bindings]]
name = "AGENT_SESSION"
class_name = "AgentSession"

[[migrations]]
tag = "v1"
new_classes = ["AgentSession"]

Step 4: Implement Tool Calling Pattern

Add structured tool use so your agent can interact with external systems.

implementation-steps-8.ts
// src/tools.ts
export const tools = [
  {
    name: 'get_weather',
    description: 'Get current weather for a location',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'City name' }
      },
      required: ['location']
    }
  }
];

export async function executeTool(name: string, args: any): Promise<string> {
  if (name === 'get_weather') {
    // Call external weather API or cache
    return `Weather in ${args.location}: 72°F, sunny`;
  }
  throw new Error(`Unknown tool: ${name}`);
}

Integrate into the agent loop:

implementation-steps-9.ts
// In AgentSession.fetch()
const toolPrompt = `You have access to tools: ${JSON.stringify(tools)}.
Respond with JSON: {"tool": "name", "args": {...}} or {"response": "..."}`;

const aiResponse = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: toolPrompt },
    ...contextWindow
  ],
  max_tokens: 500,
});

// Parse and handle tool calls
let parsed;
try {
  parsed = JSON.parse(aiResponse.response);
} catch {
  return Response.json({ reply: aiResponse.response });
}

if (parsed.tool) {
  const toolResult = await executeTool(parsed.tool, parsed.args);
  this.messages.push({ role: 'assistant', content: `Tool result: ${toolResult}` });
  // Re-run for final response
}

Code Snippets

Snippet 1: Basic Worker with AI Binding

  • filename: src/index.ts
  • language: typescript
  • purpose: Minimal viable AI inference in a Worker
  • code: |
code-snippet-1.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are helpful.' },
        { role: 'user', content: 'Hello!' }
      ],
      max_tokens: 100,
    });

    return Response.json(response);
  },
};

Snippet 2: Durable Object for Stateful Sessions

  • filename: src/AgentSession.ts
  • language: typescript
  • purpose: Persist conversation history across requests
  • code: |
code-snippet-2.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export class AgentSession {
  private state: DurableObjectState;
  private env: Env;
  private messages: Array<{ role: string; content: string }> = [];

  // Durable Objects receive env via the constructor, not fetch()
  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request: Request): Promise<Response> {
    const ai = new Ai(this.env.AI);
    const { message } = await request.json();

    const stored = await this.state.storage.get<Array<{ role: string; content: string }>>('messages');
    this.messages = stored || [];
    this.messages.push({ role: 'user', content: message });

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: this.messages.slice(-12),
      max_tokens: 500,
    });

    this.messages.push({ role: 'assistant', content: response.response });
    await this.state.storage.put('messages', this.messages);

    return Response.json({ reply: response.response, messages: this.messages.length });
  }
}

Snippet 3: Wrangler Configuration

  • filename: wrangler.toml
  • language: toml
  • purpose: Complete configuration with AI binding and Durable Objects
  • code: |
code-snippet-3.toml
name = "ai-agent"
main = "src/index.ts"
compatibility_date = "2024-06-14"

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "AGENT_SESSION"
class_name = "AgentSession"

[[migrations]]
tag = "v1"
new_classes = ["AgentSession"]

Snippet 4: Router to Durable Objects

  • filename: src/index.ts (updated)
  • language: typescript
  • purpose: Route requests to specific conversation sessions
  • code: |
code-snippet-4.ts
import { Ai } from '@cloudflare/ai';
import { AgentSession } from './AgentSession';

export interface Env {
  AI: Ai;
  AGENT_SESSION: DurableObjectNamespace<AgentSession>;
}

export { AgentSession };

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const sessionId = url.pathname.slice(1) || 'default';

    const id = env.AGENT_SESSION.idFromName(sessionId);
    const session = env.AGENT_SESSION.get(id);

    return session.fetch(request);
  },
};

Code Explanation

Key mechanics:

  • @cf/meta/llama-3.1-8b-instruct: Cloudflare-hosted Llama 3.1 8B. Runs on their edge GPUs; you pay per 1K tokens, not per compute time. The @cf/ prefix indicates a Cloudflare-hosted model.
  • env.AI binding: Injected at runtime by Cloudflare. No API keys to manage, no external network calls for authentication. The binding routes to the nearest inference endpoint automatically.
  • DurableObjectState.storage: Transactional key-value storage tied to the Durable Object instance. Survives Worker restarts, maintains consistency across concurrent requests to the same session.
  • idFromName(sessionId): Deterministic ID generation. Same session ID always routes to the same Durable Object instance, giving you sticky sessions without load balancer configuration.

What can go wrong:

  • Storage limits: Durable Objects have 1GB storage per instance. Unbounded conversation history will eventually fail. The slice(-12) in Snippet 2 is load-bearing—remove it and long conversations crash.
  • JSON parsing failures: LLMs don't always return valid JSON for tool calls. The try/catch in Step 4 is essential; without it, malformed responses throw unhandled exceptions.
  • Cold starts on Durable Objects: First request to a new session incurs ~50-100ms initialization. Design for this—don't assume sub-10ms for the first message in a new conversation.
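
The JSON-parsing failure mode above can be isolated into one defensive helper. This sketch assumes the {"tool": ...} / {"response": ...} convention from Step 4; the parseAgentReply helper and its return shape are illustrative, not a Workers AI API:

```typescript
// Hypothetical helper: interpret a raw model reply as either a tool call
// or plain text, never throwing on malformed JSON.
type ToolCall = { kind: 'tool'; name: string; args: Record<string, unknown> };
type PlainReply = { kind: 'text'; text: string };

export function parseAgentReply(raw: string): ToolCall | PlainReply {
  try {
    const parsed = JSON.parse(raw);
    if (parsed && typeof parsed.tool === 'string') {
      return { kind: 'tool', name: parsed.tool, args: parsed.args ?? {} };
    }
    if (parsed && typeof parsed.response === 'string') {
      return { kind: 'text', text: parsed.response };
    }
  } catch {
    // Not JSON at all; fall through and treat the reply as plain text
  }
  return { kind: 'text', text: raw };
}
```

Centralizing the fallback here means the agent loop only branches on kind and never sees an unhandled exception from a sloppy model reply.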

Validation Checklist

  • [ ] wrangler deploy completes without errors
  • [ ] curl https://your-worker.workers.dev returns valid JSON with response or reply field
  • [ ] Second request to same session ID returns faster (Durable Object warm)
  • [ ] Conversation history persists across 5+ sequential requests
  • [ ] Token usage visible in Cloudflare dashboard under Workers AI
  • [ ] 12+ message history still functions (context window management)
  • [ ] Invalid JSON from model is handled gracefully (no 500 errors)

Expected behavior: Sub-200ms response times for cached regions, automatic failover to available inference nodes, zero-downtime deployments on wrangler deploy.

Edge Cases

Long-running agent loops: If your agent chains multiple tool calls, you may hit Worker execution limits (10ms CPU time per invocation on the free tier, 30s on paid). Use waitUntil for background processing or break chains into multiple requests.

Concurrent session access: Durable Objects process one request at a time per instance. High-frequency updates to the same session serialize automatically—design for this or shard sessions.
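
One way to shard is to derive the Durable Object name deterministically from a session plus a user key. This is a sketch under assumptions: the shardedSessionName helper and its non-cryptographic hash are hypothetical, and the resulting name would be passed to idFromName as in Snippet 4:

```typescript
// Hypothetical sketch: spread a busy session across N Durable Object
// names so high-frequency writers don't all serialize behind one instance.
export function shardedSessionName(sessionId: string, userKey: string, shards = 4): string {
  // Cheap deterministic string hash (not cryptographic)
  let h = 0;
  for (const ch of userKey) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return `${sessionId}:${h % shards}`;
}
```

The same (sessionId, userKey) pair always maps to the same shard, so each user keeps sticky routing while load fans out across instances.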

Model unavailability: Cloudflare may rate-limit or temporarily remove models. Implement fallback logic:

edge-cases-1.ts
const models = ['@cf/meta/llama-3.1-8b-instruct', '@cf/mistral/mistral-7b-instruct-v0.2'];
for (const model of models) {
  try {
    return await ai.run(model, params);
  } catch (e) {
    // Fall through to the next model only on model-availability errors
    if (e.message.includes('model')) continue;
    throw e;
  }
}

Token overflow: Models have fixed context windows. Exceeding them throws errors. Always truncate or summarize history before sending.
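
A truncation pass can be a small pure function. This sketch uses a character budget as a rough stand-in for tokens (a common heuristic is ~4 characters per token, not an exact tokenizer); the trimHistory helper and its default budget are hypothetical:

```typescript
// Hypothetical helper: keep the newest messages that fit a character
// budget, dropping the oldest first.
type Msg = { role: string; content: string };

export function trimHistory(messages: Msg[], maxChars = 16000): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  // Walk newest-to-oldest, stopping when the budget would be exceeded
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = messages[i].content.length;
    if (used + cost > maxChars) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

Running the history through a helper like this before every ai.run() call replaces the fixed slice(-12) with a size-aware cut.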

Best Practices

  • Do truncate conversation history to fit model context windows; don't assume infinite memory
  • Do use Durable Object alarms for scheduled agent tasks, not persistent polling
  • Do cache tool results in KV or R2 to avoid redundant external API calls
  • Don't store sensitive data in Durable Object state without encryption; storage is durable but not encrypted at rest by default
  • Don't call Workers AI from the client side directly; always route through your Worker for rate limiting and prompt injection protection
  • Do monitor token usage in Cloudflare dashboard; set up alerts for unexpected spikes
  • Don't rely on specific model versions; Cloudflare updates @cf/ models; pin versions via custom model uploads if stability is critical
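
For the tool-result caching practice above, a stable cache key is the main subtlety: identical calls must produce identical keys regardless of argument order. A minimal sketch (the toolCacheKey helper and its key format are hypothetical; the key would be used with KV's get/put):

```typescript
// Hypothetical sketch: deterministic, order-independent cache key for a
// tool invocation, suitable as a KV key.
export function toolCacheKey(tool: string, args: Record<string, unknown>): string {
  // Sort argument names so {a, b} and {b, a} hash to the same key
  const stable = Object.keys(args)
    .sort()
    .map((k) => `${k}=${JSON.stringify(args[k])}`)
    .join('&');
  return `tool:${tool}:${stable}`;
}
```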

Pro Tips

  • Streaming responses: Use stream: true in ai.run() for real-time agent output. Reduces perceived latency significantly for long responses.
  • Structured output without tool calls: Add response_format: { type: 'json_object' } (where supported) to force valid JSON without manual parsing.
  • Batch inference: For non-interactive agents, queue multiple prompts and process with Promise.all—Workers AI handles concurrency well.
  • Custom models: Upload fine-tuned weights to Cloudflare's model catalog via wrangler ai model upload for specialized agents.
  • Cost optimization: Use smaller models (@cf/meta/llama-3.1-8b-instruct vs. 70B) for routing/classification, reserve large models for final generation.

Resources

Official Sources:

  • Cloudflare Workers AI Documentation (https://developers.cloudflare.com/workers-ai/)
  • Cloudflare Durable Objects Documentation (https://developers.cloudflare.com/durable-objects/)
  • Workers AI Model Catalog (https://developers.cloudflare.com/workers-ai/models/)
  • Wrangler CLI Reference (https://developers.cloudflare.com/workers/wrangler/commands/)
  • Cloudflare AI REST API (https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)

High-Signal Community References:

  • Cloudflare Community Forum - Workers AI (https://community.cloudflare.com/c/developers/workers-ai/85)
  • Cloudflare Blog: Workers AI Launch (https://blog.cloudflare.com/workers-ai/)

Final Thoughts

Running AI agents at the edge removes the infrastructure barrier that slows most teams down. You get global distribution, automatic scaling, and sub-100ms response times without managing a single GPU. The tradeoff is less control over model versions and some constraints on execution time—constraints that force better architecture.

Start with a simple stateless agent, add Durable Objects when you need memory, then layer in tool calling. Measure token costs early; they're predictable but not zero. The platform is mature enough for production—what matters now is your agent's logic, not your infrastructure.

Next step: Deploy the code from Step 2, then modify it to call a real API via tool calling. Ship it.

Preview Card Data

  • previewTitle: Run AI Agents on Cloudflare's Edge
  • previewDescription: Deploy stateful AI agents with Workers AI and Durable Objects. No GPU management, global scale, sub-100ms responses.
  • previewDateText: Technical Guide
  • previewReadTime: 8 min read
  • previewTags: ["Cloudflare", "AI", "Edge Computing", "Serverless"]

Image Plan

  • hero image idea: Abstract visualization of neural network nodes distributed across a world map, with connection lines converging on Cloudflare's orange cloud logo at center. Dark background with glowing orange and blue accents.
  • inline visual 1: Architecture diagram showing request flow—user → Cloudflare edge → Worker → Durable Object → Workers AI inference → response. Label latency targets at each hop.
  • inline visual 2: Code snippet screenshot with syntax highlighting for the tool-calling JSON parsing section, showing error handling pattern.
  • inline visual 3: Cloudflare dashboard screenshot mockup showing Workers AI token usage graph and Durable Object instance count.
  • alt text intent: All images emphasize "edge distribution" and "low latency" themes; avoid generic AI robot imagery.

Pro Tip: Verify the installation, run a real-world validation, and document rollback steps before taking your agent to production.