
Running AI Agents in Cloudflare Workers

Amit Hariyale

Full Stack Web Developer, Gigawave

8 min read · April 16, 2026

Running AI agents in Cloudflare Workers matters in real projects because weak implementation choices create hard-to-debug failures and an inconsistent user experience.

This guide uses focused, production-oriented steps and code examples grounded in official references.

Key Concepts Covered

  • Cloudflare Workers AI: Serverless GPU inference service running on Cloudflare's edge network
  • Durable Objects: Cloudflare's strongly consistent stateful compute primitive for Workers
  • AI binding: Runtime-injected client for Workers AI, providing authenticated access without API keys
  • Tool calling: Pattern where LLMs output structured function calls for external execution
  • Context window: Maximum token limit for model input

Context Setup

We start with minimal setup, then move to implementation patterns and validation checkpoints for running AI agents in Cloudflare Workers.

Problem Breakdown

  • Unclear setup path for running AI agents in Cloudflare Workers
  • Inconsistent implementation patterns
  • Missing validation for edge cases

Solution Overview

Apply a step-by-step architecture: setup, core implementation, validation, and performance checks.

Step 1: Define prerequisites and expected behavior.


Step 2: Implement a minimal working baseline.


Step 3: Add robust handling for non-happy paths.


Additional Implementation Notes

  • Step 4: Improve structure for reuse and readability.
  • Step 5: Validate with realistic usage scenarios.

Best Practices

  • Keep implementation modular and testable
  • Use one clear source of truth for configuration
  • Validate behavior before optimization

Pro Tips

  • Prefer concise code snippets with clear intent
  • Document edge cases and trade-offs
  • Use official docs for API-level decisions

Final Thoughts

Treat running AI agents in Cloudflare Workers as an iterative build: baseline first, then reliability and performance hardening.


Blog Identity

  • title: Running AI Agents in Cloudflare Workers
  • slug: run-ai-agents-cloudflare-workers
  • primary topic keyword: Cloudflare AI agents
  • target stack: Cloudflare Workers, AI/ML, TypeScript/JavaScript

SEO Metadata

  • seoTitle: Run AI Agents in Cloudflare Workers: A Practical Guide
  • metaDescription: Learn how to deploy and run AI agents directly on Cloudflare's edge network using Workers AI, with step-by-step implementation and real-world patterns.
  • suggestedTags: ["Cloudflare Workers", "AI agents", "edge computing", "Workers AI", "serverless AI", "LLM deployment"]
  • suggestedReadTime: 8 minutes

Hero Hook

You built a prototype AI agent that works perfectly on your laptop. Then you try to deploy it. Suddenly you're managing GPU instances, wrestling with cold starts, and watching your infrastructure bill spiral. The edge was supposed to solve this, but most platforms still force you to choose between latency and complexity.

Cloudflare Workers AI changes the calculation. You can run inference directly on their edge network—no containers to manage, no clusters to provision, and response times measured in milliseconds for users worldwide. This isn't a future promise; it's production-ready now with models from Meta, Mistral, and others available via API.

Context Setup

Cloudflare Workers AI provides serverless GPU inference at the edge. Instead of shipping model weights to your infrastructure, you call models hosted on Cloudflare's network. Your agent logic runs in a Worker (V8 isolate), which can chain multiple inference calls, maintain state via Durable Objects or KV, and respond to requests globally with minimal latency.

Prerequisites:

  • Cloudflare account with Workers AI enabled
  • Wrangler CLI installed (npm install -g wrangler)
  • Basic familiarity with TypeScript and fetch APIs
  • Understanding of LLM prompting patterns

Problem Breakdown

The deployment trap: Local AI development often uses Ollama, LM Studio, or direct API calls to OpenAI. These patterns break when you need global scale. Self-hosting requires GPU machines, model serving infrastructure, and operational expertise most teams lack.

Edge-specific failure points:

  • Cold starts on serverless GPU: Some platforms take 10-30 seconds to spin up inference containers
  • State management: Agents need memory across turns; naive implementations lose context
  • Model availability: Not all models run at the edge; you need to verify compatibility
  • Cost unpredictability: Per-token pricing varies dramatically; unbounded agent loops become expensive fast

Symptoms you'll recognize: Agents that time out, inconsistent response times across regions, ballooning infrastructure costs, or architectural complexity that slows iteration.
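
The cost-unpredictability symptom is worth guarding against explicitly. As a minimal sketch (the makeLoopGuard helper, its defaults, and its error messages are hypothetical, not part of any Cloudflare API), an agent loop can fail fast once a step or token budget is exhausted instead of burning budget silently:

```typescript
// Hypothetical sketch: hard cap on agent iterations and token spend so a
// runaway tool-calling loop fails fast instead of running up the bill.
export function makeLoopGuard(maxSteps = 5, maxTokens = 4000) {
  let steps = 0;
  let tokens = 0;
  // Call once per inference step with the tokens consumed by that step
  return (tokensThisStep: number): void => {
    steps += 1;
    tokens += tokensThisStep;
    if (steps > maxSteps) throw new Error(`Agent exceeded ${maxSteps} steps`);
    if (tokens > maxTokens) throw new Error(`Agent exceeded ${maxTokens} tokens`);
  };
}
```

Calling the guard at the top of each agent iteration turns an unbounded loop into a bounded, observable failure.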

Solution Overview

We'll build a stateful AI agent using Cloudflare Workers AI for inference, Durable Objects for conversation memory, and a simple tool-calling pattern. This approach keeps compute at the edge, maintains sub-100ms response times for cached regions, and scales to zero when idle.

Why this over alternatives:

  • vs. self-hosted: No infrastructure management, automatic global distribution
  • vs. centralized APIs (OpenAI, Anthropic): Lower latency for distributed users, no egress costs within Cloudflare's network
  • vs. other edge platforms: Native integration with Workers ecosystem (KV, R2, D1, Durable Objects)

Implementation Steps

Step 1: Initialize Project and Configure Workers AI

Create a new Worker project and enable AI binding.

implementation-steps-1.sh
mkdir cf-ai-agent && cd cf-ai-agent
wrangler init --yes

Add the AI binding to wrangler.toml:

implementation-steps-2.toml
name = "ai-agent"
main = "src/index.ts"
compatibility_date = "2024-06-14"

[ai]
binding = "AI"

Install dependencies:

implementation-steps-3.sh
npm install @cloudflare/ai

Step 2: Create Basic Inference Handler

Implement a Worker that calls Workers AI with the Llama 3.1 model.

implementation-steps-4.ts
// src/index.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    const messages = [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Explain edge computing in one sentence.' }
    ];

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages,
      max_tokens: 100,
    });

    return Response.json(response);
  },
};

Deploy and test:

implementation-steps-5.sh
wrangler deploy
curl https://ai-agent.YOUR_SUBDOMAIN.workers.dev

Step 3: Add Stateful Conversation with Durable Objects

Create a Durable Object to maintain conversation history across requests.

implementation-steps-6.ts
// src/AgentSession.ts
import { Ai } from '@cloudflare/ai';
import type { Env } from './index';

export class AgentSession {
  private state: DurableObjectState;
  private env: Env;
  private messages: Array<{ role: string; content: string }> = [];

  // Durable Objects receive env in the constructor, not in fetch()
  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
    // storage.get() is async; block other requests until history has loaded
    this.state.blockConcurrencyWhile(async () => {
      this.messages = (await this.state.storage.get('messages')) || [];
    });
  }

  async fetch(request: Request): Promise<Response> {
    const ai = new Ai(this.env.AI);
    const { message } = await request.json();

    this.messages.push({ role: 'user', content: message });

    // Keep context window manageable
    const contextWindow = this.messages.slice(-10);

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant with tool access.' },
        ...contextWindow,
      ],
      max_tokens: 500,
    });

    this.messages.push({ role: 'assistant', content: response.response });
    await this.state.storage.put('messages', this.messages);

    return Response.json({ reply: response.response });
  }
}

Update wrangler.toml:

implementation-steps-7.toml
[[durable_objects.bindings]]
name = "AGENT_SESSION"
class_name = "AgentSession"

[[migrations]]
tag = "v1"
new_classes = ["AgentSession"]

Step 4: Implement Tool Calling Pattern

Add structured tool use so your agent can interact with external systems.

implementation-steps-8.ts
// src/tools.ts
export const tools = [
  {
    name: 'get_weather',
    description: 'Get current weather for a location',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'City name' }
      },
      required: ['location']
    }
  }
];

export async function executeTool(name: string, args: any): Promise<string> {
  if (name === 'get_weather') {
    // Call external weather API or cache
    return `Weather in ${args.location}: 72°F, sunny`;
  }
  throw new Error(`Unknown tool: ${name}`);
}

Integrate into the agent loop:

implementation-steps-9.ts
// In AgentSession.fetch()
const toolPrompt = `You have access to tools: ${JSON.stringify(tools)}.
Respond with JSON: {"tool": "name", "args": {...}} or {"response": "..."}`;

const aiResponse = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: toolPrompt },
    ...contextWindow
  ],
  max_tokens: 500,
});

// Parse and handle tool calls
let parsed;
try {
  parsed = JSON.parse(aiResponse.response);
} catch {
  return Response.json({ reply: aiResponse.response });
}

if (parsed.tool) {
  const toolResult = await executeTool(parsed.tool, parsed.args);
  this.messages.push({ role: 'assistant', content: `Tool result: ${toolResult}` });
  // Re-run for final response
}

Code Snippets

Snippet 1: Basic Worker with AI Binding

  • filename: src/index.ts
  • language: typescript
  • purpose: Minimal viable AI inference in a Worker
  • code: |
code-snippet-1.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are helpful.' },
        { role: 'user', content: 'Hello!' }
      ],
      max_tokens: 100,
    });

    return Response.json(response);
  },
};

Snippet 2: Durable Object for Stateful Sessions

  • filename: src/AgentSession.ts
  • language: typescript
  • purpose: Persist conversation history across requests
  • code: |
code-snippet-2.ts
import { Ai } from '@cloudflare/ai';

export interface Env {
  AI: Ai;
}

export class AgentSession {
  private state: DurableObjectState;
  private env: Env;
  private messages: Array<{ role: string; content: string }> = [];

  // Durable Objects receive env via the constructor, not fetch()
  constructor(state: DurableObjectState, env: Env) {
    this.state = state;
    this.env = env;
  }

  async fetch(request: Request): Promise<Response> {
    const ai = new Ai(this.env.AI);
    const { message } = await request.json();

    const stored = await this.state.storage.get<Array<{ role: string; content: string }>>('messages');
    this.messages = stored || [];
    this.messages.push({ role: 'user', content: message });

    const response = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: this.messages.slice(-12),
      max_tokens: 500,
    });

    this.messages.push({ role: 'assistant', content: response.response });
    await this.state.storage.put('messages', this.messages);

    return Response.json({ reply: response.response, messages: this.messages.length });
  }
}

Snippet 3: Wrangler Configuration

  • filename: wrangler.toml
  • language: toml
  • purpose: Complete configuration with AI binding and Durable Objects
  • code: |
code-snippet-3.toml
name = "ai-agent"
main = "src/index.ts"
compatibility_date = "2024-06-14"

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "AGENT_SESSION"
class_name = "AgentSession"

[[migrations]]
tag = "v1"
new_classes = ["AgentSession"]

Snippet 4: Router to Durable Objects

  • filename: src/index.ts (updated)
  • language: typescript
  • purpose: Route requests to specific conversation sessions
  • code: |
code-snippet-4.ts
import { Ai } from '@cloudflare/ai';
import { AgentSession } from './AgentSession';

export interface Env {
  AI: Ai;
  AGENT_SESSION: DurableObjectNamespace<AgentSession>;
}

export { AgentSession };

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const sessionId = url.pathname.slice(1) || 'default';

    const id = env.AGENT_SESSION.idFromName(sessionId);
    const session = env.AGENT_SESSION.get(id);

    return session.fetch(request);
  },
};

Code Explanation

Key mechanics:

  • @cf/meta/llama-3.1-8b-instruct: Cloudflare-hosted Llama 3.1 8B. Runs on their edge GPUs; you pay per 1K tokens, not per compute time. The @cf/ prefix indicates a Cloudflare-hosted model.
  • env.AI binding: Injected at runtime by Cloudflare. No API keys to manage, no external network calls for authentication. The binding routes to the nearest inference endpoint automatically.
  • DurableObjectState.storage: Transactional key-value storage tied to the Durable Object instance. Survives Worker restarts, maintains consistency across concurrent requests to the same session.
  • idFromName(sessionId): Deterministic ID generation. Same session ID always routes to the same Durable Object instance, giving you sticky sessions without load balancer configuration.

What can go wrong:

  • Storage limits: Durable Objects have 1GB storage per instance. Unbounded conversation history will eventually fail. The slice(-12) in Snippet 2 is load-bearing—remove it and long conversations crash.
  • JSON parsing failures: LLMs don't always return valid JSON for tool calls. The try/catch in Step 4 is essential; without it, malformed responses throw unhandled exceptions.
  • Cold starts on Durable Objects: First request to a new session incurs ~50-100ms initialization. Design for this—don't assume sub-10ms for the first message in a new conversation.
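
The JSON-parsing failure mode above can be isolated into one defensive helper. This sketch assumes the {"tool": ...} / {"response": ...} convention from Step 4; the parseAgentReply helper and its return shape are illustrative, not a Workers AI API:

```typescript
// Hypothetical helper: interpret a raw model reply as either a tool call
// or plain text, never throwing on malformed JSON.
type ToolCall = { kind: 'tool'; name: string; args: Record<string, unknown> };
type PlainReply = { kind: 'text'; text: string };

export function parseAgentReply(raw: string): ToolCall | PlainReply {
  try {
    const parsed = JSON.parse(raw);
    if (parsed && typeof parsed.tool === 'string') {
      return { kind: 'tool', name: parsed.tool, args: parsed.args ?? {} };
    }
    if (parsed && typeof parsed.response === 'string') {
      return { kind: 'text', text: parsed.response };
    }
  } catch {
    // Not JSON at all; fall through and treat the reply as plain text
  }
  return { kind: 'text', text: raw };
}
```

Centralizing the fallback here means the agent loop only branches on kind and never sees an unhandled exception from a sloppy model reply.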

Validation Checklist

  • [ ] wrangler deploy completes without errors
  • [ ] curl https://your-worker.workers.dev returns valid JSON with response or reply field
  • [ ] Second request to same session ID returns faster (Durable Object warm)
  • [ ] Conversation history persists across 5+ sequential requests
  • [ ] Token usage visible in Cloudflare dashboard under Workers AI
  • [ ] 12+ message history still functions (context window management)
  • [ ] Invalid JSON from model is handled gracefully (no 500 errors)

Expected behavior: Sub-200ms response times for cached regions, automatic failover to available inference nodes, zero-downtime deployments on wrangler deploy.

Edge Cases

Long-running agent loops: If your agent chains multiple tool calls, you may hit Worker execution limits (10ms CPU time per invocation on the free tier, 30s on paid). Use waitUntil for background processing or break chains into multiple requests.

Concurrent session access: Durable Objects process one request at a time per instance. High-frequency updates to the same session serialize automatically—design for this or shard sessions.
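
One way to shard is to derive the Durable Object name deterministically from a session plus a user key. This is a sketch under assumptions: the shardedSessionName helper and its non-cryptographic hash are hypothetical, and the resulting name would be passed to idFromName as in Snippet 4:

```typescript
// Hypothetical sketch: spread a busy session across N Durable Object
// names so high-frequency writers don't all serialize behind one instance.
export function shardedSessionName(sessionId: string, userKey: string, shards = 4): string {
  // Cheap deterministic string hash (not cryptographic)
  let h = 0;
  for (const ch of userKey) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return `${sessionId}:${h % shards}`;
}
```

The same (sessionId, userKey) pair always maps to the same shard, so each user keeps sticky routing while load fans out across instances.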

Model unavailability: Cloudflare may rate-limit or temporarily remove models. Implement fallback logic:

edge-cases-1.ts
const models = ['@cf/meta/llama-3.1-8b-instruct', '@cf/mistral/mistral-7b-instruct-v0.2'];
for (const model of models) {
  try {
    return await ai.run(model, params);
  } catch (e) {
    // Fall through to the next model only on model-availability errors
    if (e.message.includes('model')) continue;
    throw e;
  }
}

Token overflow: Models have fixed context windows. Exceeding them throws errors. Always truncate or summarize history before sending.
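
A truncation pass can be a small pure function. This sketch uses a character budget as a rough stand-in for tokens (a common heuristic is ~4 characters per token, not an exact tokenizer); the trimHistory helper and its default budget are hypothetical:

```typescript
// Hypothetical helper: keep the newest messages that fit a character
// budget, dropping the oldest first.
type Msg = { role: string; content: string };

export function trimHistory(messages: Msg[], maxChars = 16000): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  // Walk newest-to-oldest, stopping when the budget would be exceeded
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = messages[i].content.length;
    if (used + cost > maxChars) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

Running the history through a helper like this before every ai.run() call replaces the fixed slice(-12) with a size-aware cut.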

Best Practices

  • Do truncate conversation history to fit model context windows; don't assume infinite memory
  • Do use Durable Object alarms for scheduled agent tasks, not persistent polling
  • Do cache tool results in KV or R2 to avoid redundant external API calls
  • Don't store sensitive data in Durable Object state without encryption; storage is durable but not encrypted at rest by default
  • Don't call Workers AI from the client side directly; always route through your Worker for rate limiting and prompt injection protection
  • Do monitor token usage in Cloudflare dashboard; set up alerts for unexpected spikes
  • Don't rely on specific model versions; Cloudflare updates @cf/ models; pin versions via custom model uploads if stability is critical
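
For the tool-result caching practice above, a stable cache key is the main subtlety: identical calls must produce identical keys regardless of argument order. A minimal sketch (the toolCacheKey helper and its key format are hypothetical; the key would be used with KV's get/put):

```typescript
// Hypothetical sketch: deterministic, order-independent cache key for a
// tool invocation, suitable as a KV key.
export function toolCacheKey(tool: string, args: Record<string, unknown>): string {
  // Sort argument names so {a, b} and {b, a} hash to the same key
  const stable = Object.keys(args)
    .sort()
    .map((k) => `${k}=${JSON.stringify(args[k])}`)
    .join('&');
  return `tool:${tool}:${stable}`;
}
```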

Pro Tips

  • Streaming responses: Use stream: true in ai.run() for real-time agent output. Reduces perceived latency significantly for long responses.
  • Structured output without tool calls: Add response_format: { type: 'json_object' } (where supported) to force valid JSON without manual parsing.
  • Batch inference: For non-interactive agents, queue multiple prompts and process with Promise.all—Workers AI handles concurrency well.
  • Custom models: Upload fine-tuned weights to Cloudflare's model catalog via wrangler ai model upload for specialized agents.
  • Cost optimization: Use smaller models (@cf/meta/llama-3.1-8b-instruct vs. 70B) for routing/classification, reserve large models for final generation.

Resources

Official Sources:

  • Cloudflare Workers AI Documentation (https://developers.cloudflare.com/workers-ai/)
  • Cloudflare Durable Objects Documentation (https://developers.cloudflare.com/durable-objects/)
  • Workers AI Model Catalog (https://developers.cloudflare.com/workers-ai/models/)
  • Wrangler CLI Reference (https://developers.cloudflare.com/workers/wrangler/commands/)
  • Cloudflare AI REST API (https://developers.cloudflare.com/api/operations/workers-ai-post-run-model)

High-Signal Community References:

  • Cloudflare Community Forum - Workers AI (https://community.cloudflare.com/c/developers/workers-ai/85)
  • Cloudflare Blog: Workers AI Launch (https://blog.cloudflare.com/workers-ai/)

Final Thoughts

Running AI agents at the edge removes the infrastructure barrier that slows most teams down. You get global distribution, automatic scaling, and sub-100ms response times without managing a single GPU. The tradeoff is less control over model versions and some constraints on execution time—constraints that force better architecture.

Start with a simple stateless agent, add Durable Objects when you need memory, then layer in tool calling. Measure token costs early; they're predictable but not zero. The platform is mature enough for production—what matters now is your agent's logic, not your infrastructure.

Next step: Deploy the code from Step 2, then modify it to call a real API via tool calling. Ship it.

Preview Card Data

  • previewTitle: Run AI Agents on Cloudflare's Edge
  • previewDescription: Deploy stateful AI agents with Workers AI and Durable Objects. No GPU management, global scale, sub-100ms responses.
  • previewDateText: Technical Guide
  • previewReadTime: 8 min read
  • previewTags: ["Cloudflare", "AI", "Edge Computing", "Serverless"]

Image Plan

  • hero image idea: Abstract visualization of neural network nodes distributed across a world map, with connection lines converging on Cloudflare's orange cloud logo at center. Dark background with glowing orange and blue accents.
  • inline visual 1: Architecture diagram showing request flow—user → Cloudflare edge → Worker → Durable Object → Workers AI inference → response. Label latency targets at each hop.
  • inline visual 2: Code snippet screenshot with syntax highlighting for the tool-calling JSON parsing section, showing error handling pattern.
  • inline visual 3: Cloudflare dashboard screenshot mockup showing Workers AI token usage graph and Durable Object instance count.
  • alt text intent: All images emphasize "edge distribution" and "low latency" themes; avoid generic AI robot imagery.

Pro Tip: Verify the installation, run a real-world validation, and document rollback steps before taking your agent to production.