Claude Agent SDK Development

Production agents built on the Claude Agent SDK

The Claude Agent SDK (Python and TypeScript) is Anthropic's framework for building agents that plan, use tools, and complete multi-step tasks autonomously. Building a working prototype is straightforward. Building a production system that handles errors gracefully, manages costs intelligently, passes evals consistently, and runs reliably at scale requires experience. We run OpenClaw, our 12-agent production framework, on this stack every day.

12
Agents in OpenClaw framework
15+
Client workflows automated
67
Skills wired into agents
5-6
Tokens/sec, fine-tuned model generation speed

What We Build

Six agent capabilities we deliver.

Agent Loop Design

The agent loop, the cycle of perceive, plan, act, observe, is the core of any Claude agent. Getting it right means defining clear stopping conditions, handling ambiguous states, and knowing when to escalate to human review vs. retry autonomously. Poor loop design is the primary source of runaway costs and unreliable behavior.

  • Loop architecture and state management
  • Stopping condition design
  • Error classification and retry logic
  • Human-in-the-loop escalation patterns
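The loop and its stopping conditions can be sketched in a few lines. This is a minimal illustration, not SDK code: `step_fn` is a hypothetical stand-in for one model turn, and the step and retry budgets are example values.

```python
from enum import Enum, auto

class Outcome(Enum):
    DONE = auto()
    RETRY = auto()
    ESCALATE = auto()

MAX_STEPS = 8        # hard stopping condition: caps runaway loops
MAX_RETRIES = 2      # per-step retry budget before escalating

def run_agent(task, step_fn):
    """Minimal perceive-plan-act-observe loop with explicit stopping
    conditions. step_fn(task, history) stands in for one model turn
    and returns (outcome, observation)."""
    history = []
    retries = 0
    for _ in range(MAX_STEPS):
        outcome, observation = step_fn(task, history)
        history.append(observation)
        if outcome is Outcome.DONE:
            return "done", history
        if outcome is Outcome.ESCALATE:
            return "needs_human_review", history
        if outcome is Outcome.RETRY:
            retries += 1
            if retries > MAX_RETRIES:
                return "needs_human_review", history
    # Step budget exhausted: the agent never loops forever.
    return "needs_human_review", history
```

The point is that every exit path is explicit: success, escalation, retry exhaustion, and step-budget exhaustion each produce a defined terminal state instead of an open-ended loop.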

Tool Use & Function Calling

Agents become powerful when they can act: search the web, query a database, write a file, call an API, send a message. We design tool schemas that Claude calls reliably, handle parallel tool calls, manage result parsing, and implement tool chaining for complex workflows.

  • Tool schema design and documentation
  • Parallel tool call orchestration
  • Result validation and error recovery
  • Tool chaining for multi-step workflows
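A sketch of the two halves of tool use: a schema in the JSON Schema shape the Anthropic Messages API expects, and a dispatcher that validates and runs the model's tool calls. The tool name, handler, and `dispatch_tool_call` helper are illustrative, not part of the SDK.

```python
# A tool definition in the shape the Anthropic API expects.
# Descriptions matter: Claude chooses and fills tools from them.
SEARCH_TOOL = {
    "name": "search_web",
    "description": "Search the web and return the top result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query text."},
            "max_results": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

def dispatch_tool_call(call, handlers):
    """Run one tool call from the model. `call` mirrors a tool_use block:
    {"name": ..., "input": {...}}. Errors become structured results the
    agent can observe and recover from, never crashes."""
    handler = handlers.get(call["name"])
    if handler is None:
        return {"is_error": True, "content": f"unknown tool: {call['name']}"}
    try:
        return {"is_error": False, "content": handler(**call["input"])}
    except Exception as exc:
        # Surface the failure to the model so the loop can retry or escalate.
        return {"is_error": True, "content": f"{type(exc).__name__}: {exc}"}
```

Returning errors as data rather than raising is what lets the agent loop classify a failed tool call and decide between retry and escalation.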

Prompt Caching

Prompt caching is one of the highest-value optimizations in Claude API usage. Long system prompts, instruction sets, reference documents, and conversation context can be cached, reducing cost by up to 90% and latency by 80% for cache hits. We design cache-friendly prompt architectures and monitor cache hit rates to ensure you get the benefit in practice.

  • Cache-optimal prompt structure design
  • Cache key management
  • Hit rate monitoring and optimization
  • Cost savings measurement and reporting
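Cache-friendly structure means putting the long, stable prefix first and marking it with `cache_control`, as the Anthropic Messages API supports. A minimal request-construction sketch; the helper name and model string are illustrative, and the prefix must stay byte-identical across calls or the cache misses.

```python
def build_cached_request(system_prompt: str, reference_doc: str,
                         user_msg: str) -> dict:
    """Build a Messages API request body with the long, stable prefix
    marked cacheable. Stable content first, volatile content last."""
    return {
        "model": "claude-sonnet-4-5",   # model name is illustrative
        "max_tokens": 1024,
        "system": [
            # Cached blocks: long instructions and reference material.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},
        ],
        # The per-run user message is the only part reprocessed on a hit.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

On subsequent calls with the same prefix, cache-hit metrics appear in the API response's usage fields, which is what we monitor to verify the savings actually materialize.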

Extended Thinking

Claude's extended thinking mode lets the model reason through complex problems before producing a final answer. It is substantially more accurate for multi-step reasoning, code generation, and tasks requiring planning. We scope which workflows benefit from extended thinking (and which do not: it is slower and more expensive) and integrate it correctly.

  • Use case suitability assessment
  • Budget allocation for thinking tokens
  • Streaming thinking output handling
  • Quality improvement measurement vs. standard mode
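Enabling extended thinking is a request-level setting: the `thinking` parameter carries a token budget for reasoning, which must fit inside `max_tokens`. A construction-only sketch; the helper name, model string, and budget values are illustrative.

```python
def build_thinking_request(prompt: str, thinking_budget: int = 4096) -> dict:
    """Messages API request body with extended thinking enabled.
    budget_tokens caps the reasoning tokens (billed as output) and
    must be smaller than max_tokens."""
    return {
        "model": "claude-sonnet-4-5",
        # Leave headroom above the thinking budget for the final answer.
        "max_tokens": thinking_budget + 2048,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Because thinking tokens are billed, the budget is a per-task-tier decision: small or zero for routine steps, larger only where eval results show the accuracy gain pays for itself.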

Multi-Agent Orchestration

Complex tasks benefit from specialized agents: one for research, one for writing, one for validation. Our OpenClaw framework coordinates 12 specialized agents. We apply the same patterns to client builds: orchestrator agents that delegate to specialists, parallel agent runs for speed, state passing between agents, and fault isolation so one agent failure does not cascade.

  • Orchestrator and specialist agent design
  • Parallel agent execution
  • State and context passing between agents
  • Fault isolation and partial failure recovery
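The fan-out with fault isolation can be sketched as follows. A minimal illustration assuming a non-empty set of subtasks; each "agent" here is just a callable standing in for a full specialist, and a failure in one yields a partial result instead of sinking the run.

```python
from concurrent.futures import ThreadPoolExecutor

def run_specialists(subtasks, agents):
    """Fan independent subtasks out to specialist agents in parallel.
    subtasks maps agent name -> payload; agents maps name -> callable.
    Returns (results, failures) so the orchestrator can decide whether
    partial output is usable or the run needs escalation."""
    def run_one(name, payload):
        try:
            return name, agents[name](payload), None
        except Exception as exc:
            return name, None, f"{type(exc).__name__}: {exc}"

    results, failures = {}, {}
    names, payloads = zip(*subtasks.items())
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        for name, output, error in pool.map(run_one, names, payloads):
            if error is None:
                results[name] = output
            else:
                failures[name] = error
    return results, failures
```

Separating `results` from `failures` is the fault-isolation piece: the orchestrator inspects both and applies partial-failure policy instead of letting one exception cancel the sibling agents' work.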

Evaluation & Testing

Production agents need evals: automated test suites that measure output quality across a representative sample of real inputs. We design eval frameworks with pass/fail criteria and regression tests that run on every agent update, covering latency, cost, and output quality. Agents that cannot be measured cannot be trusted in production.

  • Eval dataset design from real examples
  • Pass/fail criteria definition
  • Regression test automation
  • Latency and cost benchmarking
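The core of an eval harness fits in a few lines. A minimal sketch assuming a non-empty dataset: each case pairs an input with a `check` predicate encoding its pass criterion (an exact-match check, a schema check, or an LLM-as-judge call behind the same interface).

```python
def run_evals(cases, agent_fn, threshold=0.85):
    """Run pass/fail evals over a dataset of (input, check) pairs,
    where check(output) -> bool is that case's pass criterion.
    Returns the pass rate and whether the build clears the threshold."""
    passed = sum(1 for inp, check in cases if check(agent_fn(inp)))
    rate = passed / len(cases)
    return rate, rate >= threshold
```

Wired into CI, the boolean gate is what makes this regression testing: a prompt change that drops the pass rate below the agreed threshold fails the build before it reaches production.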

How We Build Agents

Four stages from spec to production.

Step 01

Task and workflow specification

We work with your team to define the agent's scope precisely: what triggers it, what it does at each step, what a correct output looks like, what the acceptable failure modes are, and where human review is required. Agents built to a vague spec fail in vague ways. We insist on precision before we write code.

Step 02

Architecture and cost modeling

Before implementation, we design the agent architecture and model the expected API costs. What model tier? Where does caching apply? Which tasks justify extended thinking? What is the expected token usage per run at the target volume? Teams that skip cost modeling get surprise bills. We eliminate the surprises.
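The cost model itself is simple arithmetic once the token profile is known. A sketch with illustrative numbers: the default prices are example USD-per-million-token figures, not a quote, and the ~90% cache-read discount is the headline figure cited above.

```python
def monthly_cost(runs_per_day, in_tokens, out_tokens, cached_tokens,
                 price_in=3.0, price_out=15.0, cache_read_discount=0.9):
    """Rough monthly API cost for one agent workflow. Prices are
    illustrative USD per million tokens; cached input tokens are
    billed at a steep discount on cache hits."""
    fresh_in = in_tokens - cached_tokens
    per_run = (
        fresh_in * price_in
        + cached_tokens * price_in * (1 - cache_read_discount)
        + out_tokens * price_out
    ) / 1_000_000
    return per_run * runs_per_day * 30
```

Running this per candidate architecture (with and without caching, per model tier) is what turns "what will this cost at volume?" into a number before any code is written.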

Step 03

Build with progressive testing

We build incrementally: core agent loop first, basic tool use, then progressive complexity. Each stage is tested against real inputs before the next layer is added. This approach catches architectural problems early, when they are cheap to fix, rather than after the full system is built.

Step 04

Eval framework and production handover

We build the eval framework alongside the agent, not after. At handover, you receive the agent code, the eval suite, a cost monitoring setup, and a runbook covering common failure modes and how to debug them. Your team can operate and extend the agent without us.

OpenClaw in Production

How a 12-agent system runs 15 client SEO workflows daily.

The Challenge

High-quality SEO work requires coordinated execution across research, content strategy, technical analysis, and reporting. A single agent handling all of these functions produces shallow results. A team of human specialists is expensive and slow. There had to be a better architecture.

Our Solution

We built OpenClaw: a 12-agent Claude-based orchestration system where each agent is a specialist. The orchestrator receives a client task, decomposes it, routes subtasks to the right specialist agent (researcher, writer, validator, schema generator, reporter), collects results, and produces the final deliverable. Agents run in parallel where the task structure allows. Costs are managed through aggressive prompt caching and model tier selection. Quality is measured through an 8-dimension automated scoring system.

Results Achieved

12
Specialized agents
Each with defined scope and evals
Yes
Parallel agent runs
For independent subtask clusters
8-dim
Quality validation
Automated, every output
15+
Client workflows automated
Across B2B SaaS, e-commerce, travel

Agent Architecture Patterns

The patterns behind reliable production agents.

Most agent failures trace to a small set of architectural mistakes. These are the patterns we apply to prevent them.

Minimal footprint principle

  • Request only the permissions the task requires
  • Prefer reversible actions over irreversible ones
  • Confirm before irreversible writes
  • Log every action with full context
  • Surface ambiguity to humans, do not guess

Structured output design

  • Define the output schema before the system prompt
  • Use JSON mode for all structured data
  • Validate every output against the schema
  • Include confidence scores where applicable
  • Never let the agent self-report success without validation
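The last rule above is mechanical to enforce: parse and validate every output before accepting it. A minimal stdlib-only sketch; the schema fields and helper name are hypothetical, and a production build would typically use full JSON Schema validation instead of this type map.

```python
import json

# Illustrative required-field schema for one output type.
REPORT_SCHEMA = {"title": str, "score": float, "sources": list}

def validate_output(raw: str):
    """Parse agent JSON output and check it against the schema.
    Returns (ok, parsed_data_or_error). The agent's own claim of
    success is never trusted; only a passing validation is."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    for field, ftype in REPORT_SCHEMA.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return False, f"wrong type for {field}"
    return True, data
```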

Cost control patterns

  • Cache long system prompts and reference docs
  • Use Haiku for classification, Sonnet for generation
  • Set token budgets per task tier
  • Monitor cost per run, not just total spend
  • Alert on budget overruns before they compound
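The tier-routing rule above reduces to a small policy table consulted before each call. A sketch with illustrative model strings and token budgets, not a recommendation of specific limits.

```python
# Illustrative policy: cheap model and tight budget for classification,
# stronger model and larger budget for generation.
TASK_POLICY = {
    "classify": {"model": "claude-haiku-4-5", "max_tokens": 256},
    "generate": {"model": "claude-sonnet-4-5", "max_tokens": 4096},
}

def request_for(task_type: str, prompt: str) -> dict:
    """Build a request body under the per-tier model and token budget,
    so no call can silently use a bigger model than its task warrants."""
    policy = TASK_POLICY[task_type]
    return {
        "model": policy["model"],
        "max_tokens": policy["max_tokens"],
        "messages": [{"role": "user", "content": prompt}],
    }
```

Centralizing the policy in one table also gives cost monitoring a single place to attribute spend per task tier.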

Eval-driven development

  • Define eval criteria before writing the agent
  • Sample at least 50 real examples per use case
  • Run evals on every prompt change
  • Track pass rate over time, not just current run
  • Fail builds that drop below the quality threshold

FAQ

Claude Agent SDK development frequently asked questions

What is the Claude Agent SDK?

The Claude Agent SDK is Anthropic's Python and TypeScript library for building AI agents on top of Claude. It provides the primitives for agent loops, tool use, multi-turn conversations, and integrations with external systems. It is the programmatic alternative to the Claude Code CLI: where Claude Code is for interactive developer use, the Agent SDK is for building applications and automation systems that run agents in production.

When do you need an agent instead of a simple API call?

A simple API call is appropriate when the task is single-step, the input and output are fully specified, and no tool use or intermediate reasoning is needed. An agent is appropriate when the task has multiple steps where each step depends on the result of the previous one, when the system needs to call external tools, when the path through the task is not fully predictable in advance, or when the task requires planning before acting. Most production AI use cases that go beyond content generation need agents.

What is prompt caching and how much does it save?

Prompt caching is an Anthropic API feature that stores the key-value cache of a prompt prefix (your system prompt, instructions, reference documents) so it does not need to be reprocessed on subsequent calls. For agents with long system prompts, this reduces cost by up to 90% and latency by up to 80% for cache hits. The savings are significant at scale: a team sending 1,000 agent calls per day can reduce API spend by 60-80% with proper cache design.

What is extended thinking and when is it worth using?

Extended thinking gives Claude a dedicated token budget to reason through a problem before producing its final response. It substantially improves performance on tasks requiring multi-step reasoning, planning, architecture design, and complex code generation. The tradeoff is cost (thinking tokens are billed) and latency (reasoning takes time). We use it selectively for tasks where accuracy matters more than speed.

Can agents recover from errors on their own?

With proper design, yes. We implement failure classification (tool error vs. API error vs. logical error), retry logic with exponential backoff, partial success handling, and escalation to human review for failures the agent cannot resolve. The key is designing the error states explicitly before building: agents that encounter errors they were not designed to handle tend to loop or produce garbage output.

How do you test an agent before production?

We build an eval dataset from 50-100 real examples of the target task and define explicit pass/fail criteria for each output dimension. The eval runs on every significant code change. We use automated scoring where the output has a checkable property, and LLM-as-judge scoring where quality requires interpretation. Agents go to production only after hitting the agreed threshold: typically an 85-90% pass rate.

Ready to Build a Production Agent?

Tell us what you want to automate.

We scope agent builds quickly. A 30-minute call is enough to determine whether what you want is feasible, what it will cost to run, and how long it will take to build.

  • Free cost modeling before we start
  • Eval framework included in every build
  • Full ownership: your code, your infrastructure