What Claude Code's Source Code Reveals About Building Production AI Agents
Anthropic accidentally leaked Claude Code's source. I read through it. Here are 6 architecture patterns that are changing how I build agents for clients.
On March 31, Anthropic accidentally shipped Claude Code's full TypeScript source in an NPM package update. About 1,900 files. Gone public, then yanked — but not before the internet had it.
Haseeb Qureshi published a detailed breakdown of what's actually in there. I went through it. Here is what stood out from a builder's perspective — not the drama, but the decisions.
These are patterns from a system that runs millions of agentic sessions. The harness, not the model, is where the production experience lives.
The Four-Layer Compaction Strategy
Context management is the hardest unsolved problem in production agents. Claude Code has four distinct layers:
Layer 1 — Proactive: Monitor token counts before API calls. Summarize older messages before you hit the limit.
Layer 2 — Reactive: Catch prompt_too_long errors, compact retroactively, retry. The user sees nothing.
Layer 3 — Snip: For headless/SDK automation where memory must stay bounded — truncate instead of summarize.
Layer 4 — Context Collapse: Compress verbose tool outputs mid-conversation. Persist each collapse as a ContextCollapseCommitEntry record so it can be selectively reversed.
Most agent builders I talk to handle context management with one strategy: hope the conversation stays short. Claude Code has four strategies with fallbacks.
The practical takeaway: before you ship an agent that runs long sessions, design your compaction ladder. What happens at 50k tokens? At 100k? After a tool result dumps 10,000 lines of logs?
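A compaction ladder can be sketched as a small decision function run before each API call. This is not Claude Code's implementation — the thresholds, type names, and the choice of only two rungs here are my own illustration:

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Hypothetical thresholds; tune these to your model's context window.
const PROACTIVE_LIMIT = 50_000; // tokens before we proactively summarize
const TOOL_OUTPUT_LIMIT = 2_000; // chars before a tool result is "verbose"

type CompactionAction = "none" | "collapse-tool-output" | "summarize-oldest";

// Decide which rung of the ladder applies before the next API call.
function planCompaction(totalTokens: number, messages: Message[]): CompactionAction {
  // Layer-4-style move first: collapsing one verbose tool output is cheap
  // and, if you persist the collapse as a record, reversible.
  const hasVerboseToolOutput = messages.some(
    (m) => m.role === "tool" && m.content.length > TOOL_OUTPUT_LIMIT
  );
  if (hasVerboseToolOutput) return "collapse-tool-output";
  // Layer-1-style move: summarize older turns before hitting the hard limit.
  if (totalTokens > PROACTIVE_LIMIT) return "summarize-oldest";
  return "none";
}
```

The reactive rung (catching prompt_too_long and retrying) would wrap the API call itself rather than live in this planner.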
The System Prompt Dynamic Boundary
Claude Code splits its system prompt at a marker called __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Everything above is static (~3,000 tokens of behavioral instructions). Everything below is dynamic — per-session CLAUDE.md files, MCP server configs, active tools.
The static section gets cached globally via Blake2b hashing. One cache, amortized across every session. The dynamic section loads fresh each time.
This is not a small optimization. At scale, token costs compound fast. The right architecture caches everything that doesn't change and pays full price only for what must be fresh.
For client agents: segment your system prompt intentionally. Core persona, rules, and task framing stay static. User context, active workflow state, and retrieved data load after the boundary. You get caching on the expensive part for free.
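The split-and-hash idea is simple enough to sketch. The boundary marker is real; the helper names are mine, and I use sha256 where the leaked source uses Blake2b — the caching idea is identical either way:

```typescript
import { createHash } from "node:crypto";

const BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

// Split a system prompt into a cacheable static half and a fresh dynamic half.
function splitPrompt(prompt: string): { staticPart: string; dynamicPart: string } {
  const idx = prompt.indexOf(BOUNDARY);
  if (idx === -1) return { staticPart: prompt, dynamicPart: "" };
  return {
    staticPart: prompt.slice(0, idx),
    dynamicPart: prompt.slice(idx + BOUNDARY.length),
  };
}

// Cache key for the static section: one key, amortized across every session,
// no matter what CLAUDE.md files or MCP configs load below the boundary.
function staticCacheKey(prompt: string): string {
  return createHash("sha256").update(splitPrompt(prompt).staticPart).digest("hex");
}
```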
Permission Denials Feed Back Into the Model
This one surprised me. When Claude Code denies a tool call — user blocked it, hook rejected it, mode disallows it — it doesn't just stop. It wraps the denial as a tool result and feeds it back to the model.
Claude sees: "You tried to write to this file. Permission denied." It adjusts. It tries something else. It asks the user differently.
The system also tracks denial patterns across sessions and surfaces them in the IDE.
Most agent builders treat permission failures as terminal states. The agent hits a wall, errors out, stops. Claude Code treats them as information the agent should reason about.
This changes how you think about guardrails. Instead of blocking as a hard stop, block and explain. Let the agent negotiate within its constraints rather than fail silently.
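A minimal version of block-and-explain: wrap the denial in a tool-result shape the model will read on the next turn. The field names below follow the style of Anthropic's Messages API tool results; the wrapper function and its wording are my own sketch, not Claude Code's:

```typescript
type ToolResult = { tool_use_id: string; is_error: boolean; content: string };

// Instead of treating a blocked tool call as a terminal error, return it to
// the model as an error-flagged tool result it can reason about.
function denialAsToolResult(toolUseId: string, toolName: string, reason: string): ToolResult {
  return {
    tool_use_id: toolUseId,
    is_error: true,
    content:
      `Permission denied for ${toolName}: ${reason}. ` +
      `Do not retry the identical call; explain what you were trying to do or propose an alternative.`,
  };
}
```

The agent loop stays unchanged: the denial flows back through the same channel as any other tool output, and the model adjusts on its own.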
Asymmetric Session Persistence
User messages write synchronously. Assistant messages write asynchronously.
The reasoning is precise: if the process dies between user input and the API response, you need to be able to resume from where the user was. If user writes were async, the transcript could contain queue operations without the user message that triggered them, and resume would fail.
Synchronous user persistence guarantees resumability from the moment the user's message was accepted.
This is the kind of detail that only shows up after you've debugged data loss at 2am. The implementation encodes a specific failure mode that bit someone in production.
For any agent that needs to be resumable after crashes: persist user intent first, synchronously, before you do anything else.
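The asymmetry fits in a dozen lines. A sketch, assuming a JSONL transcript at a hypothetical path — the function names are mine:

```typescript
import { appendFileSync, promises as fsp } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const TRANSCRIPT = join(tmpdir(), "agent-transcript.jsonl"); // hypothetical path

// User intent persists synchronously: the turn proceeds only after this
// returns, so a crash at any later point can still resume from this message.
function persistUserMessage(content: string): void {
  appendFileSync(TRANSCRIPT, JSON.stringify({ role: "user", content }) + "\n");
}

// Assistant output persists asynchronously: a crash here loses at most a
// regenerable response, never the user's input.
function persistAssistantMessage(content: string): Promise<void> {
  return fsp.appendFile(TRANSCRIPT, JSON.stringify({ role: "assistant", content }) + "\n");
}
```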
The Verification Agent Pattern
The source contains a VERIFICATION_AGENT flag. When enabled, it spawns an adversarial sub-agent that reviews non-trivial changes before they go through.
It also ships explicit anti-hallucination rules for internal Anthropic engineers: things like "never claim all tests pass when output shows failures."
These are baked into the prompt as process.env.USER_TYPE === 'ant' branches — different instructions for internal vs. external users.
The verification agent pattern is underused in production builds. Single agents are wrong in specific, predictable ways. A second agent reviewing the first agent's output catches a different class of errors — especially on high-stakes actions like writing files, sending messages, or triggering external APIs.
I've started adding lightweight verification steps to client agent builds. Not full adversarial sub-agents for everything, but a review pass before any irreversible action. The prompt injection post I wrote yesterday covers why this matters for security — the verification step is also a defense layer.
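The shape of that review pass is a gate in front of anything irreversible. In this sketch the reviewer is a stub; in a real build it would be a second model call with an adversarial prompt, and every name here is hypothetical:

```typescript
type Action = { kind: "write_file" | "send_message" | "call_api"; detail: string };
type Verdict = { approved: boolean; reason: string };

// Stub reviewer. In production, replace with a second model call asking
// "does this action match the user's stated intent?"
async function reviewAction(action: Action): Promise<Verdict> {
  if (action.detail.includes("rm -rf")) {
    return { approved: false, reason: "destructive command" };
  }
  return { approved: true, reason: "consistent with intent" };
}

// The gate: irreversible actions execute only after passing review.
async function executeWithVerification(
  action: Action,
  execute: (a: Action) => Promise<void>
): Promise<Verdict> {
  const verdict = await reviewAction(action);
  if (verdict.approved) await execute(action);
  return verdict;
}
```

The design choice worth copying is that the verdict is returned, not swallowed, so the primary agent can be told why it was blocked — the same feedback pattern as the permission denials above.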
Memory Prefetch During Streaming
While the model is streaming a response, Claude Code prefetches CLAUDE.md memory files in the background, scoped with TC39 explicit resource management (the using keyword).
The model is talking. The system is loading the next context it will need. I/O latency hides behind generation latency.
This is production engineering applied to agents. The same principle that makes databases fast — prefetch what you'll need before you need it — applies directly to agentic workflows.
For agents with retrieval steps: start fetching context while the previous step is completing, not after. The user's perceived latency drops without any change to what you're actually doing.
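The overlap trick is just "start the promise early, await it late." A sketch with hypothetical function names, not Claude Code's actual prefetcher:

```typescript
// Kick off the memory read before consuming the stream, so file I/O overlaps
// with token generation instead of adding to perceived latency.
async function streamWithPrefetch(
  startStream: () => AsyncIterable<string>,
  loadMemory: () => Promise<string>
): Promise<{ response: string; memory: string }> {
  const memoryPromise = loadMemory(); // started now, awaited later
  let response = "";
  for await (const chunk of startStream()) response += chunk;
  const memory = await memoryPromise; // usually already resolved by this point
  return { response, memory };
}
```

If the stream takes two seconds and the memory load takes one, the user waits two seconds, not three.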
The Real Takeaway
There's a comment in query.ts warning about thinking block constraints. It's written in mock-medieval phrasing — something like "heed these rules well, young wizard" — explicitly designed to prevent future developers (and future models) from skimming past a subtle API requirement that cost an entire day of debugging.
That comment is more instructive than any architecture diagram. It says: the people building this have been burned by specific failure modes, and they build the memory of those failures directly into the code.
Production agents are not demos with better prompts. They are systems that have been hit by every edge case and are still running.
The leak shows what that looks like from the inside. Four-layer context management. Permission denials as feedback. Asymmetric persistence. Adversarial verification. Memory prefetch during generation.
None of it is magic. All of it is earned.
If you're building agents that need to run reliably in production — not just in demos — most of these patterns apply regardless of which model you're using. The harness is the product.
If you want a review of your current agent architecture against patterns like these, the async audit is the fastest way to get there. No meetings, written deliverable, 48-hour turnaround.