Claude Code in a Real SaaS Codebase

Claude Code is the first AI coding agent that has changed the shape of how I run a SaaS engineering loop, not just its speed. Used well, it shortens the time between intent and a verifiable diff from a day to minutes. Used badly, it produces plausible-looking pull requests that quietly break invariants nobody tested. The teams I see getting real value treat it as a junior engineer who can work unattended for twenty minutes — not as autocomplete. This is the wiring I use, where it breaks, and the five guardrails that decide whether it earns its keep.

Why this matters this week

Anthropic has been shipping changes to the agent stack on a near-weekly cadence. The Claude Agent SDK now exposes the same agent loop that powers Claude Code as a programmable interface in Python and TypeScript, with hooks like PreToolUse, PostToolUse, SessionStart, and UserPromptSubmit (Claude Agent SDK overview). Starting June 15, 2026, Anthropic separates Agent SDK usage from interactive subscription limits, meaning the marginal cost of running unattended agents drops further (Anthropic engineering: building agents with the Agent SDK).

The question for engineering leaders is no longer "should we use this?" It is "where does it earn its keep, and where does it generate cleanup work?" That is a CTO question, not a tooling question.

The loop and the tools

Claude Code's loop is plain: gather context, take action, verify work, repeat. The built-in toolset is the part that matters: Read, Write, Edit, Bash, Glob, Grep, and a search tool. Everything Claude Code can do in your terminal, an Agent SDK script can do unattended. That is the part that is both useful and dangerous.

The compact feature summarizes prior turns when the context window fills, so a long session does not lose state. Hooks intercept tool calls: you can refuse Bash on strings that look like production hosts, log every Edit to a file, or auto-format with Prettier on PostToolUse. Subagents let you delegate verification — one agent writes, another reads its diff and reports.

How I wire it into a SaaS codebase

The wiring on every project I run looks roughly the same. Five steps, in order.

Containerize the working surface. Claude Code runs against a containerized checkout of the repo with no host network access and no production credentials. The agent cannot reach prod even if I told it to.
Write CLAUDE.md per project. This is the file that compounds. Mine reads like a brief: we use Drizzle, never raw SQL except in `migrations/`; tests live in `tests/`, not next to source; when you finish a change, run `pnpm typecheck && pnpm test`. The teams getting value have CLAUDE.md files that read like onboarding docs. The teams not getting value have prompts.
Block dangerous tools with hooks. A PreToolUse hook that refuses Bash when the command contains `production`, `prod`, or any environment variable from `.env.production`. A PostToolUse hook that aborts if a file outside the allowed directory was touched. Hooks are the cheapest insurance you will ever write.
Use subagents for verification. The main agent writes. A second agent — a quieter one with a much smaller toolbox — reads the diff and asserts the contract: no new endpoints, no schema changes, tests still cover the affected file. The verification step is what separates a junior engineer from autocomplete.
Output to PRs, not to main. Every change Claude Code makes lands in a branch with a generated commit message and opens a PR. Review is a human step. Always. If the workflow does not end at a PR, it is not ready.

Where it actually breaks

Three places, in order of how often I have watched them go wrong.

Implicit context. Claude Code is brilliant when the project is described. It is confidently wrong when it is not. If your codebase has a convention only documented in muscle memory, Claude Code will violate it and write a paragraph explaining why its violation is fine. Fix: write the conventions down. The same exercise makes your team faster too.

Verification theater. "I ran the tests" is not the same as "I ran the tests and they passed." Agents will declare success on a build that printed errors mid-stream. Always end the loop with a Stop hook that re-runs tests and exits non-zero on failure.

Long-tail edits. The agent fixes the test it was asked to fix, then breaks two adjacent tests it did not read. This is the failure mode that ages the worst. Mitigation: scope tightly. Do not ask for "fix the failing tests"; ask for "fix the failing test in `tests/billing/usage.test.ts` and only edit files in `src/billing/`."

What this changes for a founder or CTO

Hires do not accelerate the same way Claude Code does. A senior engineer at full ramp delivers maybe twenty PRs a week. A senior engineer with Claude Code wired in delivers more — but the binding constraint changes. It becomes review capacity, not authoring capacity. If your engineering manager spent forty percent of their time on PR review last quarter, they will spend sixty percent next quarter.

That is not a complaint. It is a planning input. The two questions to answer this quarter: do you have the review bandwidth to absorb the throughput, and do you have the architectural clarity to make agent work productive instead of expensive? If the answer to either is no, the right move is not to adopt the tool faster. It is to fix the gap that is holding back adoption.

The second-order effect is what the agent gets delegated. The work I delegate confidently: boilerplate CRUD, test scaffolding, mechanical refactors (renames, dependency bumps, codemod-style migrations), config plumbing, and writing the first ninety percent of documentation. The work I do not delegate: anything that touches the data model, anything that crosses a service boundary, anything that affects money. Not because the agent will fail — because the cost of an error in those zones is asymmetric, and the time saved is not worth the audit trail you give up.

On cost economics: Anthropic's June 2026 change pulls Agent SDK usage out of the interactive subscription bucket, which matters because the long-running, unattended workflows are exactly the ones you want to scale without each engineer worrying about their own limit. Plan your rollout to use the SDK for batch jobs (test suite generation, codemod sweeps, PR review subagents) and the interactive Claude Code for the keyboard-driven loop.

My perspective

The thing I keep relearning from building chays.ai — my fractional-CTO automation platform built on persistent-memory agents — is that the value compounds with how well the project is described, not with how clever the prompt is. A CLAUDE.md that says "we use Drizzle, never raw SQL" saves an hour of review per PR. Vague prompts produce vague PRs. And, counterintuitively, do not let the agent write its own CLAUDE.md. It optimizes for being correct on the things it can see, not for the constraints you actually care about.

Recommended action this quarter

Pick one repo. Write one CLAUDE.md by hand. Wire one PreToolUse hook that blocks Bash on production strings. Run it for two weeks. Measure: average PR turnaround, percentage of PRs requiring a second pass, percentage of PRs needing rollback. That data tells you whether to scale it across the org or step back and fix the underlying clarity problem.

If you are scoping a Claude Code rollout for a team — wiring guardrails, deciding what to delegate, sizing the review bandwidth — and want a second pair of eyes, the next section is for you. I help fractional CTO engagements and audits navigate exactly this transition.

How I Wire Claude Code Into a Real SaaS Codebase (and Where It Breaks)