There are three kinds of codebases an engineer opens in 2026, and an agent feels each of them in its bones within about thirty seconds.
The first is a repo that was designed for agents from the first commit. Tiny files, feature-based folders, explicit interfaces, living CLAUDE.md rules, a test suite that locks behavior, constants instead of magic strings. You point Claude Code or Cursor at it and the thing moves. Features ship in an afternoon. Refactors don’t introduce duplication. The agent edits surgically because the surgery targets are small and well-labeled.
The second is a seven-year-old monolith written by humans who are mostly no longer at the company. Tribal knowledge lives in two engineers’ heads and a Confluence page that hasn’t been touched since the rewrite that never happened. A REST endpoint does three unrelated things because in 2019 it was convenient. An agent dropped into this repo will confidently rewrite the architecture, miss seventeen endpoints, and produce a pull request that looks finished. This is the failure mode the industry keeps writing about, and the uncomfortable detail is that the tests the agent wrote for itself usually pass.
The third is the one nobody talks about enough: a six-month-old repo that was vibecoded from the start, with no linter, no tests, no folder structure, no conventions, no CLAUDE.md. It’s not legacy. It’s not AI-native. It’s AI slop — produced by agents, inherited by a human, and now badly stuck in the middle.
These three cases get conflated in the “AI agents and legacy code” discourse, and the conflation matters because the right move is different in each one. Greenfield is about discipline. True brownfield is about strangulation. Vibecode inheritance is about installing the rules that should have been there on day one.
Which case are you in?
One question that actually decides it, and two that color in the picture.
The decisive question: does the repo have working guardrails? Linter, formatter, type checker, pre-commit hooks, tests on CI, a CLAUDE.md or equivalent. All of them, holding up under daily pressure, means greenfield. None of them — or theatrical versions that get bypassed before every release — means vibecode. Partial guardrails plus years of drift means true legacy. The presence and enforcement of guardrails is the single most predictive trait of how agents will perform in your repo.
The two coloring questions tell you how much rope the codebase still gives you. How old is the codebase, and who wrote most of it? Older than about three years and mostly written by humans who’ve moved on points to true legacy. Younger than about a year and mostly generated by agents points to vibecode. Fresh from commit one and designed for agents is greenfield, and you’ll know because you set it up yourself. Are the original authors still reachable? If yes, you have vibecode or greenfield latitude. If the tricky parts have already become folklore, you’re in legacy territory regardless of age.
The boundaries are real but soft. An eighteen-month-old AI-built repo with a half-installed linter sits on the seam between vibecode and legacy, and its playbook borrows from both. The guardrails axis still tells you where to start.
Greenfield AI-first: when the foundation does the work
When a codebase is designed around agents from the start, most of what agents are bad at stops happening. The shape below isn’t yet a proven industry standard, since there aren’t enough public, inspectable AI-native codebases to claim convergence. But a small cluster of starter kits, internal team writeups, and one detailed worked example (Anthropic’s Claude Code CLI) keeps producing the same pattern:
- Files are small. Aim for 200 lines, accept 300 as a ceiling; Anthropic’s own Claude Code CLI keeps roughly 64% of its files under 200 LOC across ~1,900 files and 512k LOC of TypeScript, with most of the rest falling under 300. (Disclosure: `agent-starter` is my own repo, and the guide I’m citing is my reverse-engineering of the Claude Code CLI structure, not Anthropic’s own docs.)
- Code is organized by feature, not by layer. In Claude Code, for instance, the agent has a `BashTool` — the module that lets it run shell commands on the user’s behalf — and a `BashTool/` directory holds everything related to it: the main implementation, its constants, the prompt the LLM sees when invoking it, its security checks, its UI rendering. All in one place, instead of scattered across `controllers/`, `models/`, `views/`. The same shape works for any feature: one folder, all the pieces.
- Strings are not magic. Error codes, tags, tool names, UI labels all live in a `constants/` module with sequential IDs and comments. Agents can grep them. Humans can grep them. Refactors don’t create orphaned literals.
- Errors are explicit and hierarchical. An `AppError` base class with typed subclasses beats `throw new Error("something went wrong")` every time, because agents pattern-match class hierarchies much more reliably than they reason about string contents. (A minimal sketch of the constants and error patterns follows this list.)
- Rules live in layered `CLAUDE.md` files. A root file holds project-wide conventions; `.claude/rules/*.md` holds domain-specific ones; local overrides handle module quirks. The agent composes them automatically.
- Tests exist and run fast. Characterization tests lock observed behavior so refactors stay safe.
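To make the constants and error bullets concrete, here is a minimal TypeScript sketch of the two patterns side by side. The names (`ERROR_IDS`, `AppError`, `BashCommandError`) are illustrative, not taken from Claude Code's actual source:

```ts
// constants/errorIds.ts: stable, greppable IDs for every typed error
export const ERROR_IDS = {
  BASH_COMMAND_EMPTY: "E001", // command string was empty
  BASH_COMMAND_DANGEROUS: "E002", // matched a dangerous-pattern check
} as const;

// errors/appError.ts: one base class, typed subclasses per feature
export class AppError extends Error {
  constructor(
    public readonly id: string, // one of ERROR_IDS
    message: string,
  ) {
    super(message);
    this.name = this.constructor.name; // subclass name survives in logs
  }
}

export class BashCommandError extends AppError {}

// An agent (or a human) can now grep for ERROR_IDS.BASH_COMMAND_DANGEROUS
// or match `err instanceof BashCommandError` instead of parsing messages.
export function rejectDangerous(cmd: string): never {
  throw new BashCommandError(
    ERROR_IDS.BASH_COMMAND_DANGEROUS,
    `blocked dangerous pattern in: ${cmd}`,
  );
}
```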
The speed gain once the foundation exists is hard to overstate, and the reason is mechanical rather than magical: the agent’s context window fits the thing it’s editing, so it doesn’t have to guess at the rest. Refactors stop introducing duplication because the agent can see the duplicate. Features compose cleanly because the interfaces are explicit. Long-term maintenance cost also drops, because the properties that make a codebase agent-friendly are mostly the same properties that make it human-friendly: modularity, explicitness, discipline.
The catch is upfront cost. Designing for agents is a skill. Writing a good CLAUDE.md hierarchy is a skill. Enforcing file-size limits and constants discipline is a skill. A team that hasn’t internalized these patterns will spend the first sprint unproductively. For most new projects in 2026, that sprint is worth it. For a throwaway prototype or a two-week experiment, it probably isn’t.
True brownfield: when the legacy is someone else’s
The 7-year-old monolith is the hard case, and the honest answer is: don’t just point an agent at it.
What actually works is the strangler fig, which has been around since Martin Fowler wrote about it in 2004 and turns out to be the correct playbook for AI agents too. You don’t rewrite. You wrap. You pick a module, document its behavior with characterization tests, extract a clean interface, replace the innards, then route traffic to the replacement. Repeat. Eventually the old code is surrounded and quiet, and you switch it off.
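One strangler step is small enough to show in full. A hedged TypeScript sketch, with every name (`BillingService`, `calcTotalsV1`, the env flag) invented for illustration: define the clean interface, adapt the legacy code behind it without touching it, and keep the routing switch in one place.

```ts
// billing/service.ts: the clean seam everything new will talk to
export interface BillingService {
  invoiceTotal(customerId: string): Promise<number>;
}

// billing/legacyAdapter.ts: wraps the monolith's code unchanged
import { calcTotalsV1 } from "../legacy/billing"; // hypothetical old module
export const legacyBilling: BillingService = {
  async invoiceTotal(customerId) {
    // Characterization tests pin this exact behavior before any swap.
    return calcTotalsV1(customerId).grandTotal;
  },
};

// billing/index.ts: the routing switch; flip per flag, delete it when done
import { newBilling } from "./newAdapter"; // hypothetical replacement
export const billing: BillingService =
  process.env.USE_NEW_BILLING === "1" ? newBilling : legacyBilling;
```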
The agent’s role in this is not architect. It’s laborer, stenographer, and first-pass reviewer. Agents are excellent at:
- Reading a large module and producing a dependency graph and module summary.
- Writing characterization tests from observed behavior.
- Doing mechanical refactors once the spec is written by a human.
- Spotting duplication and obvious security smells on a sweep.
They are bad at:
- Deciding what the right architecture is.
- Knowing which ugly bits are load-bearing because of some customer contract from 2018.
- Holding 50k+ LOC of context coherently across a multi-step refactor.
The practical pattern is: humans set direction and review, agents do the volume. Expect to spend a meaningful slice of the project on preparation before the agent writes a single line of new code: auto-documentation, dependency mapping, characterization tests, graph-RAG context tooling. None of it is glamorous, and skipping it is how you end up with a big confident PR that breaks a customer integration from 2018. That prep work is what converts a brownfield repo into something an agent can actually operate in.
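Some of that prep doesn't even need an agent. A first-pass dependency map for a TypeScript monolith, for instance, falls out of a short ts-morph script; a sketch, assuming a standard tsconfig.json at the repo root:

```ts
// scripts/dependency-map.ts: run with `npx tsx scripts/dependency-map.ts`
import { Project } from "ts-morph";

const project = new Project({ tsConfigFilePath: "tsconfig.json" });

for (const file of project.getSourceFiles()) {
  // Only local imports; package imports are noise for a module map.
  const localDeps = file
    .getImportDeclarations()
    .map((d) => d.getModuleSpecifierValue())
    .filter((spec) => spec.startsWith("."));
  if (localDeps.length > 0) {
    console.log(`${file.getFilePath()} -> ${localDeps.join(", ")}`);
  }
}
```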
The uncomfortable realization that usually arrives about a month in: the refactoring that makes the codebase agent-friendly is mostly the refactoring the codebase needed anyway. The agent is an excuse to pay down debt. If that’s how you frame the project to stakeholders, you’ll get the runway.
Vibecode inheritance: the case nobody talks about
Now the case I think is most common in 2026 and least written about: you inherit a six-month-old repo that was vibecoded from day one. No tests. No linter. Inconsistent naming. Files that are 1,200 lines long because the agent kept appending to them. Duplicated utility functions in four places because each feature generated its own. Zero documentation. A README that says “MVP.”
This is not legacy. The business logic is fresh. The total surface area is probably under 50k LOC. Whoever built it is either still on the team or recently gone, and either way the context hasn’t calcified into tribal knowledge yet.
This is the easiest of the three to fix, which makes it a strong refactoring candidate even though I have no survey data to put a number on the ROI.
The trap is treating it like legacy. Running a full brownfield strangler fig on a six-month-old MVP spends more total work on the refactor than on the original build, which is absurd. The right frame: this codebase never had guardrails, install them now, and let agents do the cleanup. How long that takes depends on your tooling and how much you parallelize, and I’m not going to pretend I can predict your wall clock. What I can give you is the ordering.
But shouldn’t we just rewrite?
Sometimes, yes. The honest answer is that for a sub-30k-LOC vibecoded MVP whose business logic is fully recoverable from one founder’s head, a from-scratch rewrite with the rules in place from commit one can beat refactor on calendar time. The refactor playbook below wins when (a) the codebase is closer to 50k LOC than 5k, (b) shipping has to continue during the cleanup, (c) there are paying users whose behavior the code has implicitly contracted to, or (d) the original authors are no longer reachable to answer “why does it do this.” If none of those are true, run the rewrite. The rest of this section is for the case where at least one of them is.
The other objection worth naming: “we should just keep shipping features.” Sometimes that’s right too. The refactor only earns its place when the velocity loss to slop is already costing you more than the cleanup will. If you’re not feeling that loss yet, schedule the cleanup for when you do, not now.
The playbook
What follows is opinionated and mine, no survey data behind it, just the sequence that respects the dependencies. It assumes a TypeScript-shaped stack because that’s what I work in, but the ordering — audit, baseline, scope CI to the diff, ratchet hooks, refactor by feature — translates directly to Python (ruff + mypy), Go (golangci-lint + gofmt + staticcheck), or anything else with a comparable toolchain. The target structure I cite is the agent-starter large-codebase best practices guide, my reverse-engineering of the Claude Code CLI; substitute your own house starter if you have one.
Phase 1 — audit and lock behavior. Point Claude Code at the repo with a prompt like this one:
```
Analyze this repository and produce:
1. A dependency graph (which modules import which).
2. Every file over 300 lines, with line counts and a one-line summary.
3. Duplicated functions or near-identical logic blocks across files.
4. Obvious security smells: raw SQL concatenation, shell exec with
   unvalidated input, hardcoded credentials, missing authn/authz.
5. A one-paragraph summary of each top-level module: what it does,
   what it depends on, what depends on it.
Output as markdown. Do not propose fixes yet.
```
Then, before changing anything, generate characterization tests on the critical paths:
```
Read src/<module>. For every exported function, write a test that:
- Calls it with realistic inputs taken from observed usage
  (logs, existing fixtures, production samples if you have them).
- Asserts the current observed output exactly.
- Does NOT change logic or fix apparent bugs.
Goal: lock current behavior so we can refactor safely.
Name them <function>.characterization.test.ts.
```
That is your safety net. Vibecode has none by default, and refactoring without one is how you ship silent regressions into production.
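What one of those generated tests should look like, using a hypothetical `slugify` helper (function, inputs, and expected outputs all invented; the point is the shape, not the fixture):

```ts
// slugify.characterization.test.ts: locks today's behavior, bugs included
import { describe, expect, it } from "vitest";
import { slugify } from "../src/utils/slugify"; // hypothetical helper

describe("slugify (characterization)", () => {
  it("matches observed output for a realistic title", () => {
    // Input pulled from logs; the assertion is whatever the code does
    // TODAY. We are pinning behavior, not asserting correctness.
    expect(slugify("Q3 Report: Final (v2)")).toBe("q3-report-final-v2");
  });

  it("preserves the current quirk on punctuation runs", () => {
    // Looks like a bug? Probably. Do not fix it in this phase.
    expect(slugify("a--b")).toBe("a---b");
  });
});
```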
Phase 2 — turn on the guardrails, without nuking CI. This is the part the original build skipped: lint, format, type-check, hooks, and CI gates all missing. Turning them on naively floods CI with thousands of violations that have nothing to do with the current work, the team disables the blocking almost immediately, and the guardrails become theater. The fix is to treat the existing mess as debt to drain, not errors to resolve before anything else can ship. This phase has five sub-steps, and they have to happen in this order:
First, an auto-fix pass. Run `biome check --write --unsafe` (or `eslint --fix` plus Prettier) across the whole tree in a single reviewable commit. No logic changes. This typically removes most violations in one pass because they were formatting, import order, and trivial syntax — never controversial, just never configured.
Then baseline the rest. Snapshot the remaining violations into a checked-in `.lint-baseline.json`, or use betterer, which does this natively. CI fails only when the violation count grows. The debt is visible (it’s in the repo) and bounded (it can only shrink).
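If you'd rather not adopt betterer, the baseline check is about thirty lines of script. A sketch: the `.lint-baseline.json` shape is invented here, and the parsing assumes Biome's JSON reporter output, so adjust for your linter:

```ts
// scripts/check-lint-baseline.ts: CI fails only if lint debt GROWS
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Invented baseline format: { "totalViolations": 1843 }
const baseline = JSON.parse(readFileSync(".lint-baseline.json", "utf8")) as {
  totalViolations: number;
};

let raw: string;
try {
  raw = execSync("biome check --reporter=json .", { encoding: "utf8" });
} catch (err) {
  // biome exits non-zero when violations exist; the report is still on stdout
  raw = (err as { stdout: string }).stdout;
}

// Assumed reporter shape: a top-level `diagnostics` array.
const current = (JSON.parse(raw).diagnostics as unknown[]).length;

if (current > baseline.totalViolations) {
  console.error(
    `Lint debt grew: ${current} violations, baseline is ${baseline.totalViolations}.`,
  );
  process.exit(1);
}
console.log(`Lint debt: ${current} of ${baseline.totalViolations} allowed. OK.`);
```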
Then copy templates and scope CI to the diff. You need a root `CLAUDE.md` with the target conventions (feature folders, 200/300-LOC file size, no magic strings, explicit error hierarchy, strict naming); Biome or typescript-eslint configured to enforce them; an error-ID registry module that gives every typed error a stable string ID; and empty `constants/`, `types/`, `schemas/`, `utils/` directories at the repo root for shared primitives. Then wire CI to lint the diff, not the tree: `biome check $(git diff --name-only origin/main...HEAD -- '*.ts' '*.tsx')`. A bug fix in module A never trips over violations in module B. Full-tree lint runs on a separate cadence as a trend indicator.
Then ratchet the hooks instead of hard-ceiling them. A pre-commit `check-file-size` hook gets a two-part rule: new files must be ≤300 lines; existing files already over 300 can still be edited, but the line count can only stay the same or shrink. Same shape for silent-catch detection: existing empty catches persist, new ones can’t be added. The code can’t get worse, and bugfixes don’t get blocked. A lint-on-edit hook runs per-edit so the agent’s own output bounces off the rules in real time.
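A minimal version of the file-size ratchet, as a Node script your pre-commit hook runs (husky, lefthook, or a plain .git/hooks/pre-commit all work). The ratchet-file format is invented for this sketch; seed it once in Phase 2 with the files currently over the limit:

```ts
// scripts/check-file-size.ts: ratchet, not ceiling
import { execSync } from "node:child_process";
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const LIMIT = 300;
const RATCHET = ".file-size-ratchet.json"; // { [path]: lastKnownLineCount }

const known: Record<string, number> = existsSync(RATCHET)
  ? JSON.parse(readFileSync(RATCHET, "utf8"))
  : {};

// Only staged TypeScript files: added, copied, or modified.
const staged = execSync("git diff --cached --name-only --diff-filter=ACM", {
  encoding: "utf8",
})
  .split("\n")
  .filter((f) => /\.tsx?$/.test(f));

let failed = false;
for (const file of staged) {
  const lines = readFileSync(file, "utf8").split("\n").length;
  if (lines <= LIMIT) {
    delete known[file]; // back under the limit: off the debt list
  } else if (!(file in known)) {
    console.error(`${file}: ${lines} lines. New files must be <= ${LIMIT}.`);
    failed = true;
  } else if (lines > known[file]) {
    console.error(`${file}: grew from ${known[file]} to ${lines} lines.`);
    failed = true;
  } else {
    known[file] = lines; // shrank or held: tighten the ratchet
  }
}

writeFileSync(RATCHET, JSON.stringify(known, null, 2)); // commit this file
process.exit(failed ? 1 : 0);
```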
Finally, tier the rule severities. The hard rules ship as errors immediately: silent catches, any in new code, file-size ratchet, magic numbers in new code. Softer rules (naming conventions, import order, complexity thresholds) ship as warnings now and get promoted to errors per-module as the refactor phase brings each module into compliance. By the end of the refactor the whole tree is under the full rule set because the refactors brought it there, not because CI forced it all at once.
Exit criterion for this phase: new code entering the repo is clean, and the existing mess is catalogued, bounded, and scheduled to drain. That’s the difference between guardrails teams keep and guardrails teams turn off the moment they block a release.
Phase 3 — feature-by-feature strangler fig, agent-heavy. Pick the first module by three criteria, in order: (1) which module has the highest change frequency in the last quarter (`git log --since=3.months --name-only --pretty=format: | sort | uniq -c | sort -rn | head -20` gets you a ranked list in one command); (2) which module has produced the most production incidents; (3) which module new team members ask the most questions about. The intersection of those three is where refactoring compounds fastest.
For each selected module, hand the agent this prompt:
```
Refactor <module-path> into the feature-folder format documented
in CLAUDE.md. Specifically:
1. Create src/<FeatureName>/ and move all related code into it.
2. Split the main file so each file stays under ~200 lines (300 max)
   and has a single responsibility. Target layout:
     <feature>.ts  — thin entry point, orchestration only
     constants.ts  — every string literal, error code, tag ID
     errors.ts     — typed error classes extending AppError
     types.ts      — input/output types
     validate.ts   — input validation
     security.ts   — if the feature has a security dimension
3. Replace magic strings with named constants.
4. Replace `throw new Error(...)` with typed error subclasses.
5. Break circular imports with dedicated tiny files that include
   a header comment explaining the break.
6. Update imports across the repo to point at the new paths.
7. Do not change test assertions.
Run the test suite after and report any failures before stopping.
```
What the transformation actually looks like on the ground, starting from a vibecoded `bashTool.ts` that accumulated responsibility over months:
```ts
// src/bashTool.ts — 847 lines
export async function runBashTool(cmd: string) {
  if (!cmd) throw new Error("no command");
  if (cmd.includes("rm -rf /")) throw new Error("dangerous");
  if (cmd.length > 10000) throw new Error("too long");
  if (!/^[a-zA-Z0-9 ./-]+$/.test(cmd)) throw new Error("bad chars");
  // ...800 more lines mixing validation, prompt construction,
  // subprocess exec, stdout streaming, UI rendering, and retry logic
}
```
becomes a feature folder that an agent can actually reason about:
```
src/tools/BashTool/
├── bashTool.ts   // thin entry, ~80 lines
├── constants.ts  // BASH_LIMITS, BASH_ERROR_IDS, tag strings
├── errors.ts     // BashCommandError extends AppError
├── security.ts   // dangerous-pattern checks
├── validate.ts   // input validation
├── execute.ts    // subprocess exec, streaming, retry logic
└── types.ts      // BashInput, BashResult
```
```ts
// src/tools/BashTool/bashTool.ts
import { validate } from "./validate";
import { checkSecurity } from "./security";
import { execute } from "./execute";
import type { BashInput, BashResult } from "./types";

export async function runBashTool(input: BashInput): Promise<BashResult> {
  validate(input);
  checkSecurity(input.cmd);
  return execute(input);
}
```
Every piece is now individually grep-able, individually testable, and small enough that the agent can hold the whole thing in context without guessing at the rest. The agent does the bulk of the mechanical work on a refactor like this; the human reviews, sets direction, and catches the load-bearing ugliness.
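For completeness, here is roughly where the original inline checks land after the split; a hypothetical reconstruction in the target shape, not actual Claude Code source:

```ts
// src/tools/BashTool/constants.ts
export const BASH_LIMITS = {
  MAX_COMMAND_LENGTH: 10_000, // was the inline `cmd.length > 10000` check
  ALLOWED_CHARS: /^[a-zA-Z0-9 ./-]+$/,
} as const;

export const BASH_ERROR_IDS = {
  EMPTY_COMMAND: "BASH-001",
  COMMAND_TOO_LONG: "BASH-002",
  BAD_CHARACTERS: "BASH-003",
} as const;

// src/tools/BashTool/validate.ts
import { BASH_ERROR_IDS, BASH_LIMITS } from "./constants";
import { BashCommandError } from "./errors";
import type { BashInput } from "./types";

export function validate(input: BashInput): void {
  if (!input.cmd)
    throw new BashCommandError(BASH_ERROR_IDS.EMPTY_COMMAND, "no command");
  if (input.cmd.length > BASH_LIMITS.MAX_COMMAND_LENGTH)
    throw new BashCommandError(BASH_ERROR_IDS.COMMAND_TOO_LONG, "too long");
  if (!BASH_LIMITS.ALLOWED_CHARS.test(input.cmd))
    throw new BashCommandError(BASH_ERROR_IDS.BAD_CHARACTERS, "bad chars");
}

// The "rm -rf /" dangerous-pattern check moves to security.ts as checkSecurity().
```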
Phase 4 — polish and verify. Run a full repo-wide agent pass with the new rules. Enforce file-size limits, constants, naming everywhere. Run the full test suite. Do a human architecture review. Update documentation.
Ongoing — Boy Scout Rule. Every future PR leaves the touched files compliant with the target format. That’s the only way the gains hold.
Three cases, not two
The two-way comparison people usually draw — greenfield versus brownfield — misses the vibecode case entirely. A three-way version is more useful:
| Dimension | AI-First Greenfield | Vibecoded Inheritance (~6mo) | True Legacy Brownfield (years) |
|---|---|---|---|
| Upfront cost | Lowest (scaffolding up front) | Medium (install the missing guardrails) | Highest (prep before any change) |
| Agent effectiveness | High from day one | Low now, high once guardrails are in | Low, stays low without deep prep |
| Risk of silent regression | Low — tests from day one | Medium — characterization tests first | High — tribal knowledge everywhere |
| Right agent role | Co-author | Cleanup crew | Laborer under a human architect |
| Right human role | Architect of rules | Installer of rules | Keeper of tribal knowledge |
| Dominant failure mode | Over-engineering the scaffolding | Treating it like legacy and over-investing | Treating it like vibecode and under-investing |
The vibecode column is where most teams actually are, and it’s the one most of the discourse misses. Legacy treatment is too heavy for it. Greenfield patterns won’t adopt themselves. What it needs is guardrails installed retroactively and a focused cleanup pass. After that, it behaves like greenfield.
What the next two years look like
Two long-term steady states are emerging — agent-native and unprepared-legacy — with vibecode sitting between them as a transition that resolves quickly in either direction. Most teams in the middle category will spend a quarter or two there, do the cleanup, and end up behaving like one of the steady states. Which of those names ends up applying to your codebase is going to matter in a few years.
Teams building greenfield AI-first right now share a specific shape: feature-folder organization, constants extracted into dedicated modules with explicit IDs, typed error hierarchies instead of string-based errors, layered CLAUDE.md rules that compose from root to module. The shape will keep spreading for a mechanical reason: agents make it cheaper to follow the rules than to break them. The pre-commit hook refuses the violation, the agent reruns with the correction, the human never has to argue about style in code review. Style debates were always a tax on the median PR; agents pay the tax automatically. Anthropic’s Claude Code CLI is the most-cited reference implementation today because it’s one of the largest public codebases shaped this way end-to-end, but it won’t be alone for long.
Teams sitting on true legacy will spend a long stretch doing strangler-fig modernization. The ones that invest in preparation and framing come out with codebases that are genuinely pleasant to work in. The ones that try to shortcut it by dropping agents on unprepared code produce the next generation of unmaintainable systems: AI-generated spaghetti layered on human-written spaghetti, which is somehow worse than either alone.
And teams that inherited vibecode — startups with six-to-eighteen-month-old MVPs, agencies cleaning up client handoffs, engineers hired specifically to “productionize” an AI-built prototype — those teams have the highest leverage and the shortest feedback loop of anyone in 2026. A focused cleanup pass turns AI slop into something that behaves like an AI-native codebase. Very few things you can do to a repo have that kind of return.
The through-line across all three: the codebase you’re editing was either designed for agents or it wasn’t, and if it wasn’t, the question is how much debt you have to pay to make it so. Greenfield pays once, before you ship. Vibecode pays in a single focused cleanup pass. Legacy pays continuously across a long modernization. But every one of them pays.
The unusual position in 2026 belongs to whoever just inherited a vibecoded repo. The code is fresh, so the business logic hasn’t calcified into folklore. The surface is small, so the whole thing fits in one sitting. The original builder is probably still reachable, so the “why” is recoverable. And the target structure has been reverse-engineered from at least one production AI-native codebase and written down. Four conditions for a clean refactor — context, scope, access, and a known-good blueprint — happen to line up at once, which won’t be true forever. The window closes the moment the original authors leave, or the repo crosses some hard-to-name complexity threshold. Today’s inherited AI slop is, for that narrow window, one of the highest-leverage refactoring projects available — provided the team takes the rewrite-vs-refactor question seriously and lands on the right side of it.