What is cost per verified outcome (CPVO)?

CPVO is team token spend in dollars (Claude API plus Claude Code, tagged to teams) divided by verified outcomes: merged changes that shipped and stayed shipped for at least N days without a revert, hotfix, or reopened ticket. It moves the invoice into the numerator where it belongs, instead of treating spend itself as a value signal.

Why are velocity and lines of code broken as AI metrics?

Both were proxies for human effort that only held because writing code and opening pull requests took real human time. AI severs that link. Lines become generated text and PRs open at machine speed, so the charts spike while the thing you care about goes unmeasured. They point up and to the right and can't see the game.

How do you pick the N-day survival window for CPVO?

Set N around your team's incident-detection horizon, meaning how long it typically takes to notice something is broken. Teams with fast rollback and tight observability land near 7 days; slower or more regulated environments push to 14 or 21. Once you pick it, freeze it, and treat any later change like a documented budget revision.

You're Measuring AI Spend, Not AI Value

How do you instrument CPVO?

Plumb two streams together: per-team token spend and per-team verified outcomes. For spend, use OpenTelemetry with team-tagged resource attributes, or paired SessionEnd/SessionStart hooks posting to your own collector. For outcomes, script git: pull merge commits and check the N-day window for reverts, hotfixes, and reopened tickets. A reference harness is at github.com/sneg55/cpvo.

The first time finance flagged our AI dev-tools line, I had to look at it twice. It had grown several times over what it was a year earlier, climbing month over month with the steady confidence of a thing nobody is watching closely. A Claude Code rollout, an Anthropic API line for an internal agent or two. Individually defensible. In aggregate, a number that now showed up on a slide with my name implicitly attached to it.

Then came the question every operator dreads, because it sounds so reasonable. “Is it working?”

I had an answer ready. We had a regular PM dashboard. Velocity was up, PR throughput was up, the charts pointed the right direction. I opened my mouth to say the things you say. And I stopped, because I already knew the velocity chart was a polite fiction. It had been one before the AI tools arrived. All the tools had done was make the line go up faster, which made the fiction look more like proof.

Here’s what I actually knew, standing in that room. The only clean number I had was the invoice. Precise to the cent, auditable, real. And it told me nothing about whether any of that money had produced a single thing worth shipping.

That gap is the whole problem. We can measure what AI costs us with perfect precision and we cannot measure what it buys us at all. So we let the cost stand in for the value, because the cost is the number we have. That substitution is a trap. Here’s what to put on the dashboard instead.

LOC and velocity were already noisy. AI broke them.

Look at any standard engineering dashboard: velocity charts, PR throughput, story points, tickets closed, lines committed. Two metrics sit underneath all of it, and AI broke them both. They were noisy proxies to begin with. AI severed the one assumption that made them useful at all, and now the noise reads like signal.

The first is lines of code. We’ve known LOC is a bad metric for decades, but we kept a loose mental version of it because it was a tolerable proxy for human effort. A person who wrote a lot of code was, roughly, a person doing a lot of work. The proxy was crude but the correlation was real, because writing code by hand is slow and a human can only type so fast. AI severs that link cleanly. The bottleneck on lines is gone. More lines now means more generated text, not more solved problems.

The second is velocity, and its cousins: PR count, story points, tickets closed. These were never measures of value. They were measures of human throughput dressed up as progress, and they held together only because opening a pull request took a human real effort, so the count tracked the work. Point an agent at your backlog and that assumption evaporates. PRs open at machine speed. The chart spikes. The spike means nothing. You’re watching a number that used to mean something become pure theater, in real time, and calling it a productivity gain.

We would never run cloud spend this blind. Nobody signs off on a compute bill that tripled by saying “well, the servers feel busy.” We’d want unit economics, cost per request, utilization. AI dev spend gets none of that scrutiny, and it’s now a bigger line than a lot of the infrastructure we obsess over.

So these charts aren’t merely unhelpful in the AI era. They’re actively misleading: every number points up and to the right while the thing you actually care about goes unmeasured. Those charts say you’re winning. They can’t see the game.

The link between AI spend and delivered value isn’t knowable from spend alone. Token burn measures consumption, not production. You have to instrument the join yourself, because no one is going to hand you the number.

The reframe: cost per verified outcome

Change the unit. Stop asking what you spent. Start asking what each unit of spend bought that survived contact with production. The metric that matters is cost per verified outcome. Call it CPVO. Cost is the token bill in dollars: Claude API spend plus Claude Code spend, tagged to teams. The denominator is what those tokens actually bought that stayed shipped.

          team token spend ($)
CPVO  =  ──────────────────────
           verified outcomes

CPVO isn’t an industry-standard metric. It’s a framing I’m proposing because nothing else fits, which means no vendor exposes it natively and you build it from data you already have. The invoice doesn’t disappear in the reframe. It moves into the numerator where it belongs. The invoice is meaningless as a value signal; as a numerator, divided by what it actually bought, it’s the right number. Put a survival-tested denominator under it and the invoice stops being something you defend and becomes something you can steer.

Compute it at the team level

CPVO at the team level: team-tagged token spend in dollars, divided by the team’s outcome weight. Don’t collapse it to one average and call it a KPI. Watch the distribution and the trend. A team with a healthy median CPVO and a long tail of wildly expensive outcomes has a different problem than a team whose whole distribution is drifting up. The shape tells you where to look.
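To make "watch the distribution" concrete, here is a minimal sketch. It assumes the distribution is taken over a team's weekly CPVO values (my reading, since per-change cost is not available), and the function names and sample figures are hypothetical.

```python
from statistics import median
from typing import Optional


def weekly_cpvo(spend_usd: float, verified: int) -> Optional[float]:
    """Dollars per verified outcome for one week; None if nothing verified."""
    return spend_usd / verified if verified > 0 else None


def summarize(weeks: list[tuple[float, int]]) -> dict[str, float]:
    """Median, worst week, and tail ratio over weekly (spend, verified) pairs."""
    values = sorted(
        v for spend, n in weeks if (v := weekly_cpvo(spend, n)) is not None
    )
    if not values:
        raise ValueError("no week has a verified outcome yet")
    mid = median(values)
    return {"median": mid, "worst": values[-1], "tail_ratio": values[-1] / mid}


# Hypothetical data: same median, very different shape.
steady = [(1000, 10), (1200, 12), (900, 9), (1100, 11)]
tailed = [(1000, 10), (1200, 12), (900, 9), (1500, 3)]
print(summarize(steady))
print(summarize(tailed))
```

Both teams show the same median, so a single average-style KPI would treat them alike. The tail ratio is one simple way to surface the expensive weeks that deserve a closer look.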

Alongside CPVO, track review tax: everything it costs to check what the agents produced. “Review” here means more than human reading time. The obvious component is human review-hours per AI-heavy PR versus per human-authored PR. A senior engineer can spend forty minutes picking through a 400-line agent-written PR that would have been a five-minute read had a colleague written it by hand. The cost of generating dubious code dropped to nearly zero, and someone still has to read all of it. Three other costs sit in the same gap. CI burns AI tokens of its own (review bots, test-generation passes, codegen-assisted linters), and those tokens are real money. Cloud spend on test, staging, and preview environments climbs with PR volume, since each PR drags its own pipelines and compute behind it. And the reviewer’s own AI usage rolls in: the senior engineer running Claude over the diff to help them read what the agent wrote is spending tokens just to keep up. None of this shows up in the original AI line item, and all of it scales with PR volume, which AI inflates. Measure it on your own corpus. It’s exactly the cost that hides in the gap between the spend report and the headcount line.

What “verified outcome” means

A verified outcome is a merged change that shipped and stayed shipped. It has to clear an observation window: the team picks N days, and a change only counts once it has been in production for at least N days without something coming back to bite. Outcome weight is the count of changes that cleared that bar over the period you’re measuring. CPVO uses that count as its denominator.

The shape of the window matters more than the exact number. CPVO is trailing by N days. Work shipped this week can’t be counted until N days from now. That’s the point. You’re measuring whether spend bought durable outcomes, and you can’t know that the day of merge. Pick N around your team’s incident-detection horizon. How long does it typically take to notice that something is broken? Teams with fast rollback and tight observability land around 7. Slower release cadences or more regulated environments push to 14 or 21. Most settle in that range. Once you’ve picked, freeze it. If N has to move later, document why and make the change public. Treat it like a budget revision. Quietly re-tuning the window is the first way this metric gets gamed.

What disqualifies an outcome is worth pinning down so two people on the same team don’t compute it differently. A revert is any git revert commit touching the original change’s files inside the N-day window. A hotfix is a follow-up PR labeled, tagged, or titled as a hotfix that touches the same area within 72 hours of the merge; catch the obvious ones first, then tune the rule on your own data. A reopened ticket is a linked issue (Jira, Linear, GitHub) that transitions from a done state back to open inside the N-day window. Be liberal about disqualifying: better to miss a few survivors than to claim outcomes you’d actually take back.
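The three disqualifiers can be written down as one function so everyone computes them the same way. This is a sketch under stated assumptions: the Merge and Event types are hypothetical, "same area" is approximated by file overlap, and reopen events are assumed to be pre-filtered to tickets linked to the merge.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Merge:
    merged_at: datetime
    files: frozenset


@dataclass(frozen=True)
class Event:
    kind: str                       # "revert", "hotfix", or "reopen"
    at: datetime
    files: frozenset = frozenset()  # left empty for ticket reopen events


def is_verified(merge, events, now, n_days=7, hotfix_hours=72) -> Optional[bool]:
    """True if the merge survived, False if disqualified, None if too early."""
    window_end = merge.merged_at + timedelta(days=n_days)
    if now < window_end:
        return None  # still inside the observation window, cannot count yet
    hotfix_end = merge.merged_at + timedelta(hours=hotfix_hours)
    for e in events:
        if e.at < merge.merged_at:
            continue
        overlaps = bool(e.files & merge.files)  # assumption: area == shared files
        if e.kind == "revert" and e.at <= window_end and overlaps:
            return False
        if e.kind == "hotfix" and e.at <= hotfix_end and overlaps:
            return False
        if e.kind == "reopen" and e.at <= window_end:
            return False
    return True
```

Returning None for merges still inside the window keeps the metric trailing by N days, instead of quietly counting fresh merges as survivors.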

That count alone isn’t enough. A high-volume team can hide a lot of failed work behind a large count of survivors, because failed changes never enter CPVO’s denominator. So track verified share (survivors divided by total attempted changes) next to CPVO, never instead of it. CPVO answers “what does a verified outcome cost us?” Verified share answers “are most of our changes actually surviving?”

                     verified outcomes
verified share  =  ──────────────────────
                     attempted changes

Outcome weight isn’t a subjective “did this change matter?” importance score. Importance is unmeasurable and infinitely spinnable. The only weighting is the objective survival test. Both numbers are coarse (low-volume teams will be noisy), so treat them as steering signals rather than an accounting ledger.

A worked example, illustrative. Team A ships 40 changes a month but eats 12 reverts and hotfixes: outcome weight 28, verified share 70 percent. Team B ships 30 clean: outcome weight 30, verified share 100 percent. At equal token spend, Team B has the better CPVO and the better quality signal, and the raw 40-vs-30 throughput chart hides both.
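The same example as arithmetic. The article only says spend is equal, so the $12,000 monthly figure below is a hypothetical placeholder.

```python
SPEND_USD = 12_000.0  # hypothetical: the example only says spend is equal

# team -> (changes shipped, reverts and hotfixes)
teams = {"A": (40, 12), "B": (30, 0)}

results = {}
for name, (shipped, disqualified) in teams.items():
    weight = shipped - disqualified   # outcome weight: survivors only
    share = weight / shipped          # verified share
    cpvo = SPEND_USD / weight         # cost per verified outcome
    results[name] = (weight, share, round(cpvo, 2))
    print(f"Team {name}: weight={weight} share={share:.0%} CPVO=${cpvo:,.2f}")
```

At this assumed spend, Team A lands near $429 per verified outcome and Team B at $400, even though Team A shipped ten more changes.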

Two more things to calibrate. First, compare CPVO inside release-cadence bands, not across them. A team shipping hourly with fast rollback is exposed to more failure events than a team batching weekly, and putting their CPVOs side by side is comparing apples to oranges. Second, set up a scope cordon for work the survival test doesn’t see: research spikes, throwaway prototypes, refactors that prevented an outage you can’t count. Cordon those off explicitly. Track what fraction of token spend lives in the cordon. If it’s small, CPVO covers most of the picture. If it’s a third or more, your dashboard is steering on a sample that excludes some of the highest-leverage use, and you need to say so.
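Tracking the cordon can be as small as one ratio. In this sketch the row shape (dollars plus a cordoned flag) is a hypothetical assumption, and the one-third threshold comes from the paragraph above.

```python
def cordon_share(spend_rows):
    """Fraction of token spend tagged as cordoned (spikes, prototypes, etc.)."""
    total = sum(usd for usd, _ in spend_rows)
    cordoned = sum(usd for usd, is_cordoned in spend_rows if is_cordoned)
    return cordoned / total if total else 0.0


# Hypothetical rows: (dollars, cordoned?)
rows = [(600, False), (250, True), (150, True)]
share = cordon_share(rows)
if share >= 1 / 3:
    print(f"{share:.0%} of spend is outside the survival test; say so on the dashboard")
```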

Is the AI bet paying off?

This is the top altitude, the question the board actually asked. Don’t reach for a cohort comparison. Most orgs are at roughly uniform AI adoption levels, and you won’t find two comparable teams with meaningfully different usage to put against each other. Run a trend on the same teams over time instead. Where was their CPVO and verified share two quarters ago, where is it now, and what direction is it moving? Same denominator as the team-level number, longer window.

If CPVO is dropping and verified share is holding or climbing, the bet is paying off. If CPVO is flat or drifting up while verified share slides, the spend isn’t producing, and you have a baseline to negotiate the budget against. The answer might be “faster cycle time, not fewer engineers,” or it might be a null result. A null result beats a vanity chart, because the null tells you to keep your money and the chart tells you to spend more on something that isn’t working.
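That reading rule is small enough to encode. In this sketch the 5 percent tolerance for what counts as "flat" or "holding" is my assumption, not part of the method, so tune it to your own noise level.

```python
def read_trend(prev_cpvo, cur_cpvo, prev_share, cur_share, tol=0.05):
    """Classify a two-period trend; tol is a hypothetical noise tolerance."""
    cpvo_down = (cur_cpvo - prev_cpvo) / prev_cpvo < -tol
    share_down = (cur_share - prev_share) < -tol
    if cpvo_down and not share_down:
        return "paying off"       # cheaper outcomes, survival holding or up
    if not cpvo_down and share_down:
        return "not producing"    # flat or rising cost, survival sliding
    return "mixed"                # signals disagree, look closer before deciding


print(read_trend(500, 400, 0.80, 0.82))
print(read_trend(400, 420, 0.80, 0.70))
```

The "mixed" bucket is deliberate: a team whose CPVO and verified share fall together, or hold together, needs a closer look and not a verdict.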

How to instrument it

You need two streams plumbed together: per-team token spend (the numerator) and per-team verified outcomes (the denominator).

For the token side, there are two viable options. Pick whichever fits your existing infra.

Option A: OpenTelemetry with team-tagged resource attributes. Claude Code and the Claude Agent SDK both ship with OTel instrumentation built in. Set CLAUDE_CODE_ENABLE_TELEMETRY=1, point OTEL_EXPORTER_OTLP_ENDPOINT at your collector, and inject OTEL_RESOURCE_ATTRIBUTES=tenant.id=<team> per repo or per developer. Token-count and cost metrics emit automatically, tagged with the team, and land in whatever backend you already run: Datadog, Honeycomb, Grafana, Langfuse, or a self-hosted collector. The team join happens server-side at query time. Requires an OTel pipeline; if you don’t have one stood up, that’s the real setup cost.
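One way to inject the team tag is a small launcher that builds the environment before starting the tool. This is a sketch: the helper name is hypothetical, only the three variables named above are set, and your exporter may need further OTel variables, so check the Claude Code monitoring docs.

```python
import os
from typing import Mapping, Optional


def telemetry_env(team: str, endpoint: str,
                  base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with team-tagged telemetry enabled."""
    env = dict(os.environ if base is None else base)
    env["CLAUDE_CODE_ENABLE_TELEMETRY"] = "1"
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    # Keep any existing resource attributes, but replace a stale tenant.id.
    attrs = [a for a in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(",")
             if a and not a.startswith("tenant.id=")]
    attrs.append(f"tenant.id={team}")
    env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(attrs)
    return env


# Hypothetical usage: pass the result as env= to subprocess.run([...]).
env = telemetry_env("payments", "http://localhost:4317", base={})
print(env["OTEL_RESOURCE_ATTRIBUTES"])
```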

Option B: Session hooks reporting to your own collector. Configure a paired set of hooks in .claude/settings.json. A SessionEnd hook fires when a session ends, reads the session’s token totals, and POSTs them to an internal endpoint with the developer’s identity. A SessionStart hook on the next launch sweeps the on-disk transcript directory for any prior session that exists locally but was never reported (laptop slept, terminal force-closed, process killed mid-stream) and POSTs those too. Together they catch the abrupt-exit gap without a separate cron. Map identity to team server-side. Ship the hook config in the repo so every dev who pulls it picks the hooks up automatically. The collector is a thirty-line endpoint that writes to whatever store you already use. No OTel infra required, but you own the script.
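For a sense of scale, here is roughly what such a collector can look like using only the Python standard library. The payload fields (session_id, user, input_tokens, output_tokens), the port, and the in-memory store are all hypothetical; the hooks would need to send whatever shape you settle on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUIRED = {"session_id", "user", "input_tokens", "output_tokens"}  # assumed schema
SESSIONS: dict = {}  # stand-in for whatever store you already use


def accept(payload, store) -> str:
    """Store one session report; returns 'stored', 'duplicate', or 'invalid'."""
    if not isinstance(payload, dict) or not REQUIRED.issubset(payload):
        return "invalid"
    if payload["session_id"] in store:
        return "duplicate"  # the SessionStart sweep may re-send a session
    store[payload["session_id"]] = payload
    return "stored"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length") or 0)
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:
            payload = None
        status = accept(payload, SESSIONS)
        # Duplicates are acknowledged so the hook does not keep retrying.
        self.send_response(400 if status == "invalid" else 204)
        self.end_headers()


def serve(host: str = "127.0.0.1", port: int = 8787) -> None:
    HTTPServer((host, port), Handler).serve_forever()
```

Deduplicating on the session identifier matters because the paired hooks can report the same session twice. Call serve() to run it; mapping user to team stays server-side, as described above.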

A useful complement to either option: provision one Anthropic API key per team, named after the team. The Console’s per-key usage page becomes a per-team usage page you can sanity-check against your measurement pipeline, and finance gets a clean per-team line on the invoice without depending on it.

For the outcomes side, the work is in git. For each team’s repo, pull merge commits over the measurement period. For each merge, check the N-day window after it: any git revert commit touching the same files? Any hotfix-labeled or hotfix-titled PR in the same area within 72 hours? Any linked ticket reopened inside the window? If none of those, the merge is a verified outcome. GitHub’s REST or GraphQL API gives you all of it in a few paginated calls; GitLab and self-hosted are the same idea on a different endpoint. The denominator falls out of a script you write once.
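A starting point for the git side, as a sketch and not the reference harness. It assumes a merge-commit workflow (squash merges would need a different query) and flags reverts by git's default 'Revert "..."' subject line, which is only a heuristic. File-overlap, hotfix, and ticket checks would layer on top.

```python
import subprocess
from datetime import datetime

LOG_FORMAT = "%H%x09%cI%x09%s"  # hash, ISO 8601 commit date, subject (tab-separated)


def parse_log(text: str) -> list:
    """Turn `git log` output in LOG_FORMAT into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, when, subject = line.split("\t", 2)
        rows.append({
            "sha": sha,
            "at": datetime.fromisoformat(when),
            "subject": subject,
            # Heuristic: default subject written by `git revert`.
            "looks_like_revert": subject.startswith('Revert "'),
        })
    return rows


def git_log(repo: str, since: str, merges_only: bool = True) -> list:
    """List commits since a date; merges for candidates, all commits for reverts."""
    cmd = ["git", "-C", repo, "log", f"--since={since}", f"--format={LOG_FORMAT}"]
    if merges_only:
        cmd.append("--merges")
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_log(out)
```

Run git_log(repo, since) for the candidate merges, then git_log(repo, since, merges_only=False) and keep the rows with looks_like_revert set to find revert commits inside each merge's window.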

Use weekly grain. The goal is finding the region where high burn meets low outcome, and weekly resolution surfaces that clearly over a few weeks. Per-penny accuracy isn’t what this is for.

A starter dashboard is four cuts: tokens (and dollars) per team per week, outcome weight per team per week, verified share per team per week, CPVO trend. Four cuts, one screen.

┌──────────────────────────────┬──────────────────────────────┐
│ tokens & $ per team / wk     │ outcomes per team / wk       │
│                              │                              │
│       ▁▃▅▆▇▇▆▆▇              │       ▃▅▆▇▆▆▇▆▇              │
├──────────────────────────────┼──────────────────────────────┤
│ verified share per team / wk │ CPVO trend ($ / outcome)     │
│                              │                              │
│   100% ─ ─ ─ ─ ─ ─           │       ▇▇▆▅▅▄▄▃▃              │
│    80% ▇▇▆▇▇▆▇▇              │                              │
└──────────────────────────────┴──────────────────────────────┘
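The data behind those four cuts is one join at weekly grain. This sketch assumes hypothetical row shapes, (team, date, dollars) for spend and (team, merge date, verified flag) for outcomes, and buckets both by ISO week.

```python
from collections import defaultdict
from datetime import date


def week_key(team: str, day: date) -> tuple:
    year, week, _ = day.isocalendar()
    return (team, year, week)


def weekly_cuts(spend, outcomes) -> dict:
    """Join spend and outcomes per (team, ISO year, ISO week)."""
    cells = defaultdict(lambda: {"usd": 0.0, "attempted": 0, "verified": 0})
    for team, day, usd in spend:
        cells[week_key(team, day)]["usd"] += usd
    for team, day, verified in outcomes:
        cell = cells[week_key(team, day)]
        cell["attempted"] += 1
        cell["verified"] += 1 if verified else 0
    for cell in cells.values():
        # None marks weeks where the ratio is undefined, not zero.
        cell["share"] = cell["verified"] / cell["attempted"] if cell["attempted"] else None
        cell["cpvo"] = cell["usd"] / cell["verified"] if cell["verified"] else None
    return dict(cells)


# Hypothetical rows for one team in one week.
spend = [("a", date(2024, 1, 1), 100.0), ("a", date(2024, 1, 3), 50.0)]
outcomes = [("a", date(2024, 1, 2), True), ("a", date(2024, 1, 4), False)]
print(weekly_cuts(spend, outcomes))
```

Bucketing outcomes by merge week is a choice on my part. Remember that a week's verified count only becomes final N days after that week ends.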

If you’d rather start from running code than a blank script, I put a small reference harness on GitHub: github.com/sneg55/cpvo. Clone it, run cpvo demo, and the whole method runs end to end on synthetic data: per-team CPVO with its distribution and expensive tail, verified outcomes scored by the same revert, hotfix, and reopened-ticket survival test described above, and the four-cut starter dashboard. It joins per-team spend to git outcomes at weekly grain, exactly the join this section describes, and it refuses to invent the data it doesn’t have, so per-PR cost is deliberately unrepresentable. It’s a measurement reference, not a product. Point it at your own token exports and git history when you’re ready.

The Monday playbook

Three steps to start.

First, set up team-tagged token reporting: either OTel with tenant.id resource attributes, or paired SessionEnd/SessionStart hooks posting to your own collector, whichever fits your existing infra. Then pull one month of team-tagged token usage plus the team’s merged-outcome counts and revert/hotfix signals from git, at weekly grain. The git side you have today. The team-tagging is a half-day with either approach, and its absence is the only reason most orgs say they “can’t” measure this.

Second, compute team CPVO and verified share. Look at the shape: distributions and trends. The average is exactly where the signal goes to hide.

Third, set a token budget per team and review it monthly. Scale up the teams where the unit economics work, pull back where they don’t.

The point of all of this was never to spend less on AI. Spending less is easy; you just turn it off. The point is to know what you’re buying. Right now the only number most of us can defend is the invoice, and the invoice is the one number that tells us nothing. Fix that, and “is it working?” stops being the question you dread and becomes the one you can finally answer.