Why use a different vendor's model for code review instead of the same one that wrote the code?

A model reviewing its own output shares the blind spot that produced the bug. Errors from one training distribution are correlated, the model is biased toward its own choices, and an agent that also wrote the tests will watch those tests pass. A rival lab's model has different data, different post-training, and a different set of confident-wrong zones, so it fails in places yours does not. The value is not that two checks beat one but that the two checks are decorrelated, which is what independent verification actually requires.

What is codex-plugin-cc and what does it do?

It is a plugin that lets you run OpenAI's Codex from inside Claude Code, using your local Codex CLI, authentication, and config rather than a separate runtime. It exposes review as a first-class step through three surfaces: /codex:review for a plain read-only review of your diff or branch, /codex:adversarial-review for a steerable review that questions the design, and an optional Stop-gate hook that runs a Codex review on Claude's last turn and blocks the turn from ending if it finds a blocking issue. It also supports delegating tasks to Codex, but the review path is where cross-vendor review earns its keep.

What is the review gate and what are its risks?

The review gate is a Stop hook that runs a targeted Codex review of Claude's previous turn whenever that turn made code changes. If Codex returns BLOCK, the stop is refused and Claude keeps working until the issue is addressed. This makes verification the default rather than a step you remember to run. The risks are direct: it can create a loop between Claude and Codex that drains usage limits quickly, and a clean review is not a proof of correctness, so it can hand you false confidence. It is best enabled only when you are actively watching the session.

Does cross-vendor review replace the human reviewer?

No. It displaces self-review, not the reviewer. A second model with a decorrelated failure surface catches more than a single model can, but it still cannot own the ship decision, weigh product tradeoffs, or take responsibility for what goes out. The honest framing is that this pushes verification down to your own keyboard and makes it cheap, while the harder question of whether models can verify each other well enough to remove the human stays open.

Two Labs, One Desk: Why Your Code Reviewer Should Be a Different Model

Two labs are now checking each other’s work at my desk, and neither one knows it is being graded by the other.

The setup fits in a sentence. Claude writes the code. Codex reviews it. The part that matters is the part that sounds like a footnote: they do not share a brain. One was trained by Anthropic, the other by OpenAI, on different data with different post-training, and that difference is not an inconvenience to route around. It is the entire reason the arrangement works.

A model cannot audit itself

Ask a model to review the code it just wrote and you get a check that looks rigorous and is quietly hollow. The problem is not effort or capability. It is correlation.

A model’s mistakes are not random. They cluster in the places its training left thin: the same misread of an API contract, the same optimistic assumption about what a caller passes in, the same confident blind spot around a race that only shows up under load. When the same model reviews its own output, it walks back through those exact clusters carrying the exact priors that produced them. The bug and the review come from one distribution. If the model was wrong the first time, it tends to be wrong in agreement the second time.

There is a sharper version of this that anyone running agents has already met. An agent writes a feature, writes the tests for that feature, and the tests pass. Of course they pass. They were written by the same process that wrote the code, encoding the same misunderstanding of what correct means. Green checkmarks certify internal consistency, not correctness, and a self-reviewing model is just a slower way of producing the same false green.

Independent verification is supposed to break that loop. But independence is doing real work in that phrase, and a second pass from the same model is not independent. It is the same author with a fresh coat of confidence.

Why a rival lab, not a second Claude

So run the review through a model that fails somewhere else.

A different vendor’s model has a different failure surface. Different training data means different gaps. Different post-training means different instincts about what looks safe. The zones where Codex is confidently wrong are not the zones where Claude is confidently wrong, because nothing about their construction lines them up. Point one at the other’s output and each one’s blind spots land on the other’s well-lit ground.

I want to be careful about the strength of this claim, because it is easy to oversell. I have no hit rate to offer you, no measured percentage of bugs caught, and I would not trust one if I generated it. The claim is structural, not empirical. It says the errors of two models from two labs are less correlated than the errors of one model with itself, and that lower correlation is what independent verification needs to mean anything. Even a second pass from the same model helps a little, because a fresh context drops some of the momentum behind the first answer. Cross-vendor review is the stronger version of that same move: not a different mood, a different mind.
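To make the structural claim concrete, here is a minimal sketch of the arithmetic behind it. It models each reviewer's miss as a yes/no event and uses the standard identity for two correlated Bernoulli variables. The function name is mine, and the miss rates and correlations are made-up illustrations, not measurements of any model.

```python
import math


def joint_miss_probability(p_a: float, p_b: float, rho: float) -> float:
    """Probability that reviewers A and B both miss the same bug.

    p_a and p_b are each reviewer's own miss rate; rho is the Pearson
    correlation between the two miss events.
    Identity used: P(A and B) = p_a * p_b + rho * sd_a * sd_b.
    """
    for name, p in (("p_a", p_a), ("p_b", p_b)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")
    sd_a = math.sqrt(p_a * (1.0 - p_a))
    sd_b = math.sqrt(p_b * (1.0 - p_b))
    joint = p_a * p_b + rho * sd_a * sd_b
    # Not every rho is achievable for given miss rates; reject impossible inputs.
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    if joint < lower - 1e-12 or joint > upper + 1e-12:
        raise ValueError("rho is not achievable for these miss rates")
    # Clamp away floating-point noise at the boundaries.
    return min(max(joint, lower), upper)


if __name__ == "__main__":
    # Illustrative only: both reviewers miss 20% of bugs on their own.
    for rho in (1.0, 0.5, 0.0):
        print(rho, round(joint_miss_probability(0.2, 0.2, rho), 3))
```

With those invented inputs, a perfectly correlated second reviewer adds nothing (both miss 20% of the time), while an uncorrelated one cuts the shared miss rate to 4%. Where two real models sit between those ends is exactly the part nobody has measured here.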

The obvious objection is that one of the two models is simply better, so you should use the better one for everything and stop dressing it up. That gets the goal wrong. This is not a contest to rank the models. If it were, the loser would have nothing to add. The asset here is difference, and a worse model that fails in unfamiliar places still surfaces things a stronger, more familiar one waves through. You are not buying a second opinion from a smarter friend. You are buying a second opinion from a stranger.

What makes it cheap

None of this is new in principle. You could always copy a diff into another tool, in another window, signed into another account, and read back what it thought. Almost nobody did it on every change, because the friction was exactly high enough to make skipping it the rational choice.

The codex-plugin-cc plugin removes that friction by moving the reviewer into the room. It runs Codex from inside Claude Code through your local Codex CLI and app server, using the same install, the same authentication, and the same config you would use if you called Codex directly. There is no second runtime and no context switch. The rival model becomes a slash command in the session you are already in, which is the difference between a practice you admire and a practice you actually keep.

Three surfaces, rising in aggression

The plugin exposes cross-vendor review at three levels of confrontation, and they are worth separating because they answer different questions.

The mild one is /codex:review. It runs a plain read-only review of your uncommitted changes, or of your branch against a base with --base main, and reads much like running Codex’s own review directly. It is the daily driver: a competent outside read of the diff, no theater.

The pointed one is /codex:adversarial-review. This is a steerable review whose job is not to validate the change but to attack it, and the prompt behind it does not hedge about that. It tells Codex its role is to “break confidence in the change, not to validate it,” to “default to skepticism,” and to give “no credit for good intent, partial fixes, or likely follow-up work.” If something only works on the happy path, it is instructed to treat that as a real weakness. You can aim it, too, appending focus text so it questions a specific decision: the caching strategy, the retry design, whether the whole approach was the safe one. This is the surface you reach for before shipping something you are slightly too proud of.

The aggressive one is the Stop-gate. Enabled through /codex:setup --enable-review-gate, it installs a Stop hook that fires when Claude tries to end a turn. If that turn made code changes, Codex reviews them, and its first line is forced to be either ALLOW or BLOCK. On a BLOCK, the stop is refused, and Claude cannot walk away from its own work until the issue is dealt with. This is the real inversion. Review stops being a step you remember and becomes the default condition of finishing. The author does not get to decide it is done. The other lab does.
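The ALLOW-or-BLOCK contract is simple enough to sketch. What follows is not the plugin's code, only an illustration of how a gate could read a verdict from the first line of a review. The names parse_verdict and may_stop are hypothetical, and skipping leading blank lines, matching case-sensitively, and rejecting anything else as malformed are my assumptions.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"


def parse_verdict(review_text: str) -> Verdict:
    """Read the verdict from the first non-empty line of a review.

    Assumption: the verdict is the first word of that line, optionally
    followed by punctuation such as "BLOCK: reason".
    """
    for line in review_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        first_word = stripped.split()[0].rstrip(":.,")
        try:
            return Verdict(first_word)
        except ValueError:
            # Assumption: a review that does not open with a verdict is malformed.
            raise ValueError(f"expected ALLOW or BLOCK, got: {stripped!r}") from None
    raise ValueError("empty review")


def may_stop(review_text: str) -> bool:
    """True only when the reviewer explicitly allowed the turn to end."""
    return parse_verdict(review_text) is Verdict.ALLOW


if __name__ == "__main__":
    print(may_stop("ALLOW\nNo blocking issues found."))
    print(may_stop("BLOCK: retry loop has no upper bound"))
```

Raising on a malformed review rather than defaulting to ALLOW is a deliberate choice in this sketch: a gate that quietly passes what it could not parse is not much of a gate.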

The bounded version: loop until it goes quiet

There is a fourth pattern that sits between running a review by hand and handing the Stop-gate the keys, and it is the one I reach for most. Claude Code’s /loop repeats a command or instruction, self-paced, until a condition you set is met. Point it at the reviewer:

/loop run /codex:review, address every finding, and continue until there are no critical or high findings left

Now the two labs converge without me in the middle. Claude makes a change, Codex reviews it, Claude fixes what comes back, Codex reviews the fix, and the cycle repeats until the rival model stops raising anything serious. The difference from the Stop-gate is the exit condition. You name the bar (no critical or high findings) instead of blocking on every objection, so the loop has a defined place to stop instead of arguing forever over a nit. It is the Stop-gate with a thermostat: run hot until the serious findings are gone, then quit.

The same shape composes with the sharper surfaces, and this is where it earns its place in a real workflow. Before I open a pull request, I point the loop at the adversarial reviewer and compare the whole branch against its base instead of the working diff:

/loop run /codex:adversarial-review --base main, address every finding, and continue until no critical or high findings remain

That reviews everything the branch changed, not just what is currently uncommitted, and keeps the skeptical prompt in the seat the entire way down. When I already know where the risk lives, I aim it and let it grind on that one area until it goes quiet, since the adversarial review carries any focus text you append after the flags:

/loop run /codex:adversarial-review --base main challenge the auth and rollback paths, fix what it finds, and continue until there are no critical or high findings

The focus text rides along on every pass, so each iteration re-attacks the same soft spot from the rival model’s angle rather than wandering off into cleanup. The catch on all of these is that you have explicitly told the thing to keep going, so the costs in the next section apply with the volume turned up.

Where it bites

The Stop-gate is also where the honesty has to come in, because the same feature that makes verification automatic makes two expensive failure modes automatic with it.

The first is the loop. Claude edits, Codex blocks, Claude edits again, Codex blocks again, and you are now paying two labs to argue across the table while the token meter runs on both sides. The plugin’s own documentation warns that the gate can drain usage limits quickly and should only be enabled when you are actively watching the session, which is the correct warning and worth repeating. An automated adversary that never tires is a gift right up until it will not let a reasonable change through.

The second failure is quieter and worse. A clean cross-review feels like proof, and it is not. Codex signing off means one model with one set of blind spots did not object. That is genuinely more than self-review buys you, but it is not a correctness guarantee, and the danger is that the ceremony of a second lab’s approval makes you check less carefully yourself. This lands exactly where I ended up in The Verification Economy: execution keeps getting cheaper while verification bends slowly, and the open question underneath all of it is whether the verifier itself can ever be trusted to automate. Two models checking each other does not answer that question. It makes it concrete enough to run into every day.

Running the loop so it does not run you

A loop you leave unattended needs a few habits, and all of them fall out of one fact: the state that matters lives in the working tree, not in the conversation. Every pass of the reviewer re-reads the actual diff and re-derives its findings from the files, so the chat history is disposable. Once you take that seriously, the rest follows.

Compact freely between passes. Because each review recomputes its findings from the code, you can summarize the conversation or even restart the session and the next pass loses nothing it cannot rebuild. That is what keeps a long loop from getting quadratically expensive, since otherwise every round drags the full weight of the last one along with it. The one thing worth preserving through a compaction is not the model’s own account of what it fixed, which is just the self-review you already decided not to trust, but the factual ledger: the count of critical and high findings per pass, and any finding that reappeared after being marked done. An instruction like “/compact keep the findings tally and anything that came back” retains the signal that matters without anchoring the model into believing its own prior verdicts.
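The factual ledger described above is a small data structure. Here is a minimal sketch, assuming each finding can be given a stable identifier (for example a file plus a rule name); that identifier scheme, the class name, and the method names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FindingsLedger:
    """Facts worth keeping across a compaction: counts per pass and regressions."""

    counts_per_pass: list[int] = field(default_factory=list)
    reappeared: set[str] = field(default_factory=set)
    _open_last_pass: set[str] = field(default_factory=set)
    _resolved: set[str] = field(default_factory=set)

    def record_pass(self, serious_finding_ids: set[str]) -> None:
        """Record the critical/high finding IDs reported by one review pass."""
        current = set(serious_finding_ids)
        # A finding that had gone away in an earlier pass and is back has reappeared.
        self.reappeared |= current & self._resolved
        # Anything open last pass but absent now counts as resolved, for now.
        self._resolved |= self._open_last_pass - current
        self._resolved -= current
        self._open_last_pass = current
        self.counts_per_pass.append(len(current))

    def summary(self) -> str:
        """One line that is safe to carry through a compaction."""
        came_back = ", ".join(sorted(self.reappeared)) or "none"
        return f"critical/high per pass: {self.counts_per_pass}; reappeared: {came_back}"


if __name__ == "__main__":
    ledger = FindingsLedger()
    ledger.record_pass({"auth.py:token-expiry", "db.py:race"})
    ledger.record_pass({"db.py:race"})
    ledger.record_pass({"auth.py:token-expiry"})  # the first finding came back
    print(ledger.summary())
```

Note that the ledger stores only what the reviewer reported, never the author model's claims about what it fixed, which keeps it on the right side of the self-review line.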

That reappearance signal is also your real stopping rule. The danger is not too many passes but oscillation, where pass six quietly reintroduces what pass five fixed and the counter never reaches zero. So stop when the critical and high count stops strictly falling, not just when some fixed number of rounds is up. Keep a hard cap underneath that as a backstop, sized to the change (a handful of passes for a small diff, more for a large branch), but treat the cap as a smoke alarm rather than a finish line. Reaching it means the loop got stuck, so it should hand the leftover findings to a person, never mark the change clean. A cap that silently ships code with known unaddressed findings is worse than no gate, because now there is a green check sitting on top of the bug.
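The stopping rule fits in one function. This is a minimal sketch under my own naming: next_action and its three return values are hypothetical, and the input is simply the critical-plus-high count from each pass so far.

```python
def next_action(serious_counts: list[int], max_passes: int) -> str:
    """Decide what the loop does after the latest review pass.

    serious_counts holds the critical+high count from each pass so far.
    Returns "clean", "continue", or "escalate" (hand leftovers to a person).
    """
    if not serious_counts:
        return "continue"  # nothing has been reviewed yet
    latest = serious_counts[-1]
    if latest == 0:
        return "clean"
    # Stalled or oscillating: the count did not fall strictly since the last pass.
    if len(serious_counts) >= 2 and latest >= serious_counts[-2]:
        return "escalate"
    # The hard cap is a backstop; hitting it with findings open is never "clean".
    if len(serious_counts) >= max_passes:
        return "escalate"
    return "continue"


if __name__ == "__main__":
    print(next_action([3, 1, 0], max_passes=5))
    print(next_action([3, 2, 2], max_passes=5))
    print(next_action([5, 4, 3], max_passes=3))
```

The important property is that there is no path from a non-zero count to "clean". A raw count can still fall while an old finding sneaks back in, so in practice this check would sit alongside a record of which findings reappeared.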

Then tell the loop to keep a log, written somewhere the reviewer will not turn around and review. It is the same artifact three separate needs already point to: the durable memory that survives a restart, the record of counts your stopping rule reads, and the audit trail a team rollout will need. Just stay clear about what the log is. It is testimony, a narrative of how you got to clean, not proof that you are. The proof is the last pass that re-read the code and found nothing. The log only tells you the story of getting there.
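One plausible shape for that log is an append-only file of JSON lines kept outside the working tree. The sketch below assumes that layout; the field names, the helper names, and the choice of JSON Lines are mine, and the temporary directory in the demo merely stands in for whatever out-of-tree location you pick.

```python
import json
import tempfile
import time
from pathlib import Path


def append_log_entry(log_path: Path, pass_number: int, critical: int,
                     high: int, note: str = "") -> dict:
    """Append one review pass to the log as a single JSON line."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "pass": pass_number,
        "critical": critical,
        "high": high,
        "note": note,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_counts(log_path: Path) -> list[int]:
    """Recover the critical+high count per pass, e.g. after a session restart."""
    with log_path.open(encoding="utf-8") as handle:
        entries = [json.loads(line) for line in handle if line.strip()]
    return [entry["critical"] + entry["high"] for entry in entries]


if __name__ == "__main__":
    # Stand-in for a directory outside the repository being reviewed.
    with tempfile.TemporaryDirectory() as outside_tree:
        log = Path(outside_tree) / "review-log.jsonl"
        append_log_entry(log, 1, critical=1, high=2)
        append_log_entry(log, 2, critical=0, high=1, note="rollback path reworked")
        print(read_counts(log))
```

Appending rather than rewriting matters here: an append-only record is harder for a later pass to tidy into a better story than the one that actually happened.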

When to reach for it, and when not to

For a throwaway script, a config nudge, or a change you can fully hold in your head, a second lab is overkill and the friction is not worth it. Reach for it when the change touches the expensive surfaces: auth, data you cannot un-delete, rollback paths, anything concurrent, anything with a blast radius. That is where correlated blind spots are most likely to hide and most costly when they do.

And through all of it the human still owns the decision. Cross-vendor review displaces self-review, not the reviewer. A second model with a decorrelated failure surface will catch things a single model cannot, but it cannot weigh the product tradeoff, cannot decide the risk is acceptable this week, and cannot be the name attached to what ships. What this setup actually does is take the thing I keep arguing is becoming the real bottleneck, verification, and push it down to my own keyboard where it is cheap enough to run on every change.

Past one desk

Everything above is a single developer with two models and a handful of slash commands. A team cannot run on habits, so the same idea has to move from something you remember to do into something the pipeline does for you.

The Stop-gate is the tell. At a desk it is a local hook you opt into; for a team it becomes a required status check on the pull request. Continuous integration runs the rival model headless, posts findings as review comments, and branch protection refuses the merge on anything critical or high. The loop from earlier stops being an interactive command and becomes a bounded CI job with a hard cap on passes. Nobody has to remember to run it, and nobody can quietly skip it.
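The merge-blocking half of that job reduces to a severity filter and an exit code. Here is a minimal sketch, assuming the headless reviewer's output has already been parsed into a list of dictionaries with a "severity" key. That shape is hypothetical and is not the plugin's actual output format.

```python
BLOCKING_SEVERITIES = {"critical", "high"}


def blocking_findings(findings: list[dict]) -> list[dict]:
    """Keep only the findings that should refuse the merge."""
    return [
        finding for finding in findings
        if str(finding.get("severity", "")).lower() in BLOCKING_SEVERITIES
    ]


def gate_exit_code(findings: list[dict]) -> int:
    """Non-zero fails the CI job, which a required status check turns into a blocked merge."""
    return 1 if blocking_findings(findings) else 0


if __name__ == "__main__":
    # Hypothetical findings, shaped the way this sketch assumes.
    sample = [
        {"severity": "high", "title": "migration is not reversible"},
        {"severity": "low", "title": "inconsistent naming"},
    ]
    for finding in blocking_findings(sample):
        print(f"BLOCKING [{finding['severity']}] {finding['title']}")
    print("exit code:", gate_exit_code(sample))
```

A real job would pass that code to the process exit so the status check goes red; the sketch only prints it.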

Two things then have to become policy rather than preference. First, the reviewer must come from a different lab than the author, and that must be enforced, because the whole argument dies the moment a team reviews Claude code with Claude. Second, one platform team owns the adversarial prompt and the severity rubric centrally, the way they already own the lint config, so the skeptic is calibrated the same across every repo instead of reinvented per developer.
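Enforcing the different-lab rule can be as plain as a check that fails the pipeline when author and reviewer resolve to the same vendor. This is a minimal sketch: the prefix table, the model names in the demo, and the function names are all hypothetical placeholders, and a real setup would read the author and reviewer from wherever the pipeline records them.

```python
# Hypothetical mapping from model-name prefix to the lab that trained it.
VENDOR_BY_PREFIX = {
    "claude": "anthropic",
    "codex": "openai",
    "gpt": "openai",
}


def vendor_of(model_name: str) -> str:
    """Resolve a model name to its lab, failing closed on anything unknown."""
    lowered = model_name.lower()
    for prefix, vendor in VENDOR_BY_PREFIX.items():
        if lowered.startswith(prefix):
            return vendor
    raise ValueError(f"unknown vendor for model {model_name!r}")


def check_cross_vendor(author_model: str, reviewer_model: str) -> None:
    """Raise if the reviewer comes from the same lab as the author."""
    author = vendor_of(author_model)
    reviewer = vendor_of(reviewer_model)
    if author == reviewer:
        raise ValueError(
            f"reviewer {reviewer_model!r} and author {author_model!r} "
            f"are both from {author}; cross-vendor review required"
        )


if __name__ == "__main__":
    check_cross_vendor("claude-example", "codex-example")  # passes silently
    try:
        check_cross_vendor("claude-example", "claude-other-example")
    except ValueError as error:
        print(error)
```

Unknown model names raise rather than pass, on the theory that a policy check should fail closed when it cannot tell who is reviewing whom.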

Then you route by blast radius, which is the same call from the last section written down as config instead of made fresh each time. The expensive adversarial cross-vendor loop runs on the paths that can actually hurt you (auth, billing, migrations, infrastructure), and everything else gets the cheap pass or nothing. Run the aggressive version everywhere and you will drain usage and train the team to rubber-stamp a reviewer that cries wolf. Dismissed findings should feed back into the shared prompt, and someone should watch the reviewer’s precision, because a gate nobody trusts is just latency.
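Routing by blast radius written down as config might look like the sketch below. The directory patterns, tier names, and function names are hypothetical; the only idea taken from the text is that risky paths get the adversarial loop and everything else gets the cheap pass or nothing.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: first matching pattern wins.
ROUTES = [
    ("auth/*", "adversarial"),
    ("billing/*", "adversarial"),
    ("migrations/*", "adversarial"),
    ("infra/*", "adversarial"),
    ("docs/*", "none"),
]
DEFAULT_TIER = "plain"
TIER_RANK = {"none": 0, "plain": 1, "adversarial": 2}


def tier_for_path(path: str) -> str:
    """Review tier for a single changed file."""
    for pattern, tier in ROUTES:
        # fnmatchcase's "*" also matches "/", so nested files are covered.
        if fnmatchcase(path, pattern):
            return tier
    return DEFAULT_TIER


def tier_for_change(changed_paths: list[str]) -> str:
    """A change gets the strictest tier that any of its files asks for."""
    tiers = [tier_for_path(path) for path in changed_paths]
    return max(tiers, key=TIER_RANK.__getitem__, default="none")


if __name__ == "__main__":
    print(tier_for_change(["docs/intro.md"]))
    print(tier_for_change(["src/app.py", "docs/intro.md"]))
    print(tier_for_change(["src/app.py", "auth/session/token.py"]))
```

Taking the strictest tier across the whole change means a one-line edit under auth/ cannot hide inside a large, otherwise harmless pull request.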

What does not move is the human at the end. A named person still approves the merge, because accountability does not automate, and the review log (which models looked, what they found, who overrode) becomes the thing you point to when someone asks whether it was verified. In a regulated shop that record is not overhead. It is the product.

Whether the two labs can eventually verify each other well enough to leave the human out, or whether they only relocate the bottleneck to the seam between them, I do not know yet. For now I will take the arrangement I have: two models that fail in different places, made to look each other in the eye before I sign off.