We plant known bugs in real code, then measure what AI code reviewers actually catch.
Across 3 projects we plant 36 bugs — 9 critical / 9 high / 9 medium / 9 low — plus safe-but-suspicious noise. Every tool gets the same prompt; findings are scored by an LLM judge, over 3 runs. It's an informal one-person experiment, so treat it as a directional signal — the numbers can shift as tools improve and as more projects are added.
- recall: share of a project's planted bugs a tool found. The headline metric.
- FP: false positives per run (safe code wrongly flagged). The second axis.
- bonus: extra real issues found beyond the planted set (not scored as recall).
- native: the tool ran its own review engine, not our shared prompt, so it's not directly comparable.
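To make the definitions concrete, here is a minimal sketch of how recall and FP could be computed from judged runs. The `RunResult` structure and the bug IDs are hypothetical, and averaging recall per run is an assumption; the real harness may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One review run as scored by the judge (hypothetical structure)."""
    found: set[str]        # IDs of planted bugs judged as found
    false_positives: int   # findings that flagged safe code
    bonus: int = 0         # real issues outside the planted set; never counted in recall


def mean_recall(runs: list[RunResult], planted: set[str]) -> float:
    """Average, over runs, of the share of planted bugs found."""
    return sum(len(r.found & planted) / len(planted) for r in runs) / len(runs)


def mean_fp(runs: list[RunResult]) -> float:
    """False positives per run."""
    return sum(r.false_positives for r in runs) / len(runs)


# Illustrative data only: 4 planted bugs, 3 runs.
planted = {"C1", "H1", "M1", "L1"}
runs = [
    RunResult(found={"C1", "H1"}, false_positives=0),
    RunResult(found={"C1", "H1", "M1"}, false_positives=1, bonus=2),
    RunResult(found={"C1"}, false_positives=0),
]
print(mean_recall(runs, planted))  # 0.5
print(mean_fp(runs))
```

Bonus findings are carried on the run but deliberately left out of `mean_recall`, matching the definition above.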
Small, fully-readable projects. A single LLM judge. One language (Python) so far. Costs are API-equivalent list-price estimates, not billed spend. This is an informal, one-person experiment — a directional signal, not gospel.
All 3 projects, combined
Every project's planted bugs merged — 36 in total (9 critical, 9 high, 9 medium, 9 low). Each severity cell shows how many run-hits were found out of the possible total, pooled across the 3 projects; the overall column is mean recall.
severity cells: found / possible run-hits (planted × runs, all projects) · overall: mean recall · FP: mean/run · bonus & $: totals for one full pass · speed: mean s/run
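As a rough illustration of the "found / possible run-hits" cells, the sketch below pools hits by severity. The function name, the `project/bug` ID scheme, and the input shapes are assumptions made for this example, not the harness's actual data model.

```python
from collections import Counter


def severity_cells(hits, severity_of, n_runs):
    """Return {severity: (found_run_hits, possible_run_hits)}.

    hits        -- set of (bug_id, run_index) pairs the judge marked as found
    severity_of -- maps every planted bug ID to its severity
    n_runs      -- runs per configuration
    """
    planted = Counter(severity_of.values())
    found = Counter(severity_of[bug] for bug, _run in hits)
    # Possible run-hits are planted bugs of that severity times the run count.
    return {sev: (found[sev], planted[sev] * n_runs) for sev in planted}


# Illustrative data only: two critical bugs and one low bug, 3 runs.
severity_of = {"basic/C1": "critical", "pricing/C1": "critical", "basic/L3": "low"}
hits = {("basic/C1", 0), ("basic/C1", 1), ("basic/C1", 2), ("basic/L3", 1)}
cells = severity_cells(hits, severity_of, n_runs=3)
print(cells)  # {'critical': (3, 6), 'low': (1, 3)}
```

With the full suite (9 bugs per severity, 3 runs) each severity cell would have 27 possible run-hits.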
measured · 36 configs × 3 runs · generated from harness/config.toml + harness/results
Which bugs get found
Every planted bug (columns) against every configuration (rows), one project at a time. It shows the bugs the whole field misses, not just the totals.
✅ found every run · ⚠️ some runs · ❌ never · rows ranked by recall · hover a cell for the bug
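The three cell states reduce to a small classification over per-run hit flags. This is a sketch with a hypothetical function name, assuming each cell is backed by one boolean per run.

```python
def stability(hits: list[bool]) -> str:
    """Classify one bug for one configuration from its per-run hit flags."""
    if not hits:
        raise ValueError("need at least one run")
    if all(hits):
        return "every run"   # shown as a full hit
    if any(hits):
        return "some runs"   # flaky
    return "never"


print(stability([True, True, True]))     # every run
print(stability([True, False, True]))    # some runs
print(stability([False, False, False]))  # never
```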
Which bugs are hard for AI
Flip the axis: score the 36 planted bugs by how findable they are across every configuration. A bug is "found" by a config if it caught it in a majority of runs.
% = mean caught-fraction across all configs · n/m = configs that found it (majority of runs)
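Both numbers in the legend can be derived from the same per-run flags. The sketch below assumes "majority" means strictly more than half of a config's runs; the function name and config names are placeholders.

```python
def findability(catches: dict[str, list[bool]]) -> tuple[float, int, int]:
    """Score one planted bug across configurations.

    catches maps a config name to its per-run hit flags for this bug.
    Returns (mean caught-fraction, configs that found it, total configs).
    """
    fractions = [sum(runs) / len(runs) for runs in catches.values()]
    mean_fraction = sum(fractions) / len(fractions)
    n_found = sum(1 for f in fractions if f > 0.5)  # majority of runs
    return mean_fraction, n_found, len(fractions)


# Illustrative data only: four configs, three runs each.
catches = {
    "config-a": [True, True, True],
    "config-b": [True, False, False],
    "config-c": [False, False, False],
    "config-d": [True, True, False],
}
pct, n, m = findability(catches)
print(f"{pct:.0%} {n}/{m}")  # 50% 2/4
```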
What if you ran all of them?
Combining tools is a different question from ranking them. Counting a bug as found when a config catches it in a majority of runs, across all 36 planted bugs:
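One way to read "running all of them" is as a union: a bug is covered if at least one configuration catches it in a majority of runs. The sketch below makes that assumption explicit; the names and data are illustrative only.

```python
def ensemble_coverage(catches_by_config, planted):
    """Return (covered, missed) planted bugs under a union-of-tools reading.

    catches_by_config maps config name -> {bug_id: per-run hit flags}.
    """
    covered = set()
    for per_bug in catches_by_config.values():
        for bug, runs in per_bug.items():
            # Majority rule: strictly more than half of the runs.
            if bug in planted and 2 * sum(runs) > len(runs):
                covered.add(bug)
    return covered, planted - covered


# Illustrative data only.
planted = {"C1", "H3", "L3"}
catches_by_config = {
    "config-a": {"C1": [False, False, False], "H3": [True, True, False]},
    "config-b": {"C1": [True, False, False], "L3": [True, True, True]},
}
covered, missed = ensemble_coverage(catches_by_config, planted)
print(sorted(covered), sorted(missed))  # ['H3', 'L3'] ['C1']
```

A bug that every tool catches only occasionally (like `C1` here) stays missed under this rule, even though some run somewhere did find it.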
Recall vs reasoning effort
For each model that ran more than one effort tier, overall recall from its lowest setting to its highest. Hover a point for detail.
measured · overall recall across all projects · axis truncated to show the band
What the next tier buys
For each model with multiple effort tiers, the change at every step: how much recall you gain (in points), and what it costs in dollars and seconds.
Δ recall in points · costs are per-run estimates · green = gain, red = loss
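The per-step deltas are plain differences between adjacent tiers. The sketch below uses the two codex scheduling figures quoted in the results (low: 97%, $0.17, 72s; xhigh: 89%, $0.45, 467s); the intermediate tiers are left out because their full figures are not quoted, and the function name is hypothetical.

```python
def tier_deltas(tiers):
    """Compute the change between each pair of adjacent effort tiers.

    tiers: list of (name, recall_pct, cost_usd, seconds), ordered low to high.
    Returns a list of (step, d_recall_points, d_cost_usd, d_seconds).
    """
    return [
        (f"{a[0]}->{b[0]}", b[1] - a[1], round(b[2] - a[2], 2), b[3] - a[3])
        for a, b in zip(tiers, tiers[1:])
    ]


codex_scheduling = [("low", 97, 0.17, 72), ("xhigh", 89, 0.45, 467)]
print(tier_deltas(codex_scheduling))  # [('low->xhigh', -8, 0.28, 395)]
```

A negative recall delta paired with positive cost and time deltas is the "red" case in the chart: paying more for less.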
Recall vs cost
Each configuration that reports a cost, plotted by estimated cost per run against overall recall. Tools that report no token usage (e.g. the manual and native reviewers) are omitted. Hover a point for detail.
dashed line = Pareto frontier (most recall for the cost) · faded points are dominated · cost is an API-equivalent estimate, not billed spend
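For readers unfamiliar with the term, a configuration is on the Pareto frontier when no other configuration is both at least as cheap and better on recall. A minimal sketch, with made-up points and a hypothetical function name:

```python
def pareto_frontier(points):
    """Return names of non-dominated points, ordered by cost.

    points: list of (name, cost_per_run, recall). Lower cost and higher
    recall are better. A point is kept only if it beats the recall of
    every cheaper (or equally cheap) point seen so far.
    """
    frontier = []
    best_recall = float("-inf")
    # Sort by cost ascending; on ties, try the higher recall first.
    for name, _cost, recall in sorted(points, key=lambda p: (p[1], -p[2])):
        if recall > best_recall:
            frontier.append(name)
            best_recall = recall
    return frontier


# Made-up points for illustration, not measured results.
points = [("a", 0.10, 80), ("b", 0.23, 100), ("c", 0.67, 86), ("d", 0.23, 90)]
print(pareto_frontier(points))  # ['a', 'b']
```

Here `c` and `d` are dominated: `b` costs no more than either and has higher recall.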
Recall per dollar
Overall recall divided by estimated cost per run — how much bug-finding each dollar buys. Tools with no cost reported are omitted.
bar = recall ÷ $/run (higher is more bug-finding per dollar)
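As a worked example of the ratio, the sketch below uses two python-basic figures quoted in the results: composer at 100% for $0.23 and opus max at 86% for $0.67. Expressing recall in percentage points (rather than as a fraction) is an assumption about the chart's units.

```python
def recall_per_dollar(recall_pct, cost_per_run):
    """Recall points per dollar; None when the tool reports no cost."""
    if cost_per_run is None or cost_per_run <= 0:
        return None
    return recall_pct / cost_per_run


print(round(recall_per_dollar(100, 0.23), 1))  # 434.8
print(round(recall_per_dollar(86, 0.67), 1))   # 128.4
print(recall_per_dollar(83, None))             # None (e.g. a native tool with no token usage)
```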
Recall vs false positives
The second axis. Overall recall against false positives per run — points on the left edge reported none. Hover a point for detail.
measured · false positives = safe code wrongly flagged · axis truncated to show the band
The writeup below is RESULTS.md, rendered verbatim — it was written by Claude (an AI) from the measured data. The charts are the measurement; this reading is one informal interpretation and may change as tools improve and projects are added.
An informal, honest benchmark of AI code-review tools. Each project hides a known
set of planted bugs, and every tool is scored by an LLM-as-judge (Claude Opus,
headless) against the answer key. Recall = planted bugs found. FP = false positives
per run. Costs are API-equivalent estimates (what the measured tokens would cost
on the API), not actual subscription spend. Default 3 runs per config; a bug
counts by stability (found every run / some / never). See harness/ for how to
reproduce.
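An API-equivalent estimate is just measured tokens multiplied by list prices. The sketch below shows the arithmetic; the function name and the per-million-token prices are placeholders chosen for illustration, not any vendor's actual rates or the harness's own code.

```python
def api_equivalent_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Estimate what a run's tokens would cost at API list prices.

    Prices are given in USD per million tokens.
    """
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Placeholder token counts and prices, for illustration only.
cost = api_equivalent_cost(
    input_tokens=100_000, output_tokens=10_000,
    usd_per_m_input=3.0, usd_per_m_output=15.0,
)
print(f"${cost:.2f}")  # $0.45
```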
Subjects: codex gpt-5.5 (efforts low→xhigh), claude opus-4.8 (low→max),
cursor-agent composer-2.5-fast, cursor bugbot (run manually), and
CodeRabbit CLI. The coding models and Bugbot share one standard prompt;
CodeRabbit runs its own review engine (marked native, and it reports no tokens so
it has speed but no cost). 30 automated model configs × 3 runs, plus Bugbot and
CodeRabbit × 3 projects.
Caveats: small, fully-readable projects; a single LLM judge; one language (Python) so far;
native tools aren't directly comparable on the prompt axis; costs are list-price estimates, not billed spend. Treat as a directional signal, not gospel.
Project 1 — python-basic (textbook web-backend bugs)
12 planted bugs: SQLi, command injection, pickle RCE, path traversal, IDOR, unsalted
MD5, TOCTOU transfer, float-for-money, off-by-one pagination, mutable default,
broad-except, is vs ==.
Findings: near-saturated at the top — the famous vulnerabilities (SQLi, command
injection, pickle, path traversal, MD5, TOCTOU) are caught 3/3 by essentially
everyone. The whole spread comes from two non-flashy bugs: float-for-money
(opus missed it almost entirely — 0/3 at every effort except one lucky max run;
codex needed high+ to lock it) and broad-except (only composer got it 3/3;
most configs never did). Effort plateaus or inverts — codex peaks at high (94%)
and drops at xhigh (92%, and picks up 1.3 FP). composer reaches 100% for $0.23
while opus can't clear 86% even at max ($0.67). CodeRabbit, the one
purpose-built reviewer, lands at 83% (2nd from bottom) — though it flags the most
extra real issues of anyone (bonus 6.7), so it's thorough but not accurate.
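For readers who haven't met the two discriminating bug classes, here is an illustrative sketch of each. These snippets are written for this explanation and are not the planted code from the project.

```python
from decimal import Decimal

# Float-for-money: binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
decimal_total = Decimal("0.10") + Decimal("0.20")
print(float_total == 0.3)                 # False
print(decimal_total == Decimal("0.30"))   # True


# Broad-except: catching everything hides bad input behind a plausible default.
def parse_quantity_swallowing(raw):
    try:
        return int(raw)
    except Exception:
        return 0  # a typo in the input silently becomes "zero items"


# Narrower handling: catch only the expected error and surface it.
def parse_quantity_strict(raw):
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"invalid quantity: {raw!r}") from exc


print(parse_quantity_swallowing("abc"))  # 0
```

Neither pattern crashes or looks dangerous on the page, which may be part of why they separate tools more than the famous vulnerabilities do.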
Project 2 — python-pricing (subtle money-math correctness)
12 planted bugs, all subtle correctness, no famous vulns: unclamped discount →
negative price, per-line coupon over-charge, tax on the wrong base, float-money
refund, /30-vs-days_in_month proration, tier boundary off-by-one, truncate-vs-
round, wrong discount base, dropped remainder, > vs >=, cents truncation,
empty-input crash.
Findings:
- One bug beats everybody. The unclamped discount that yields a negative price (C1) was caught 0/3 by every codex and every opus config, by Bugbot, and by CodeRabbit; only composer caught it, and only 1/3. It's the single hardest item in the suite, plausibly because "should this function validate its input?" is a judgment call, not a mechanical defect.
- Effort barely moves it, and isn't monotonic. codex is 83% at low, high, and xhigh (with an 89% blip at medium); opus max (92%) only ties opus low (92%) at 4× the cost ($0.93 vs $0.23). Reasoning depth buys nothing here.
- codex has a crash blind spot. It caught the empty-input ZeroDivisionError (L3) at low/medium but went 0/3 at high and xhigh: more effort, fewer catches.
- CodeRabbit comes last (75%): the specialist reviewer is out-recalled by every general model configuration on the subtle-money project.
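To show why the hardest bug (C1) is easy to read past, here is an illustrative sketch of an unclamped discount next to a validated version. The function names and signatures are invented for this example; the planted code itself is not reproduced here.

```python
from decimal import Decimal


def discounted_price_unclamped(price: Decimal, discount_pct: Decimal) -> Decimal:
    """Looks correct, but a discount above 100% yields a negative price."""
    return price * (Decimal(100) - discount_pct) / Decimal(100)


def discounted_price(price: Decimal, discount_pct: Decimal) -> Decimal:
    """Same formula, with the input range checked first."""
    if not Decimal(0) <= discount_pct <= Decimal(100):
        raise ValueError(f"discount out of range: {discount_pct}")
    return price * (Decimal(100) - discount_pct) / Decimal(100)


print(discounted_price_unclamped(Decimal("20.00"), Decimal(150)))  # -10.00
print(discounted_price(Decimal("20.00"), Decimal(25)))             # 15.00
```

The arithmetic in the first function is right for every valid input, so spotting the defect means asking what the caller is allowed to pass, not checking the math.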
Project 3 — python-scheduling (date/time & calendar correctness)
12 planted bugs: one-directional overlap, cross-date conflict miss, naive-vs-UTC
comparison, timedelta.seconds (drops .days), off-by-one recurrence, trailing
free-slot, no range validation, touching-end busy, 30-day "months", insertion-order
"first", inclusive-end contains, empty-schedule max() crash.
Findings: the one project where opus leads — and the one where effort
actually helps, but only up to high (low 92% → high 100%), then plateaus and
dips at max (97%). For codex it inverts hard: low scores 97% at $0.17 in
72s, while xhigh scores 89% at $0.45 in 467s — six times slower and worse.
codex also has a blind spot on the empty-schedule crash (L3): 0/3 at medium,
high, and xhigh, while opus, composer, and bugbot all catch it. The trailing
free-slot off-by-one (H3) is flaky for everyone — the genuine coin-flip of the set.
CodeRabbit is last by a clear margin (67%) — nearly 20 points below the field.
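Two of the planted bug classes are worth seeing in isolation. The sketch below uses made-up dates and is not the project's code; it shows how `timedelta.seconds` drops whole days and how comparing a naive datetime with a timezone-aware one fails.

```python
from datetime import datetime, timedelta, timezone

# timedelta.seconds holds only the sub-day remainder; whole days live in .days.
gap = datetime(2024, 1, 2, 10, 0) - datetime(2024, 1, 1, 9, 0)  # 25 hours
print(gap.seconds)          # 3600
print(gap.total_seconds())  # 90000.0

# Ordering a naive datetime against an aware one raises TypeError.
naive = datetime(2024, 1, 1, 9, 0)
aware = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False
print(comparable)  # False
```

Code that uses `.seconds` passes every test where the two events fall on the same day, which is what makes this class of bug quiet.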
What it all says
- composer-2.5-fast is the value standout — 🥇 on basic and pricing, mid-pack on scheduling (92%), 0.0 false positives on all three projects, at $0.13–0.23/run. On this evidence it's the pick for routine bug-finding, and remarkable for its price.
- The purpose-built review products don't win. CodeRabbit — a dedicated AI code reviewer — lands last or 2nd-from-last on all three (83 / 75 / 67%), and Bugbot, though clean, never tops a project either. The general coding models and agents out-recall the specialist tools. And because the whole (small) project is in front of every tool, this is a reasoning gap, not a retrieval one — a tool that misses a bug it can fully see won't do better on a larger codebase, only worse.
- bugbot is the precision play — never wins a project (89 / 92 / 94%) but never adds a single false positive, and its extras are real. If a noisy reviewer is worse than a quiet one for your workflow, that profile matters more than raw recall. (CodeRabbit is the opposite trade: more extras, lower recall.)
- No model wins everywhere. composer takes the two correctness-heavy projects; opus owns date/time. Match the tool to the domain.
- Effort ≠ care, and often ≠ value. Higher reasoning effort helped only on scheduling, only up to high. Elsewhere it was flat (opus low = max on pricing) or inverted (codex low beat xhigh on scheduling; codex lost crash catches at high/xhigh). The xhigh/max tiers almost never earned their 3–5× cost or their multi-minute latency. Thoroughness reads as a model property, not a knob you can turn up.
- The discriminating axis is subtle correctness, not exotic topics. Famous vulnerabilities (basic) near-saturate; famous patterns don't separate tools at all (an earlier concurrency project scored everyone ~100% and was scrapped). The sharpest single discriminators here were the quiet ones: a discount that goes negative, a swallowed exception, a float where cents belong.
- False positives stayed rare and small. Almost every FP > 0 came from codex, spread across low, medium, and xhigh (up to 1.3 on basic xhigh) rather than clustered at any one tier, plus a single opus low run and one CodeRabbit pricing run (0.3). composer and bugbot never produced one across all three projects.
Method, provenance, and how to add tools/projects: harness/docs/ and
harness/REFERENCES.md. Full per-bug tables: harness/reports/summary.md.
I care about code review. As AI writes more of the code — and writes it faster — picking the right reviewer has quietly become the thing I spend the most effort on, and the thing I could find the least honest data about. Almost every tool claims higher recall and less noise than the rest; almost none of them show the numbers.
This isn't a formal or authoritative benchmark, and it may not discriminate as sharply as I'd like. But even a small, open experiment starts to surface the questions that actually matter: does more reasoning effort really help? Do the purpose-built reviewers earn their price? — especially as that price climbs: free tiers are tightening, and hosted reviews are creeping toward a dollar a pull request. When review gets expensive, running your own in CI (something like pr-agent, or a small harness of your own) becomes a real way to keep costs down — which turns “which model, at what effort, for how much time and money?” into a question worth measuring.
I don't expect this to be the last word. I just hope it's a useful reference for anyone wrestling with the same choice — read the results here, or fork it, add your own projects, and run it yourself.
an informal, open experiment · MIT-licensed