bughunt · an informal, honest benchmark

We plant known bugs in real code, then measure what AI code reviewers actually catch.

Across 3 projects we plant 36 bugs — 9 critical / 9 high / 9 medium / 9 low — plus safe-but-suspicious noise. Every tool gets the same prompt; findings are scored by an LLM judge, over 3 runs. It's an informal one-person experiment, so treat it as a directional signal — the numbers can shift as tools improve and as more projects are added.

How to read this
✅ every run · ⚠️ some runs · ❌ never: per-bug stability across 3 runs
recall: share of a project's planted bugs a tool found. The headline metric.
FP: false positives per run, meaning safe code wrongly flagged. The second axis.
bonus: extra real issues found beyond the planted set (not scored as recall).
native: the tool ran its own review engine, not our shared prompt, so it's not directly comparable.
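
As a rough sketch of how these two readings could be computed from per-run judge verdicts: the data below is made up for illustration, and the function names are hypothetical rather than taken from the harness. Recall is treated here as found run-hits over possible run-hits, which is an assumption that happens to match the table footnotes.

```python
def stability(hits):
    """Classify one bug from its per-run flags (True = found in that run)."""
    if all(hits):
        return "every run"
    if any(hits):
        return "some runs"
    return "never"


def recall(run_hits):
    """Found run-hits divided by possible run-hits (planted bugs x runs)."""
    found = sum(sum(hits) for hits in run_hits.values())
    possible = sum(len(hits) for hits in run_hits.values())
    return found / possible


# Made-up data for illustration: 4 planted bugs, 3 runs each.
example = {
    "C1": [True, True, True],
    "H1": [True, False, True],
    "M1": [False, False, False],
    "L1": [True, True, True],
}

labels = {bug: stability(hits) for bug, hits in example.items()}
score = recall(example)  # 8 found out of 12 possible run-hits
```

With three runs per config, a bug found in two runs counts as "some runs" and contributes 2 of 3 possible run-hits to recall.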
Caveats

Small, fully-readable projects. A single LLM judge. One language (Python) so far. Costs are API-equivalent list-price estimates, not billed spend. This is an informal, one-person experiment — a directional signal, not gospel.

Overall

All 3 projects, combined

Every project's planted bugs merged — 36 in total (9 critical, 9 high, 9 medium, 9 low). Each severity cell shows how many run-hits were found out of the possible total, pooled across the 3 projects; the overall column is mean recall.

 #  config              critical  high   medium  low    overall  FP   bonus  avg. speed  $
 1  composer-2.5-fast   25/27     26/27  25/27   27/27  95%      0.0  +12    52s         $0.58
 2  opus-4.8·max        24/27     26/27  25/27   24/27  92%      0.0  +7     225s        $2.26
 3  bugbot manual       24/27     25/27  24/27   26/27  92%      0.0  +7
 4  opus-4.8·high       24/27     27/27  24/27   23/27  91%      0.0  +6     61s         $0.87
 5  opus-4.8·xhigh      24/27     27/27  24/27   23/27  91%      0.0  +5     132s        $1.39
 6  gpt-5.5·low         24/27     27/27  25/27   21/27  90%      0.2  +7     79s         $0.63
 7  opus-4.8·medium     23/27     25/27  24/27   25/27  90%      0.0  +6     52s         $0.77
 8  gpt-5.5·medium      24/27     27/27  25/27   20/27  89%      0.3  +7     102s        $0.73
 9  gpt-5.5·high        24/27     26/27  26/27   19/27  88%      0.0  +7     126s        $0.88
10  gpt-5.5·xhigh       24/27     27/27  26/27   18/27  88%      0.4  +7     326s        $1.67
11  opus-4.8·low        24/27     22/27  24/27   24/27  87%      0.2  +6     42s         $0.67
12  coderabbit native   22/27     19/27  18/27   22/27  75%      0.1  +8     172s

severity cells: found / possible run-hits (planted × runs, all projects) · overall: mean recall · FP: mean/run · bonus & $: totals for one full pass · speed: mean s/run

measured · 36 configs × 3 runs · generated from harness/config.toml + harness/results

Per bug

Which bugs get found

Every planted bug (columns) against every configuration (rows), one project at a time. It shows the bugs the whole field misses, not just the totals.

[Matrix: configurations (rows) × planted bugs C1 C2 C3 H1 H2 H3 M1 M2 M3 L1 L2 L3 (columns). Row order: composer-2.5-fast, gpt-5.5·high, gpt-5.5·xhigh, gpt-5.5·low, gpt-5.5·medium, bugbot manual, opus-4.8·medium, opus-4.8·xhigh, opus-4.8·max, opus-4.8·high, coderabbit native, opus-4.8·low]

✅ found every run · ⚠️ some runs · ❌ never · rows ranked by recall

Difficulty

Which bugs are hard for AI

Flip the axis: score the 36 planted bugs by how findable they are across every configuration. A bug is "found" by a config if it caught it in a majority of runs.

bugs, by how many of the 12 configs find them (0 = found by nobody, 12 = found by everyone)

configs finding it   0   1   2   3   4   5   6   7   8   9  10  11  12
bugs                 1   0   0   0   1   0   1   0   0   3   2   5  23
the 10 hardest
C1 Discount percent not clamped: percent > 1 yields a negative price 3% 0/12
L2 Overly-broad except swallowing errors 31% 4/12
M2 Floating-point money arithmetic 44% 6/12
H3 free_slots uses `t <= day_end`, emitting a slot starting at/after the day end (off-by-one) 67% 9/12
L3 average_line divides by len with no guard for empty input (ZeroDivisionError) 72% 9/12
L3 busiest_hour calls max() on an empty schedule → ValueError 72% 9/12
M2 next_available treats a time touching a booking's end as busy (`<=`) → skips a valid slot 78% 10/12
L1 Loyalty threshold uses > instead of >= (spend exactly at a threshold misqualified) 81% 10/12
H3 Broken access control (IDOR) 89% 11/12
M3 monthly() adds timedelta(days=30) per step → drifts off the intended day of month 92% 11/12

% = mean caught-fraction across all configs · n/m = configs that found it (majority of runs)
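
A minimal sketch of the two numbers on each row, assuming per-config run flags are available. The function names are hypothetical. The C1 pattern follows the writeup (composer caught it in 1 of 3 runs, every other config in 0 of 3); which run composer caught it in is not stated, so the first position is an arbitrary choice.

```python
CONFIGS = [
    "composer-2.5-fast", "bugbot manual", "coderabbit native",
    "gpt-5.5·low", "gpt-5.5·medium", "gpt-5.5·high", "gpt-5.5·xhigh",
    "opus-4.8·low", "opus-4.8·medium", "opus-4.8·high",
    "opus-4.8·xhigh", "opus-4.8·max",
]


def majority_found(hits):
    """True when the bug was caught in more than half of the runs."""
    return sum(hits) * 2 > len(hits)


def difficulty(per_config_hits):
    """Return (mean caught-fraction, number of configs that found the bug)."""
    fractions = [sum(h) / len(h) for h in per_config_hits.values()]
    mean_fraction = sum(fractions) / len(fractions)
    finders = sum(majority_found(h) for h in per_config_hits.values())
    return mean_fraction, finders


# C1 (unclamped discount): only composer caught it, and only once.
c1_hits = {name: [False, False, False] for name in CONFIGS}
c1_hits["composer-2.5-fast"] = [True, False, False]  # run position assumed

mean_fraction, finders = difficulty(c1_hits)
percent = round(mean_fraction * 100)  # 1/36 of run-hits, shown as 3%
```

This is why C1 can show 3% while still reading 0/12: a single catch moves the mean but does not reach a majority for any config.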

Ensemble

What if you ran all of them?

Combining tools is a different question from ranking them. Counting a bug as found when a config catches it in a majority of runs, across all 36 planted bugs:

best single tool: composer-2.5-fast, 34/36 (94%)
all tools combined: 35/36 (97%), the union of every catch
bugs nobody finds: 1, not caught by any config
smallest set of configs that reaches that combined coverage: composer-2.5-fast (+34) + opus-4.8·medium (+1)
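
One way to find the smallest covering set is a brute-force search over combinations, which is cheap at 12 configs. This is a sketch under that assumption, not necessarily how the harness does it; the tool names and bug ids below are invented for illustration.

```python
from itertools import combinations


def smallest_cover(found_by):
    """found_by maps config -> set of bug ids it found in a majority of runs.

    Returns the smallest combination of configs whose union equals the
    union over all configs, trying sizes 1, 2, 3, ... in order.
    """
    target = set().union(*found_by.values())
    names = sorted(found_by)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(found_by[c] for c in combo))
            if covered == target:
                return combo, target
    return (), target


# Invented example: three tools, four bugs.
toy = {
    "tool-a": {"b1", "b2", "b3"},
    "tool-b": {"b2", "b4"},
    "tool-c": {"b3"},
}

combo, target = smallest_cover(toy)
```

Here no single tool covers all four bugs, so the search returns the pair tool-a and tool-b. A greedy pick would also work at this scale but is not guaranteed to be minimal in general.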
Effort

Recall vs reasoning effort

For each model that ran more than one effort tier, overall recall from its lowest setting to its highest.

[Chart: overall recall (y-axis, 80% to 100%) by reasoning effort (x-axis: low, medium, high, xhigh, max), one line each for gpt-5.5 and opus-4.8]

measured · overall recall across all projects · axis truncated to show the band

Effort ROI

What the next tier buys

For each model with multiple effort tiers, the change at every step: how much recall you gain (in points), and what it costs in dollars and seconds.

gpt-5.5
step            Δ recall  Δ $     Δ speed
low → medium    -1%       +$0.03  +23s
medium → high   -1%       +$0.05  +24s
high → xhigh    +0%       +$0.26  +200s

opus-4.8
step            Δ recall  Δ $     Δ speed
low → medium    +3%       +$0.04  +10s
medium → high   +1%       +$0.03  +9s
high → xhigh    +0%       +$0.17  +71s
xhigh → max     +1%       +$0.29  +93s

Δ recall in points · costs are per-run estimates · green = gain, red = loss
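
The step deltas can be approximately reproduced from the overall table. The sketch below uses the rounded gpt-5.5 figures and assumes the per-run cost is the full-pass total divided by the 3 projects; that assumption matches the gpt-5.5 rows, though rounded inputs can be a cent off elsewhere.

```python
PROJECTS = 3  # assumption: full-pass cost / 3 projects = cost per run

# (tier, overall recall %, $ for one full pass, mean seconds per run)
GPT_5_5 = [
    ("low", 90, 0.63, 79),
    ("medium", 89, 0.73, 102),
    ("high", 88, 0.88, 126),
    ("xhigh", 88, 1.67, 326),
]


def steps(tiers):
    """Deltas between each pair of adjacent effort tiers."""
    rows = []
    for (a, r0, c0, s0), (b, r1, c1, s1) in zip(tiers, tiers[1:]):
        delta_cost = round((c1 - c0) / PROJECTS, 2)
        rows.append((f"{a} → {b}", r1 - r0, delta_cost, s1 - s0))
    return rows


rows = steps(GPT_5_5)
for step, d_recall, d_cost, d_speed in rows:
    print(f"{step:<15} {d_recall:+d}%  {d_cost:+.2f}$  {d_speed:+d}s")
```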

Cost

Recall vs cost

Each configuration that reports a cost, plotted by estimated cost per run against overall recall. Tools that report no token usage (e.g. the manual and native reviewers) are omitted.

[Scatter chart: overall recall (y-axis, 80% to 100%) against estimated cost per run (x-axis, $0.00 to $0.80); series: gpt-5.5, opus-4.8, composer-2.5-fast]

dashed line = Pareto frontier (most recall for the cost) · faded points are dominated · cost is an API-equivalent estimate, not billed spend
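
A minimal sketch of the dominance test behind a Pareto frontier, using the rounded figures from the overall table. Costs here are full-pass totals rather than per-run values, which does not change the ordering; the function name is hypothetical.

```python
# config -> (cost in $ for one full pass, overall recall %)
POINTS = {
    "composer-2.5-fast": (0.58, 95),
    "gpt-5.5·low": (0.63, 90),
    "opus-4.8·low": (0.67, 87),
    "gpt-5.5·medium": (0.73, 89),
    "opus-4.8·medium": (0.77, 90),
    "opus-4.8·high": (0.87, 91),
    "gpt-5.5·high": (0.88, 88),
    "opus-4.8·xhigh": (1.39, 91),
    "gpt-5.5·xhigh": (1.67, 88),
    "opus-4.8·max": (2.26, 92),
}


def pareto_frontier(points):
    """Configs not dominated by another (cheaper-or-equal and at least as good)."""
    frontier = []
    for name, (cost, recall) in points.items():
        dominated = any(
            c <= cost and r >= recall and (c < cost or r > recall)
            for other, (c, r) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


frontier = pareto_frontier(POINTS)
```

On these rounded numbers composer-2.5-fast is both the cheapest and the highest-recall config, so it dominates every other point and the frontier collapses to that single config.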

Efficiency

Recall per dollar

Overall recall divided by estimated cost per run — how much bug-finding each dollar buys. Tools with no cost reported are omitted.

 #  config              recall ÷ $/run
 1  composer-2.5-fast   4.96
 2  gpt-5.5·low         4.24
 3  opus-4.8·low        3.91
 4  gpt-5.5·medium      3.67
 5  opus-4.8·medium     3.49
 6  opus-4.8·high       3.13
 7  gpt-5.5·high        3.00
 8  opus-4.8·xhigh      1.95
 9  gpt-5.5·xhigh       1.58
10  opus-4.8·max        1.22

bar = recall ÷ $/run (higher is more bug-finding per dollar)

Precision

Recall vs false positives

The second axis. Overall recall against false positives per run; points on the left edge reported none.

[Scatter chart: overall recall (y-axis, 70% to 100%) against false positives per run (x-axis, 0.0 to 1.0); series: gpt-5.5, opus-4.8, composer-2.5-fast, coderabbit, bugbot]

measured · false positives = safe code wrongly flagged · axis truncated to show the band

AI-generated interpretation

The writeup below is RESULTS.md, rendered verbatim — it was written by Claude (an AI) from the measured data. The charts are the measurement; this reading is one informal interpretation and may change as tools improve and projects are added.

An informal, honest benchmark of AI code-review tools. Each project hides a known set of planted bugs, and every tool is scored by an LLM-as-judge (Claude Opus, headless) against the answer key. Recall = planted bugs found. FP = false positives per run. Costs are API-equivalent estimates (what the measured tokens would cost on the API), not actual subscription spend. Default 3 runs per config; a bug counts by stability (found every run / some / never). See harness/ for how to reproduce.

Subjects: codex gpt-5.5 (efforts low→xhigh), claude opus-4.8 (low→max), cursor-agent composer-2.5-fast, cursor bugbot (run manually), and CodeRabbit CLI. The coding models and Bugbot share one standard prompt; CodeRabbit runs its own review engine (marked native, and it reports no tokens so it has speed but no cost). 30 automated model configs × 3 runs, plus Bugbot and CodeRabbit × 3 projects.

Caveats: small, fully-readable projects; a single LLM judge; one language (Python) so far; native tools aren't directly comparable on the prompt axis; costs are list-price estimates, not billed spend. Treat as a directional signal, not gospel.


Project 1 — python-basic (textbook web-backend bugs)

12 planted bugs: SQLi, command injection, pickle RCE, path traversal, IDOR, unsalted MD5, TOCTOU transfer, float-for-money, off-by-one pagination, mutable default, broad-except, is vs ==.

python-basic · n=3

 #  config              recall  FP   bonus  speed  $/run
 1  composer-2.5-fast   100%    0.0  +8.0   62s    $0.23
 2  gpt-5.5·high        94%     0.0  +5.7   114s   $0.27
 3  gpt-5.5·xhigh       92%     1.3  +4.0   278s   $0.67
 4  gpt-5.5·low         89%     0.3  +4.3   77s    $0.19
 5  gpt-5.5·medium      89%     0.7  +3.7   92s    $0.23
 6  bugbot manual       89%     0.0  +4.3
 7  opus-4.8·medium     86%     0.0  +4.3   44s    $0.25
 8  opus-4.8·xhigh      86%     0.0  +3.3   115s   $0.47
 9  opus-4.8·max        86%     0.0  +4.3   202s   $0.67
10  opus-4.8·high       83%     0.0  +4.0   59s    $0.31
11  coderabbit native   83%     0.0  +6.7   213s
12  opus-4.8·low        78%     0.7  +5.0   36s    $0.21

Findings: near-saturated at the top. The famous vulnerabilities (SQLi, command injection, pickle, path traversal, MD5, TOCTOU) are caught 3/3 by essentially everyone. The whole spread comes from two non-flashy bugs: float-for-money (opus missed it almost entirely, 0/3 at every effort except one lucky max run; codex needed high+ to lock it) and broad-except (only composer got it 3/3; most configs never did). Effort plateaus or inverts: codex peaks at high (94%) and drops at xhigh (92%, while picking up 1.3 FP per run). composer reaches 100% for $0.23 while opus can't clear 86% even at max ($0.67). CodeRabbit, a purpose-built reviewer, lands at 83% (2nd from bottom), though it flags more extra real issues than anyone except composer (bonus 6.7 vs 8.0), so it's thorough on real issues but weaker on the planted set.
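
For readers unfamiliar with the two quiet bug classes that drive the spread, here is a hypothetical illustration. It is not the planted code from python-basic; the function names are invented, and only standard-library behaviour is relied on.

```python
from decimal import Decimal, InvalidOperation

# Float-for-money: binary floats cannot represent 0.10 or 0.20 exactly.
float_total = 0.10 + 0.20
decimal_total = Decimal("0.10") + Decimal("0.20")


def parse_amount_swallowing(text):
    """Broad except: any failure, including a bad type, silently becomes 0."""
    try:
        return Decimal(text)
    except Exception:
        return Decimal("0")


def parse_amount_strict(text):
    """Catch only the expected parse error and surface it to the caller."""
    try:
        return Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a valid amount: {text!r}") from exc


hidden = parse_amount_swallowing("12,50")  # typo silently turns into 0
```

Neither pattern crashes or looks dangerous on a skim, which may be part of why reviewers that reliably flag SQL injection pass over them.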


Project 2 — python-pricing (subtle money-math correctness)

12 planted bugs, all subtle correctness, no famous vulns: unclamped discount → negative price, per-line coupon over-charge, tax on the wrong base, float-money refund, /30-vs-days_in_month proration, tier boundary off-by-one, truncate-vs- round, wrong discount base, dropped remainder, > vs >=, cents truncation, empty-input crash.

python-pricing · n=3

 #  config              recall  FP   bonus  speed  $/run
 1  composer-2.5-fast   94%     0.0  +2.7   36s    $0.13
 2  opus-4.8·low        92%     0.0  +1.3   46s    $0.23
 3  opus-4.8·max        92%     0.0  +2.0   276s   $0.93
 4  bugbot manual       92%     0.0  +1.3
 5  gpt-5.5·medium      89%     0.3  +0.7   118s   $0.27
 6  opus-4.8·medium     89%     0.0  +1.3   55s    $0.28
 7  opus-4.8·high       89%     0.0  +1.3   66s    $0.31
 8  opus-4.8·xhigh      86%     0.0  +1.0   154s   $0.48
 9  gpt-5.5·low         83%     0.0  +1.0   87s    $0.27
10  gpt-5.5·high        83%     0.0  +0.0   137s   $0.34
11  gpt-5.5·xhigh       83%     0.0  +0.7   232s   $0.56
12  coderabbit native   75%     0.3  +0.3   180s

Findings:

  1. One bug beats everybody. The unclamped discount that yields a negative price (C1) went 0/3 for every codex and every opus config, for Bugbot, and for CodeRabbit; only composer caught it, and only in 1 of 3 runs. It's the single hardest item in the suite, plausibly because "should this function validate its input?" is a judgment call, not a mechanical defect.
  2. Effort barely moves it, and isn't monotonic. codex is 83% at low, high, and xhigh (with an 89% blip at medium); opus max (92%) only ties opus low (92%), at 4× the cost ($0.93 vs $0.23). Reasoning depth buys nothing here.
  3. codex has a crash blind spot. It caught the empty-input ZeroDivisionError (L3) at low/medium but went 0/3 at high and xhigh: more effort, fewer catches.
  4. CodeRabbit comes last (75%) — the specialist reviewer is out-recalled by every general model configuration on the subtle-money project.
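
To make the C1 bug concrete, here is a hypothetical reconstruction of the pattern, not the project's actual code. It assumes the discount is passed as a fraction (0.25 for 25%) and prices are in integer cents; both function names are invented.

```python
def apply_discount_unclamped(price_cents, percent):
    """Trusts the caller: a percent above 1 flips the price negative."""
    return round(price_cents * (1 - percent))


def apply_discount_clamped(price_cents, percent):
    """Clamp the fraction into [0, 1] so the result stays in [0, price]."""
    percent = min(max(percent, 0.0), 1.0)
    return round(price_cents * (1 - percent))


bad = apply_discount_unclamped(1000, 1.5)   # negative price
safe = apply_discount_clamped(1000, 1.5)    # floor at zero
normal = apply_discount_clamped(1000, 0.25)
```

Whether to clamp or to raise an error is itself a design decision, which fits the reading above that this bug is a judgment call rather than a mechanical defect.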

Project 3 — python-scheduling (date/time & calendar correctness)

12 planted bugs: one-directional overlap, cross-date conflict miss, naive-vs-UTC comparison, timedelta.seconds (drops .days), off-by-one recurrence, trailing free-slot, no range validation, touching-end busy, 30-day "months", insertion-order "first", inclusive-end contains, empty-schedule max() crash.

python-scheduling · n=3

 #  config              recall  FP   bonus  speed  $/run
 1  opus-4.8·high       100%    0.0  +0.7   59s    $0.26
 2  opus-4.8·xhigh      100%    0.0  +1.0   127s   $0.44
 3  gpt-5.5·low         97%     0.3  +1.7   72s    $0.17
 4  opus-4.8·max        97%     0.0  +0.7   197s   $0.66
 5  opus-4.8·medium     94%     0.0  +0.0   56s    $0.25
 6  bugbot manual       94%     0.0  +1.7
 7  composer-2.5-fast   92%     0.0  +1.7   59s    $0.21
 8  opus-4.8·low        92%     0.0  +0.0   45s    $0.23
 9  gpt-5.5·medium      89%     0.0  +2.3   97s    $0.22
10  gpt-5.5·xhigh       89%     0.0  +2.7   467s   $0.45
11  gpt-5.5·high        86%     0.0  +1.3   126s   $0.27
12  coderabbit native   67%     0.0  +1.0   124s

Findings: the one project where opus leads, and the one where effort actually helps, but only up to high (low 92% → high 100%); it then plateaus at xhigh (100%) and dips at max (97%). For codex it inverts hard: low scores 97% at $0.17 in 72s, while xhigh scores 89% at $0.45 in 467s, more than six times slower and worse. codex also has a blind spot on the empty-schedule crash (L3): 0/3 at medium, high, and xhigh, while opus, composer, and bugbot all catch it. The trailing free-slot off-by-one (H3) is flaky for everyone, the genuine coin-flip of the set. CodeRabbit is last by a clear margin (67%), 19 points below the next-lowest config (86%).
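
The two bugs named in these findings can be illustrated with a small hypothetical sketch. These are not the project's functions: the names and signatures are invented, and the fixed loop assumes the step divides the day evenly.

```python
from collections import Counter
from datetime import datetime, timedelta


def slot_starts_buggy(day_start, day_end, step):
    """Inclusive bound: also emits a slot that starts exactly at day_end."""
    starts, t = [], day_start
    while t <= day_end:
        starts.append(t)
        t += step
    return starts


def slot_starts_fixed(day_start, day_end, step):
    """Exclusive bound: every slot starts strictly before day_end."""
    starts, t = [], day_start
    while t < day_end:
        starts.append(t)
        t += step
    return starts


def busiest_hour(bookings):
    """Most common start hour, or None instead of ValueError when empty."""
    hours = Counter(b.hour for b in bookings)
    return max(hours, key=hours.get, default=None)


start = datetime(2024, 1, 8, 9, 0)
end = datetime(2024, 1, 8, 10, 0)
half_hour = timedelta(minutes=30)

buggy = slot_starts_buggy(start, end, half_hour)   # 09:00, 09:30, 10:00
fixed = slot_starts_fixed(start, end, half_hour)   # 09:00, 09:30
```

The difference between the two loops is a single character, which is consistent with the off-by-one being caught only intermittently.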


What it all says

  • composer-2.5-fast is the value standout — 🥇 on basic and pricing, mid-pack on scheduling (92%), 0.0 false positives on all three projects, at $0.13–0.23/run. On this evidence it's the pick for routine bug-finding, and remarkable for its price.
  • The purpose-built review products don't win. CodeRabbit — a dedicated AI code reviewer — lands last or 2nd-from-last on all three (83 / 75 / 67%), and Bugbot, though clean, never tops a project either. The general coding models and agents out-recall the specialist tools. And because the whole (small) project is in front of every tool, this is a reasoning gap, not a retrieval one — a tool that misses a bug it can fully see won't do better on a larger codebase, only worse.
  • bugbot is the precision play — never wins a project (89 / 92 / 94%) but never adds a single false positive, and its extras are real. If a noisy reviewer is worse than a quiet one for your workflow, that profile matters more than raw recall. (CodeRabbit is the opposite trade: more extras, lower recall.)
  • No model wins everywhere. composer takes basic and pricing; opus owns date/time. Match the tool to the domain.
  • Effort ≠ care, and often ≠ value. Higher reasoning effort helped only on scheduling, only up to high. Elsewhere it was flat (opus low = max on pricing) or inverted (codex low beat xhigh on scheduling; codex lost crash catches at high/xhigh). The xhigh/max tiers almost never earned their 3–5× cost or their multi-minute latency. Thoroughness reads as a model property, not a knob you can turn up.
  • The discriminating axis is subtle correctness, not exotic topics. Famous vulnerabilities (basic) near-saturate; famous patterns don't separate tools at all (an earlier concurrency project scored everyone ~100% and was scrapped). The sharpest single discriminators here were the quiet ones: a discount that goes negative, a swallowed exception, a float where cents belong.
  • False positives stayed rare and small. Almost every FP > 0 came from codex — spread across low, medium, and xhigh (up to 1.3 on basic xhigh), not clustered at any one tier — plus a single opus low run and one CodeRabbit pricing run (0.3). composer and bugbot never produced one across all three projects.

Method, provenance, and how to add tools/projects: harness/docs/ and harness/REFERENCES.md. Full per-bug tables: harness/reports/summary.md.

Why this exists

I care about code review. As AI writes more of the code — and writes it faster — picking the right reviewer has quietly become the thing I spend the most effort on, and the thing I could find the least honest data about. Almost every tool claims higher recall and less noise than the rest; almost none of them show the numbers.

This isn't a formal or authoritative benchmark, and it may not discriminate as sharply as I'd like. But even a small, open experiment starts to surface the questions that actually matter: does more reasoning effort really help? Do the purpose-built reviewers earn their price? — especially as that price climbs: free tiers are tightening, and hosted reviews are creeping toward a dollar a pull request. When review gets expensive, running your own in CI (something like pr-agent, or a small harness of your own) becomes a real way to keep costs down — which turns “which model, at what effort, for how much time and money?” into a question worth measuring.

I don't expect this to be the last word. I just hope it's a useful reference for anyone wrestling with the same choice — read the results here, or fork it, add your own projects, and run it yourself.

an informal, open experiment · MIT-licensed