bughunt · an informal, honest benchmark

We plant known bugs in real code, then measure
what AI code reviewers actually catch.

Across 3 projects we plant 36 bugs — 9 critical / 9 high / 9 medium / 9 low — plus safe-but-suspicious noise. Every tool gets the same prompt; findings are scored by an LLM judge, over 3 runs. It's an informal one-person experiment, so treat it as a directional signal — the numbers can shift as tools improve and as more projects are added.


How to read this

✅ every run · ⚠️ some runs · ❌ never: per-bug stability across 3 runs

recall: share of a project's planted bugs a tool found. The headline metric.
FP: false positives per run, i.e. safe code wrongly flagged. The second axis.
bonus: extra real issues found beyond the planted set (not scored as recall).
native: the tool ran its own review engine, not our shared prompt, so it's not directly comparable.
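As a minimal sketch of how these labels relate, assume each planted bug is recorded as one found/missed flag per run. The function names and the example data are hypothetical, not taken from the harness.

```python
from typing import Dict, List


def stability(hits: List[bool]) -> str:
    """Classify one planted bug from its per-run outcomes."""
    if all(hits):
        return "every run"
    if any(hits):
        return "some runs"
    return "never"


def recall(run_hits: Dict[str, List[bool]]) -> float:
    """Share of possible run-hits (planted bugs x runs) that were found."""
    found = sum(sum(h) for h in run_hits.values())
    possible = sum(len(h) for h in run_hits.values())
    return found / possible


# Hypothetical outcomes for three planted bugs over 3 runs.
example = {
    "C1": [True, True, True],
    "M2": [True, False, False],
    "L2": [False, False, False],
}

for bug, hits in example.items():
    print(bug, stability(hits))
print(f"recall = {recall(example):.2f}")  # 4 of 9 run-hits
```

Counting run-hits rather than distinct bugs is an assumption here; it is the reading that matches the found/possible cells in the tables.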

Caveats

Small, fully-readable projects. A single LLM judge. One language (Python) so far. Costs are API-equivalent list-price estimates, not billed spend. This is an informal, one-person experiment — a directional signal, not gospel.

Overall

All 3 projects, combined

Every project's planted bugs merged — 36 in total (9 critical, 9 high, 9 medium, 9 low). Each severity cell shows how many run-hits were found out of the possible total, pooled across the 3 projects; the overall column is mean recall.

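| # | config | critical | high | medium | low | overall | FP | bonus | avg. speed | $ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | composer-2.5-fast | 25/27 | 26/27 | 25/27 | 27/27 | 95% | 0.0 | +12 | 52s | $0.58 |
| 2 | opus-4.8·max | 24/27 | 26/27 | 25/27 | 24/27 | 92% | 0.0 | +7 | 225s | $2.26 |
| 3 | bugbot manual | 24/27 | 25/27 | 24/27 | 26/27 | 92% | 0.0 | +7 | — | — |
| 4 | opus-4.8·high | 24/27 | 27/27 | 24/27 | 23/27 | 91% | 0.0 | +6 | 61s | $0.87 |
| 5 | opus-4.8·xhigh | 24/27 | 27/27 | 24/27 | 23/27 | 91% | 0.0 | +5 | 132s | $1.39 |
| 6 | gpt-5.5·low | 24/27 | 27/27 | 25/27 | 21/27 | 90% | 0.2 | +7 | 79s | $0.63 |
| 7 | opus-4.8·medium | 23/27 | 25/27 | 24/27 | 25/27 | 90% | 0.0 | +6 | 52s | $0.77 |
| 8 | gpt-5.5·medium | 24/27 | 27/27 | 25/27 | 20/27 | 89% | 0.3 | +7 | 102s | $0.73 |
| 9 | gpt-5.5·high | 24/27 | 26/27 | 26/27 | 19/27 | 88% | 0.0 | +7 | 126s | $0.88 |
| 10 | gpt-5.5·xhigh | 24/27 | 27/27 | 26/27 | 18/27 | 88% | 0.4 | +7 | 326s | $1.67 |
| 11 | opus-4.8·low | 24/27 | 22/27 | 24/27 | 24/27 | 87% | 0.2 | +6 | 42s | $0.67 |
| 12 | coderabbit native | 22/27 | 19/27 | 18/27 | 22/27 | 75% | 0.1 | +8 | 172s | — |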

severity cells: found / possible run-hits (planted × runs, all projects) · overall: mean recall · FP: mean/run · bonus & $: totals for one full pass · speed: mean s/run
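A short worked example of reading one row: the sketch below pools the four severity cells for composer-2.5-fast into the overall figure. The cell values come from the table; the helper name is hypothetical. Because every project plants the same number of bugs, pooling run-hits gives the same result as averaging the three per-project recalls.

```python
# Severity cells for composer-2.5-fast, copied from the overall table:
# (run-hits found, run-hits possible). 27 = 9 bugs per severity x 3 runs.
CELLS = {
    "critical": (25, 27),
    "high": (26, 27),
    "medium": (25, 27),
    "low": (27, 27),
}


def pooled_recall(cells):
    """Total run-hits found divided by total run-hits possible."""
    found = sum(f for f, _ in cells.values())
    possible = sum(p for _, p in cells.values())
    return found / possible


# 103 of 108 run-hits, which rounds to the 95% shown in the table.
print(f"{pooled_recall(CELLS):.0%}")
```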

measured · 36 config-project pairs (12 configs × 3 projects) × 3 runs · generated from harness/config.toml + harness/results

Per bug

Which bugs get found

Every planted bug (columns) against every configuration (rows), one project at a time. It shows the bugs the whole field misses, not just the totals.

config \\ bug C1 C2 C3 H1 H2 H3 M1 M2 M3 L1 L2 L3

composer-2.5-fast

✅✅✅✅✅✅✅✅✅✅✅✅

gpt-5.5·high

✅✅✅✅✅✅✅✅✅✅⚠️✅

gpt-5.5·xhigh

✅✅✅✅✅✅✅✅✅✅❌✅

gpt-5.5·low

✅✅✅✅✅✅✅⚠️✅✅❌✅

gpt-5.5·medium

✅✅✅✅✅✅✅⚠️✅✅❌✅

bugbot manual

✅✅✅✅✅✅✅❌✅✅⚠️✅

opus-4.8·medium

✅✅✅✅✅⚠️✅❌✅✅⚠️✅

opus-4.8·xhigh

✅✅✅✅✅✅✅❌✅✅⚠️✅

opus-4.8·max

✅✅✅✅✅✅✅⚠️✅✅❌✅

opus-4.8·high

✅✅✅✅✅✅✅❌✅✅❌✅

coderabbit native

✅✅✅⚠️✅⚠️⚠️⚠️✅✅⚠️✅

opus-4.8·low

✅✅✅✅✅⚠️✅❌✅✅❌✅

✅ found every run · ⚠️ some runs · ❌ never · rows ranked by recall

Difficulty

Which bugs are hard for AI

Flip the axis: score the 36 planted bugs by how findable they are across every configuration. A bug is "found" by a config if it caught it in a majority of runs.

bugs, by how many of the 12 configs find them (0 = found by nobody, 12 = found by everyone)

the 10 hardest

| bug | description | mean caught | configs |
|---|---|---|---|
| C1 | Discount percent not clamped: percent > 1 yields a negative price | 3% | 0/12 |
| L2 | Overly-broad except swallowing errors | 31% | 4/12 |
| M2 | Floating-point money arithmetic | 44% | 6/12 |
| H3 | free_slots uses `t <= day_end`, emitting a slot starting at/after the day end (off-by-one) | 67% | 9/12 |
| L3 | average_line divides by len with no guard for empty input (ZeroDivisionError) | 72% | 9/12 |
| L3 | busiest_hour calls max() on an empty schedule → ValueError | 72% | 9/12 |
| M2 | next_available treats a time touching a booking's end as busy (`<=`) → skips a valid slot | 78% | 10/12 |
| L1 | Loyalty threshold uses > instead of >= (spend exactly at a threshold misqualified) | 81% | 10/12 |
| H3 | Broken access control (IDOR) | 89% | 11/12 |
| M3 | monthly() adds timedelta(days=30) per step → drifts off the intended day of month | 92% | 11/12 |

% = mean caught-fraction across all configs · n/m = configs that found it (majority of runs)
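The two columns can be reproduced from per-config hit counts. The sketch below does it for the pricing bug C1, assuming the counts described in the writeup (composer caught it in 1 of 3 runs, every other config in none). The function name and the placeholder config names are hypothetical.

```python
from typing import Dict, Tuple


def difficulty(hits: Dict[str, int], runs: int = 3) -> Tuple[float, int]:
    """Return (mean caught-fraction, number of configs with a majority of runs)."""
    caught_fraction = sum(hits.values()) / (runs * len(hits))
    majority_configs = sum(1 for h in hits.values() if 2 * h > runs)
    return caught_fraction, majority_configs


# C1 (unclamped discount): composer 1/3, the other 11 configs 0/3.
# The "other-config-N" names are placeholders, not real config names.
c1_hits = {"composer-2.5-fast": 1}
c1_hits.update({f"other-config-{i}": 0 for i in range(11)})

fraction, majority = difficulty(c1_hits)
print(f"{fraction:.0%} {majority}/{len(c1_hits)}")  # matches the C1 row: 3% 0/12
```

A single lucky run raises the mean caught-fraction but does not count toward the majority column, which is why C1 shows 3% yet 0/12.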

Ensemble

What if you ran all of them?

Combining tools is a different question from ranking them. Counting a bug as found when a config catches it in a majority of runs, across all 36 planted bugs:

best single tool: composer-2.5-fast, 34/36

all tools combined: 35/36 (the union of every catch)

bug nobody finds: 1 (not caught by any config)

smallest set of configs that reaches that combined coverage: composer-2.5-fast (+34) + opus-4.8·medium (+1)
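One way to find a small covering set is a greedy pass: repeatedly pick the config that adds the most not-yet-covered bugs. This is only a sketch of that heuristic on made-up data. The page does not say how the harness computes its smallest set, and greedy selection is not guaranteed to be minimal in general.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(found_by: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Pick configs until the union of all catches is covered.

    Returns (config, newly covered bug count) pairs in pick order.
    """
    target = set().union(*found_by.values())
    covered: Set[str] = set()
    chosen: List[Tuple[str, int]] = []
    while covered != target:
        best = max(found_by, key=lambda c: len(found_by[c] - covered))
        gain = found_by[best] - covered
        chosen.append((best, len(gain)))
        covered |= gain
    return chosen


# Hypothetical majority-of-runs catches for three made-up configs.
catches = {
    "tool-a": {"b1", "b2", "b3", "b4"},
    "tool-b": {"b3", "b4", "b5"},
    "tool-c": {"b1", "b2"},
}
print(greedy_cover(catches))  # [('tool-a', 4), ('tool-b', 1)]
```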

Effort

Recall vs reasoning effort

For each model that ran more than one effort tier, overall recall from its lowest setting to its highest.

measured · overall recall across all projects · axis truncated to show the band

Effort ROI

What the next tier buys

For each model with multiple effort tiers, the change at every step: how much recall you gain (in points), and what it costs in dollars and seconds.

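gpt-5.5

| step | Δ recall (points) | Δ $ | Δ speed |
|---|---|---|---|
| low → medium | -1 | +$0.03 | +23s |
| medium → high | -1 | +$0.05 | +24s |
| high → xhigh | +0 | +$0.26 | +200s |

opus-4.8

| step | Δ recall (points) | Δ $ | Δ speed |
|---|---|---|---|
| low → medium | +3 | +$0.04 | +10s |
| medium → high | +1 | +$0.03 | +9s |
| high → xhigh | +0 | +$0.17 | +71s |
| xhigh → max | +1 | +$0.29 | +93s |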

Δ recall in points · costs are per-run estimates
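The step deltas are plain differences between adjacent tiers. A small sketch using the opus-4.8 overall recall and mean speed from the combined table; the function name is hypothetical, and cost deltas would follow the same pattern.

```python
# (tier, overall recall in %, mean seconds per run) for opus-4.8,
# copied from the combined table.
TIERS = [
    ("low", 87, 42),
    ("medium", 90, 52),
    ("high", 91, 61),
    ("xhigh", 91, 132),
    ("max", 92, 225),
]


def step_deltas(tiers):
    """Difference in recall points and seconds between adjacent tiers."""
    return [
        (f"{a[0]} -> {b[0]}", b[1] - a[1], b[2] - a[2])
        for a, b in zip(tiers, tiers[1:])
    ]


for step, d_recall, d_speed in step_deltas(TIERS):
    print(f"{step}: {d_recall:+d} pts, {d_speed:+d}s")
```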

Cost

Recall vs cost

Each configuration that reports a cost, plotted by estimated cost per run against overall recall. Tools that report no token usage (e.g. the manual and native reviewers) are omitted.

gpt-5.5 opus-4.8 composer-2.5-fast

dashed line = Pareto frontier (most recall for the cost) · faded points are dominated · cost is an API-equivalent estimate, not billed spend
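A sketch of the dominance test behind a Pareto frontier: a point is dominated when another point is at least as cheap and at least as good, and strictly better on one of the two. The inputs below are the rounded overall recall and full-pass cost from the combined table (the chart uses cost per run, which here is the same ordering); the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# name -> (cost in $ for one full pass, overall recall in %)
POINTS: Dict[str, Tuple[float, int]] = {
    "composer-2.5-fast": (0.58, 95),
    "opus-4.8·max": (2.26, 92),
    "opus-4.8·high": (0.87, 91),
    "opus-4.8·xhigh": (1.39, 91),
    "gpt-5.5·low": (0.63, 90),
    "opus-4.8·medium": (0.77, 90),
    "gpt-5.5·medium": (0.73, 89),
    "gpt-5.5·high": (0.88, 88),
    "gpt-5.5·xhigh": (1.67, 88),
    "opus-4.8·low": (0.67, 87),
}


def pareto_frontier(points: Dict[str, Tuple[float, int]]) -> List[str]:
    """Names of points that no other point dominates."""
    frontier = []
    for name, (cost, rec) in points.items():
        dominated = any(
            c <= cost and r >= rec and (c < cost or r > rec)
            for other, (c, r) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(POINTS))
```

On these rounded figures the cheapest config also has the highest recall, so it is the only non-dominated point.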

Efficiency

Recall per dollar

Overall recall divided by estimated cost per run — how much bug-finding each dollar buys. Tools with no cost reported are omitted.

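| # | config | recall ÷ $/run |
|---|---|---|
| 1 | composer-2.5-fast | 4.96 |
| 2 | gpt-5.5·low | 4.24 |
| 3 | opus-4.8·low | 3.91 |
| 4 | gpt-5.5·medium | 3.67 |
| 5 | opus-4.8·medium | 3.49 |
| 6 | opus-4.8·high | 3.13 |
| 7 | gpt-5.5·high | 3.00 |
| 8 | opus-4.8·xhigh | 1.95 |
| 9 | gpt-5.5·xhigh | 1.58 |
| 10 | opus-4.8·max | 1.22 |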

bar = recall ÷ $/run (higher is more bug-finding per dollar)
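A rough reconstruction of the ratio for composer-2.5-fast, under two assumptions: recall is taken as a fraction (not a percentage), and cost per run is the full-pass total divided by the 3 projects. With the rounded published inputs this lands close to, but not exactly on, the listed 4.96.

```python
def recall_per_dollar(found: int, possible: int, pass_cost: float, projects: int = 3) -> float:
    """Recall (as a fraction) divided by mean cost per run.

    Assumes cost per run = full-pass cost / number of projects.
    """
    recall = found / possible
    cost_per_run = pass_cost / projects
    return recall / cost_per_run


# composer-2.5-fast: 103 of 108 run-hits, $0.58 for one full pass.
score = recall_per_dollar(103, 108, 0.58)
print(round(score, 2))  # about 4.93 from rounded inputs
```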

Precision

Recall vs false positives

The second axis. Overall recall against false positives per run; points on the left edge reported none.

gpt-5.5 opus-4.8 composer-2.5-fast coderabbit bugbot

measured · false positives = safe code wrongly flagged · axis truncated to show the band

AI-generated interpretation

The writeup below is RESULTS.md, rendered verbatim — it was written by Claude (an AI) from the measured data. The charts are the measurement; this reading is one informal interpretation and may change as tools improve and projects are added.

An informal, honest benchmark of AI code-review tools. Each project hides a known set of planted bugs, and every tool is scored by an LLM-as-judge (Claude Opus, headless) against the answer key. Recall = planted bugs found. FP = false positives per run. Costs are API-equivalent estimates (what the measured tokens would cost on the API), not actual subscription spend. Default 3 runs per config; a bug counts by stability (found every run / some / never). See harness/ for how to reproduce.

Subjects: codex gpt-5.5 (efforts low→xhigh), claude opus-4.8 (low→max), cursor-agent composer-2.5-fast, cursor bugbot (run manually), and CodeRabbit CLI. The coding models and Bugbot share one standard prompt; CodeRabbit runs its own review engine (marked native, and it reports no tokens so it has speed but no cost). 30 automated model configs × 3 runs, plus Bugbot and CodeRabbit × 3 projects.

Caveats: small, fully-readable projects; a single LLM judge; one language (Python) so far; native tools aren't directly comparable on the prompt axis; costs are list-price estimates, not billed spend. Treat as a directional signal, not gospel.

Project 1 — `python-basic` (textbook web-backend bugs)

12 planted bugs: SQLi, command injection, pickle RCE, path traversal, IDOR, unsalted MD5, TOCTOU transfer, float-for-money, off-by-one pagination, mutable default, broad-except, is vs ==.

python-basic · n=3

| # | config | recall | FP | bonus | speed | $/run |
|---|---|---|---|---|---|---|
| 1 | composer-2.5-fast | 100% | 0.0 | +8.0 | 62s | $0.23 |
| 2 | gpt-5.5·high | 94% | 0.0 | +5.7 | 114s | $0.27 |
| 3 | gpt-5.5·xhigh | 92% | 1.3 | +4.0 | 278s | $0.67 |
| 4 | gpt-5.5·low | 89% | 0.3 | +4.3 | 77s | $0.19 |
| 5 | gpt-5.5·medium | 89% | 0.7 | +3.7 | 92s | $0.23 |
| 6 | bugbot manual | 89% | 0.0 | +4.3 | — | — |
| 7 | opus-4.8·medium | 86% | 0.0 | +4.3 | 44s | $0.25 |
| 8 | opus-4.8·xhigh | 86% | 0.0 | +3.3 | 115s | $0.47 |
| 9 | opus-4.8·max | 86% | 0.0 | +4.3 | 202s | $0.67 |
| 10 | opus-4.8·high | 83% | 0.0 | +4.0 | 59s | $0.31 |
| 11 | coderabbit native | 83% | 0.0 | +6.7 | 213s | — |
| 12 | opus-4.8·low | 78% | 0.7 | +5.0 | 36s | $0.21 |

Findings: near-saturated at the top. The famous vulnerabilities (SQLi, command injection, pickle, path traversal, MD5, TOCTOU) are caught 3/3 by essentially everyone. The whole spread comes from two non-flashy bugs: float-for-money (opus missed it almost entirely, going 0/3 at every effort except one lucky max run; codex needed high+ to lock it) and broad-except (only composer got it 3/3; half the configs never did). Effort plateaus or inverts: codex peaks at high (94%) and drops at xhigh (92%, and picks up 1.3 FP). composer reaches 100% for $0.23 while opus can't clear 86% even at max ($0.67). CodeRabbit, a purpose-built reviewer, lands at 83% (2nd from bottom), though its 6.7 bonus findings are second only to composer's 8.0, so it's thorough but misses more of the planted bugs.
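For readers unfamiliar with the float-for-money class of bug, here is a generic illustration (not the planted code itself): binary floats cannot represent most decimal fractions exactly, so sums of cents drift, while `decimal.Decimal` built from strings stays exact.

```python
from decimal import Decimal

# Binary floating point: 0.1 + 0.2 is not exactly 0.3.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimal constructed from strings keeps exact cents.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True
```

Storing amounts as integer cents is another common way to avoid the problem.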

Project 2 — `python-pricing` (subtle money-math correctness)

12 planted bugs, all subtle correctness, no famous vulns: unclamped discount → negative price, per-line coupon over-charge, tax on the wrong base, float-money refund, /30-vs-days_in_month proration, tier boundary off-by-one, truncate-vs-round, wrong discount base, dropped remainder, > vs >=, cents truncation, empty-input crash.

python-pricing · n=3

| # | config | recall | FP | bonus | speed | $/run |
|---|---|---|---|---|---|---|
| 1 | composer-2.5-fast | 94% | 0.0 | +2.7 | 36s | $0.13 |
| 2 | opus-4.8·low | 92% | 0.0 | +1.3 | 46s | $0.23 |
| 3 | opus-4.8·max | 92% | 0.0 | +2.0 | 276s | $0.93 |
| 4 | bugbot manual | 92% | 0.0 | +1.3 | — | — |
| 5 | gpt-5.5·medium | 89% | 0.3 | +0.7 | 118s | $0.27 |
| 6 | opus-4.8·medium | 89% | 0.0 | +1.3 | 55s | $0.28 |
| 7 | opus-4.8·high | 89% | 0.0 | +1.3 | 66s | $0.31 |
| 8 | opus-4.8·xhigh | 86% | 0.0 | +1.0 | 154s | $0.48 |
| 9 | gpt-5.5·low | 83% | 0.0 | +1.0 | 87s | $0.27 |
| 10 | gpt-5.5·high | 83% | 0.0 | +0.0 | 137s | $0.34 |
| 11 | gpt-5.5·xhigh | 83% | 0.0 | +0.7 | 232s | $0.56 |
| 12 | coderabbit native | 75% | 0.3 | +0.3 | 180s | — |

Findings:

One bug beats everybody. The unclamped discount that yields a negative price (C1) was caught 0/3 by every codex and every opus config, by Bugbot, and by CodeRabbit; only composer caught it, and only 1/3. It's the single hardest item in the suite, plausibly because "should this function validate its input?" is a judgment call, not a mechanical defect.
Effort barely moves it, and isn't monotonic. codex is 83% at low, high, and xhigh (with an 89% blip at medium); opus low (92%) ties opus max (92%) at a quarter of the cost ($0.23 vs $0.93). Reasoning depth buys nothing here.
codex has a crash blind spot. It caught the empty-input ZeroDivisionError (L3) at low/medium but went 0/3 at high and xhigh: more effort, fewer catches.
CodeRabbit comes last (75%): the specialist reviewer is out-recalled by every general model configuration on the subtle-money project.
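To make the hardest bug concrete, here is a hypothetical reconstruction of the pattern, not the project's actual code. It assumes the discount is passed as a fraction (0 to 1); the function names are invented for illustration.

```python
from decimal import Decimal


def apply_discount_unclamped(price: Decimal, percent: Decimal) -> Decimal:
    """Buggy pattern: trusts the caller, so percent > 1 goes negative."""
    return price * (1 - percent)


def apply_discount_clamped(price: Decimal, percent: Decimal) -> Decimal:
    """One possible fix: clamp the discount into the range [0, 1]."""
    clamped = min(max(percent, Decimal("0")), Decimal("1"))
    return price * (1 - clamped)


price = Decimal("20.00")
too_much = Decimal("1.5")  # a 150% discount

print(apply_discount_unclamped(price, too_much))  # negative price
print(apply_discount_clamped(price, too_much))    # floors at zero
```

Whether to clamp silently or raise an error is exactly the kind of judgment call the findings describe, which may be why reviewers rarely flag it.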

Project 3 — `python-scheduling` (date/time & calendar correctness)

12 planted bugs: one-directional overlap, cross-date conflict miss, naive-vs-UTC comparison, timedelta.seconds (drops .days), off-by-one recurrence, trailing free-slot, no range validation, touching-end busy, 30-day "months", insertion-order "first", inclusive-end contains, empty-schedule max() crash.

python-scheduling · n=3

| # | config | recall | FP | bonus | speed | $/run |
|---|---|---|---|---|---|---|
| 1 | opus-4.8·high | 100% | 0.0 | +0.7 | 59s | $0.26 |
| 2 | opus-4.8·xhigh | 100% | 0.0 | +1.0 | 127s | $0.44 |
| 3 | gpt-5.5·low | 97% | 0.3 | +1.7 | 72s | $0.17 |
| 4 | opus-4.8·max | 97% | 0.0 | +0.7 | 197s | $0.66 |
| 5 | opus-4.8·medium | 94% | 0.0 | +0.0 | 56s | $0.25 |
| 6 | bugbot manual | 94% | 0.0 | +1.7 | — | — |
| 7 | composer-2.5-fast | 92% | 0.0 | +1.7 | 59s | $0.21 |
| 8 | opus-4.8·low | 92% | 0.0 | +0.0 | 45s | $0.23 |
| 9 | gpt-5.5·medium | 89% | 0.0 | +2.3 | 97s | $0.22 |
| 10 | gpt-5.5·xhigh | 89% | 0.0 | +2.7 | 467s | $0.45 |
| 11 | gpt-5.5·high | 86% | 0.0 | +1.3 | 126s | $0.27 |
| 12 | coderabbit native | 67% | 0.0 | +1.0 | 124s | — |

Findings: the one project where opus leads, and the one where effort actually helps, but only up to high (low 92% → high 100%); it then plateaus at xhigh and dips at max (97%). For codex it inverts hard: low scores 97% at $0.17 in 72s, while xhigh scores 89% at $0.45 in 467s, more than six times slower and worse. codex also has a blind spot on the empty-schedule crash (L3): 0/3 at medium, high, and xhigh, while opus, composer, and bugbot all catch it. The trailing free-slot off-by-one (H3) is flaky for much of the field, the closest thing to a coin-flip in the set. CodeRabbit is last by a clear margin (67%), nearly 20 points below the next-lowest config (86%).
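Two of the scheduling bug classes are easy to show with the standard library alone. This is a generic illustration of the pitfalls, not the planted code: `timedelta.seconds` holds only the sub-day remainder, and stepping by 30 days drifts off the intended day of the month.

```python
from datetime import date, timedelta

# timedelta.seconds ignores whole days; total_seconds() does not.
gap = timedelta(days=1, hours=2)
print(gap.seconds)          # 7200
print(gap.total_seconds())  # 93600.0

# A 30-day "month" drifts: the 15th of January becomes the 14th of February.
start = date(2024, 1, 15)
print(start + timedelta(days=30))  # 2024-02-14
```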

What it all says

composer-2.5-fast is the value standout — 🥇 on basic and pricing, mid-pack on scheduling (92%), 0.0 false positives on all three projects, at $0.13–0.23/run. On this evidence it's the pick for routine bug-finding, and remarkable for its price.
The purpose-built review products don't win. CodeRabbit, a dedicated AI code reviewer, lands last or 2nd-from-last on all three (83 / 75 / 67%), and Bugbot, though clean, never tops a project either. The general coding models and agents out-recall the specialist tools. And because the whole (small) project is in front of every tool, this is a reasoning gap, not a retrieval one: a tool that misses a bug it can fully see is unlikely to do better on a larger codebase.
bugbot is the precision play — never wins a project (89 / 92 / 94%) but never adds a single false positive, and its extras are real. If a noisy reviewer is worse than a quiet one for your workflow, that profile matters more than raw recall. (CodeRabbit is the opposite trade: more extras, lower recall.)
No model wins everywhere. composer takes basic and pricing; opus owns date/time. Match the tool to the domain.
Effort ≠ care, and often ≠ value. Higher reasoning effort helped only on scheduling, only up to high. Elsewhere it was flat (opus low = max on pricing) or inverted (codex low beat xhigh on scheduling; codex lost crash catches at high/xhigh). The xhigh/max tiers almost never earned their 3–5× cost or their multi-minute latency. Thoroughness reads as a model property, not a knob you can turn up.
The discriminating axis is subtle correctness, not exotic topics. Famous vulnerabilities (basic) near-saturate; famous patterns don't separate tools at all (an earlier concurrency project scored everyone ~100% and was scrapped). The sharpest single discriminators here were the quiet ones: a discount that goes negative, a swallowed exception, a float where cents belong.
False positives stayed rare and small. Almost every FP > 0 came from codex — spread across low, medium, and xhigh (up to 1.3 on basic xhigh), not clustered at any one tier — plus a single opus low run and one CodeRabbit pricing run (0.3). composer and bugbot never produced one across all three projects.

Method, provenance, and how to add tools/projects: harness/docs/ and harness/REFERENCES.md. Full per-bug tables: harness/reports/summary.md.

Why this exists

I care about code review. As AI writes more of the code — and writes it faster — picking the right reviewer has quietly become the thing I spend the most effort on, and the thing I could find the least honest data about. Almost every tool claims higher recall and less noise than the rest; almost none of them show the numbers.

This isn't a formal or authoritative benchmark, and it may not discriminate as sharply as I'd like. But even a small, open experiment starts to surface the questions that actually matter: does more reasoning effort really help? Do the purpose-built reviewers earn their price? — especially as that price climbs: free tiers are tightening, and hosted reviews are creeping toward a dollar a pull request. When review gets expensive, running your own in CI (something like pr-agent, or a small harness of your own) becomes a real way to keep costs down — which turns “which model, at what effort, for how much time and money?” into a question worth measuring.

I don't expect this to be the last word. I just hope it's a useful reference for anyone wrestling with the same choice — read the results here, or fork it, add your own projects, and run it yourself.

an informal, open experiment · MIT-licensed
