Why does Claude say bugs are fixed when they're not?

Claude doesn't run your app the way you do. It pattern-matches on what fixes for similar bugs look like in its training data, writes the change, and assumes success. Without a reproducible test that returns pass or fail, the model has no way to verify its own work. The fix: give Claude a measurable repro and require it to confirm the test still fails before fixing, then passes after.

What is a "repro" in debugging?

A repro is a documented case that reliably triggers a bug. It has three parts: the exact steps to perform, the expected behavior, and the actual broken behavior. A good repro can be run by anyone (or any AI agent) and produces a clear pass/fail signal. Without a repro, you're debugging from memory and guesses — which is why AI tools like Claude struggle without one.

How do I narrow down a bug to find its root cause?

Use methodical deletion. Verify the bug exists, remove a chunk of code, run the repro again. If the bug is still there, commit and delete more. If the bug disappears, roll back and try smaller deletions. Keep shrinking until the bug isolates itself — at that point the root cause is usually obvious because it's the only thing left.

Should I trust Claude Code to fix bugs autonomously?

Only when you've given it a verifiable test. Claude Code is excellent at executing well-defined steps — including running tests, deleting code, and reporting results. It's poor at judging whether a fix actually worked when the only evidence is "the error message is gone." Always pair autonomous fixing with a repro that asserts the desired behavior, not just the absence of the previous error.

How long should I spend on a bug before building a formal repro?

Two failed fix attempts. That's the threshold. If you've patched twice or Claude has swung twice without resolving it, stop improvising and write the repro. The hour spent formalizing the bug almost always beats the four-to-six hours spent guessing your way through it.

Back to writing

AI · Claude · Claude Code

How to Fix Any Bug: The Repro-First Method (When Claude Can't)

By Abdulkader Safi / May 28, 2026 / 7 min read

Claude keeps declaring bugs 'fixed' that aren't. Here's the repro-first debugging method I use to actually nail root causes — works for AI and humans.

Developer running a reproducible test to debug a bug Claude couldn't fix

On this page

Claude told me the bug was fixed. It wasn't.

Then it told me again. Still broken. Third time — broken. By round four I stopped reading the explanations and just hit "no, try again" like a slot machine player chasing a win that wasn't coming.

This is the actual problem with vibe coding nobody wants to say out loud: the AI has no way to know if its fix worked. It guesses, ships, declares victory. You're left holding the broken app.

Dan Abramov wrote a great piece walking through how he debugged a React Router scroll bug that defeated Claude six times in a row. The method he used is the same one I now run on every bug that survives more than two AI attempts. Here's the playbook — slightly remixed for builders who live inside Claude Code.

Why Claude Keeps Pretending Bugs Are Fixed

Dan put it bluntly: "Claude was repeatedly wrong because it didn't have a repro." That's the whole game right there.

Without a reliable test that returns bug present or bug gone, the model is just pattern-matching against your codebase and hoping. It reads the code, sees something that "looks like" a fix in its training data, writes the change, and moves on. No verification. No proof.

I see this constantly when running Claude Code workflows. The model is brilliant at generating code that might work — and terrible at confirming it does. That's not Claude's fault. That's how LLMs work. They predict tokens. They don't run your app.

Your job, as the human in the loop, is to give Claude something it can actually verify against.

That something is called a repro.

Step 1: Build a Repro That Can Actually Lie to You

A repro has three parts. Skip any one and you've got nothing:

What to do — the exact steps to trigger the bug
What should happen — the expected behavior
What actually happens — the broken behavior

The trap most people fall into? Building a repro that only they can read.

Dan's first repro was "watch the page jitter when you click the button." Useless. Claude can't see jitter. I can't automate jitter. You can't measure jitter at 3am when you're tired and your eyes are crossed.

So he converted it to something measurable: compare the scroll position before and after the click. A number. Numbers don't lie. Numbers can be checked by a script. Numbers can be checked by Claude.

The rule: if your repro can't be confirmed by code you didn't write, it's not a repro yet. Keep cooking until it is.

How do I know a bug needs a formal repro vs. just a quick fix?

Easy heuristic — two failed attempts and you're done improvising. If Claude has swung at it twice and missed, or you've patched twice and the bug returned, stop. Build the repro. The hour you "lose" writing the test is the hour you'd otherwise lose on attempts four, five, six, and seven.

Step 2: Shrink the Repro Without Breaking It

Once you've got a working repro, your next move is to make it smaller. Faster to run. Fewer moving parts. This is where most people accidentally destroy the case they were building.

Here's the danger Dan flags — and it bit me hard on a Laravel API bug last year. You start removing pieces of the setup, the test still fails, you think "great, that part wasn't relevant." But maybe what you're now seeing is a different bug. A new one. The original is still hiding under everything you stripped away.

The fix is dead simple: after every narrowing step, prove that positive results are still reachable. If your repro is "scroll position doesn't change," then verify that removing the suspected trigger makes scroll position change. If it does, your narrowed repro is still about the original bug. If it doesn't — congratulations, you've been chasing a ghost for the last 20 minutes. Roll back.

I make Claude do this verification step explicitly now. Two prompts:

1. Run the repro. Confirm bug still reproduces.
2. Apply the suspected fix. Confirm bug is now gone.
   If "gone" doesn't actually mean fixed (e.g. the test just stopped running),
   stop and tell me.

That second instruction stops Claude from pulling the "I deleted the test, so the test passes" move. Which, yes, it will do.

Step 3: Delete Code Until the Bug Sits Alone in a Room

This is the part nobody likes. It feels destructive. It is.

You repeat this loop:

Verify the bug still happens
Delete a chunk of code
Run the repro
If the bug is still there → commit, delete more
If the bug vanished → reset, delete smaller chunks instead

Dan calls this "well-founded recursion" — you must always be making measurable progress toward a smaller codebase that still has the bug. No skipping. No "let me try a theory first." Theories are how Claude gets confused. Theories are how you spend Saturday on a wild goose chase.

I've watched Claude blow this step every single time I let it improvise. It wants to add code to test theories. New console.log statements. New try/catch blocks. New observability wrappers. None of that shrinks the repro — it grows the haystack.

When I run this step with Claude, the system prompt is roughly:

Your only job right now is to delete code while the repro still fails.
You may not add code. You may not refactor. You may not write logs.
You may rename or comment out, but every action must reduce code surface.
After each deletion, run the repro and report pass/fail.
Stop when no further deletions are possible without losing the bug.

That constraint flips Claude from a brainstormer into a methodical cutter. Which is what you need.

This connects directly to a point I made in Prompt vs Context Engineering — the model isn't dumb, it's just doing whatever your prompt rewards. Reward deletion, you get deletion. Reward "looks like a fix," you get fake fixes.

Step 4: When the Bug Is Alone, the Root Cause Is Obvious

Dan's bug — after all the deletion — came down to React Router's ScrollRestoration component firing on revalidation instead of route change. An upstream bug. Already fixed in a newer version. Two-minute upgrade.

That's the magic of the method. Once you've stripped the codebase to the smallest possible repro, the root cause stops hiding. It's the only thing left in the room with you.

I've now hit this pattern on six different production bugs across client work — three in Laravel, two in React Native, one in a SvelteKit edge function. Every single time, the "obvious" first guess was wrong, and the actual root cause only surfaced after the repro had been shrunk to maybe 40 lines total.

If you want to see what disciplined AI debugging looks like at full scale on a complex stack, the team at DSRPT walks through this exact repro-first pattern on enterprise builds where you can't just "try a fix in prod." The principles are identical — formalize the bug before you touch the code.

What if my bug only reproduces in production?

You still need a repro. It just lives in a different place. Capture the exact request/response, the user state, the database snapshot, and the timing. Then build a local environment that replays all of that. Yes, it's annoying. No, you cannot skip it. Production-only bugs eat weeks if you guess; they eat hours if you can replay them locally.

What Most Vibe Coders Get Wrong About Debugging With AI

Hard truth from running Claude Code workflows with skills on real client projects:

→ You are not lazy if the AI is fixing your bug. You are lazy if you're not verifying the fix. → "Looks fixed" is not fixed. A passing test is fixed. A green repro is fixed. Vibes are not fixed. → Claude is great at the boring 90%. It's the methodical 10% — the disciplined deletion — where it needs you driving.

I rant about this in the vibe coding lie — the tools aren't the problem, the workflow is. Repro-first is the workflow that makes the tools actually work.

Can I just ask Claude to write the repro for me?

Yes — and you should. Here's the prompt I use, copy-paste:

Bug description: [paste it]
Don't fix the bug yet. First, write me a repro that:
1. Sets up the minimum state to trigger it
2. Returns a clear pass/fail result (boolean, exit code, or assertion)
3. Doesn't depend on me visually inspecting anything
Run the repro and confirm it currently fails.
Then stop and wait for my next instruction.

That last sentence is the most important one. "Stop and wait" prevents the cowboy mode where Claude builds the repro, immediately invents a fix, runs the new test (which it conveniently broke), and declares victory.

What to Do Right Now

Pick the bug that's been mocking you all week. The one you've already "fixed" twice. Walk through these four steps:

Write the repro down — 3 sentences: do this, expect this, get this instead.
Convert "expect this" into something a script could check. A number. A boolean. A string match.
Shrink the setup until it's the smallest thing that still fails.
Delete code methodically until the bug isolates itself.

That's it. That's the whole method. It's older than AI. It works better with AI when you actually run it instead of letting the model fake it.

If you found this useful, the Claude Code best practices guide is where I keep the live version of all the prompts and skills I'm running on production bugs. Bookmark it, steal what you need, and stop letting Claude lie to you about what it just shipped.

Abdulkader Safi Software Engineer & AI Automation Builder

Building scalable systems and developer-first tools. Lead Software Engineer at DSRPT.

← Previous Claude Cowork vs OpenClaw vs Hermes Agent: Honest Pick (2026) Next → Should You Build a Desktop App with Godot? An Honest Take

FAQ

Frequently asked

: Claude doesn't run your app the way you do. It pattern-matches on what fixes for similar bugs look like in its training data, writes the change, and assumes success. Without a reproducible test that returns pass or fail, the model has no way to verify its own work. The fix: give Claude a measurable repro and require it to confirm the test still fails before fixing, then passes after.
: A repro is a documented case that reliably triggers a bug. It has three parts: the exact steps to perform, the expected behavior, and the actual broken behavior. A good repro can be run by anyone (or any AI agent) and produces a clear pass/fail signal. Without a repro, you're debugging from memory and guesses — which is why AI tools like Claude struggle without one.
: Use methodical deletion. Verify the bug exists, remove a chunk of code, run the repro again. If the bug is still there, commit and delete more. If the bug disappears, roll back and try smaller deletions. Keep shrinking until the bug isolates itself — at that point the root cause is usually obvious because it's the only thing left.
: Only when you've given it a verifiable test. Claude Code is excellent at executing well-defined steps — including running tests, deleting code, and reporting results. It's poor at judging whether a fix actually worked when the only evidence is "the error message is gone." Always pair autonomous fixing with a repro that asserts the desired behavior, not just the absence of the previous error.
: Two failed fix attempts. That's the threshold. If you've patched twice or Claude has swung twice without resolving it, stop improvising and write the repro. The hour spent formalizing the bug almost always beats the four-to-six hours spent guessing your way through it.

Enjoyed this? Let's talk.

Start a conversation →