June 16, 2026

The Judge: Why a Good Bot Never Trusts Its First Answer

Most data bots are built on a hidden assumption, and the assumption is false. The pipeline reads: the model writes a query, the system runs it, the results come back, the model summarizes them. Clean, linear, and it quietly presumes the query was correct on the first try.

It won't be. Not in the real world, where users ask things your demo never covered. The model picks a left join where it needed a right one. It filters on gender = 3 when the code was 2. It joins two tables on the wrong key. Even with a solid association layer, the first-attempt error rate sits somewhere uncomfortable — call it ten percent. For a system people are supposed to rely on, ten percent confident-and-wrong is a non-starter.

And here's the trap that makes it worse: a query can be perfectly valid and still completely wrong. SELECT 0 runs without error. A query with a broken filter returns zero rows, cleanly, and zero rows looks like a real answer. Syntactic validity tells you the database understood the query. It tells you nothing about whether the query answered the question.
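
To make the trap concrete, here is a minimal sketch using Python's built-in sqlite3 module. The employees table, its columns, and the gender codes are invented for illustration; the point is only that the wrong filter raises nothing and returns a clean zero.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, gender INTEGER, dept TEXT);
    INSERT INTO employees (gender, dept) VALUES (1, 'sales'), (2, 'sales'), (2, 'ops');
""")

# Assume the code for women is 2, but the generated query filtered on 3.
wrong = conn.execute("SELECT COUNT(*) FROM employees WHERE gender = 3").fetchone()[0]
right = conn.execute("SELECT COUNT(*) FROM employees WHERE gender = 2").fetchone()[0]

print(wrong)  # 0: no error raised, yet it does not answer the question
print(right)  # 2
```

Both statements are equally valid as far as the database is concerned. Only one of them answers the question.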

So you stop trusting the first answer. You build a loop.

Treat it like a cursor, not a one-shot

The shift in mindset is from generation to iteration. Instead of asking the model for the answer, you ask it for an attempt — then you check the attempt, and if it's wrong you hand the failure back and ask again, better. This is how a careful analyst actually works: write something, run it, look hard at the result, fix it, repeat.

The loop has four roles:

  • A planner proposes a query, along with which tables it thinks it's using and why.
  • An executor runs it safely — read-only, with a timeout and a row cap — and captures either the results or the error.
  • A judge decides whether the result is good enough, and if not, says precisely why.
  • A feedback step hands the original question, the failed query, and the judge's reason back to the planner for another pass.
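
The executor role is the easiest one to sketch. The version below assumes SQLite and Python's sqlite3 module; run_readonly and its parameters are hypothetical names, and other databases would enforce read-only access and timeouts through their own mechanisms.

```python
import sqlite3
import time

def run_readonly(conn, query, timeout_s=2.0, max_rows=100):
    """Run a query read-only with a time limit and a row cap.

    Returns (rows, error); exactly one of the two is None.
    """
    deadline = time.monotonic() + timeout_s
    # SQLite rejects writes on this connection once query_only is on.
    conn.execute("PRAGMA query_only = ON")
    # The handler runs every 1000 VM instructions; a truthy return aborts the query.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        rows = conn.execute(query).fetchmany(max_rows)  # row cap
        return rows, None
    except sqlite3.Error as exc:
        return None, str(exc)
    finally:
        conn.set_progress_handler(None, 0)  # remove the timeout hook

# Hypothetical table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

rows, err = run_readonly(conn, "SELECT x FROM t ORDER BY x", max_rows=2)
print(rows, err)  # [(1,), (2,)] None

rows, err = run_readonly(conn, "DELETE FROM t")
print(rows, err is not None)  # None True
```

Returning the error as data rather than raising it is a deliberate choice here: the judge and the feedback step both need the literal message.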

You give the loop a budget — a maximum number of attempts, a maximum time — and you let it iterate inside that budget. The model is no longer guessing into the void. It's correcting against concrete evidence.
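
Here is one way the budgeted loop might look. Everything in it is a sketch: run_loop, Verdict, Attempt, and the three stub functions are hypothetical names, and the stubs stand in for a real model, a real database, and a real judge.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    status: str        # "accept", "reject", or "fail"
    reason: str = ""

@dataclass
class Attempt:
    query: str
    result: Optional[list] = None
    error: Optional[str] = None

def run_loop(question, plan, execute, judge, max_attempts=3, max_seconds=30.0):
    """Iterate plan -> execute -> judge until accepted or the budget runs out."""
    deadline = time.monotonic() + max_seconds
    feedback = None
    for _ in range(max_attempts):
        if time.monotonic() > deadline:
            break
        query = plan(question, feedback)
        attempt = execute(query)
        verdict = judge(question, attempt)
        if verdict.status == "accept":
            return verdict, attempt
        # Hand the failure back to the planner for the next pass.
        feedback = {"question": question, "failed_query": query, "reason": verdict.reason}
    return Verdict("fail", "budget exhausted"), None

# Toy stubs: the planner only gets it right once it has feedback.
def plan(question, feedback):
    return "SELECT 2" if feedback else "SELECT 0"

def execute(query):
    return Attempt(query=query, result=[(int(query.split()[1]),)])

def judge(question, attempt):
    if attempt.result == [(0,)]:
        return Verdict("reject", "count is 0 but sampling says matching rows exist")
    return Verdict("accept")

verdict, attempt = run_loop("How many women work in sales?", plan, execute, judge)
print(verdict.status, attempt.query)  # accept SELECT 2
```

When the budget is exhausted the loop returns a "fail" verdict with no attempt attached, which is what lets the caller say "I couldn't verify this" instead of shipping the last draft.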

The judge decides right, not just runnable

The judge is a separate component on purpose. Self-correction — an agent patching its own work in a retry loop — is useful, but it's the same mind grading its own homework. A judge is an external critic with one job: deciding whether this query actually answers this question. Its verdicts are blunt: accept, reject (with a reason that goes back into the loop), or fail (the budget's exhausted; stop and be honest about it).

What does it check? Four things, cheapest first:

  1. Did it run? Syntactic and execution validity — the floor, not the goal.
  2. Does it use real columns? A hallucinated column name is caught here, before it ever reaches a user.
  3. Is it safe? No destructive operations, a row limit in place, nothing that scans a giant table into oblivion.
  4. Does the result make sense? This is the one that matters. A "how many" question should return a single number. A "list the…" question returning zero rows is suspect unless there's a reason. The shape of the answer should match the shape of the question.
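
The four checks can be sketched as a single function. This is an illustration under assumptions, not a complete judge: the function name, the keyword rules for "how many" and "list" questions, and the max_rows default are all invented, and the sketch assumes SQLite, where a hallucinated column surfaces as a "no such column" execution error, so checks 1 and 2 both read the error text.

```python
import re
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str        # "accept" or "reject"; "fail" comes from the loop's budget
    reason: str = ""

def judge(question, query, rows, error, max_rows=1000):
    # Checks 1 and 2: did it run, and does it reference real tables and columns?
    if error is not None:
        kind = "unknown table or column" if "no such" in error else "execution error"
        return Verdict("reject", f"{kind}: {error}")
    # Check 3: a coarse safety screen. Real read-only enforcement
    # belongs in the executor; this only catches obvious cases.
    if not re.match(r"\s*(select|with)\b", query, re.IGNORECASE):
        return Verdict("reject", "only read-only SELECT queries are allowed")
    if len(rows) > max_rows:
        return Verdict("reject", f"result exceeds the {max_rows}-row cap")
    # Check 4: does the shape of the answer match the shape of the question?
    q = question.strip().lower()
    if q.startswith("how many") and not (len(rows) == 1 and len(rows[0]) == 1):
        return Verdict("reject", "a 'how many' question should return a single number")
    if q.startswith("list") and not rows:
        return Verdict("reject", "a 'list' question returned zero rows; verify the filters")
    return Verdict("accept")

ok = judge("How many women work in sales?", "SELECT COUNT(*) FROM employees", [(12,)], None)
empty = judge("List the managers", "SELECT name FROM employees WHERE role = 'mgr'", [], None)
typo = judge("How many orders?", "SELECT COUNT(*) FROM ordrs", None, "no such table: ordrs")
print(ok.status, empty.status, typo.status)  # accept reject reject
```

Note what the shape check cannot do on its own: a "how many" query that returns a single 0 passes it. Deciding whether that zero is believable takes evidence from the data.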

That fourth check is where confident-wrong answers go to die — but only if the judge has something to compare against. Which is the real trick.

Sampling: give the judge ground truth to reason from

The judge can't tell whether "zero rows" is correct or catastrophic by staring at the query. It needs to probe the data. So before — or alongside — the main query, the agent fires off small, cheap reconnaissance queries against the tables involved.

The canonical example says everything. The user asks how many women work in a particular department, and the query returns zero. Is that right? The agent samples: the total count of women returns 500, and the count of rows in that department returns a healthy number. So women exist, the department exists, and yet the intersection is empty. That's suspicious: almost certainly a broken join or a wrong filter, not a real answer. The agent now has a reason to reject and retry, and a concrete clue about where it went wrong.
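
A small sketch of that exchange, using sqlite3 with an invented schema and a handful of rows standing in for the real tables. The draft query joins on the wrong key, and two probe counts expose the problem.

```python
import sqlite3

# Hypothetical schema and data for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, gender INTEGER, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'sales'), (20, 'ops');
    INSERT INTO employees (id, gender, dept_id) VALUES
        (1, 2, 10), (2, 2, 10), (3, 1, 10), (4, 2, 20);
""")

def scalar(sql, params=()):
    """Run a query and return the first column of the first row."""
    return conn.execute(sql, params).fetchone()[0]

# Draft query with a bug: it joins on e.id instead of e.dept_id.
main = scalar("""
    SELECT COUNT(*) FROM employees e
    JOIN departments d ON d.id = e.id
    WHERE e.gender = 2 AND d.name = 'sales'
""")

# Cheap probes against the same tables.
women = scalar("SELECT COUNT(*) FROM employees WHERE gender = 2")
in_dept = scalar("""
    SELECT COUNT(*) FROM employees
    WHERE dept_id = (SELECT id FROM departments WHERE name = ?)
""", ("sales",))

# Both sides are populated, yet the intersection is empty: reject and retry.
suspicious = main == 0 and women > 0 and in_dept > 0
print(main, women, in_dept, suspicious)  # 0 3 3 True
```

The probe is a heuristic, not a proof. Two non-empty sets can have an empty intersection for real, which is why the judge treats this as grounds for another attempt rather than as a final ruling.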

This is the deep idea: treat a complex query as a composite of data flows, and validate the pieces independently with cheap heuristics. Check the join in isolation. Check each filter against a raw count. Verify the lookup value resolves to a real row. You're not just asking "does the whole thing run" — you're interrogating each tributary before you trust the river. It's how you reduce errors dramatically, because most failures are local: one wrong join, one wrong code, in an otherwise sound query.
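
One way to interrogate each tributary is to give every piece its own count query and report the ones that come back empty. The helper name, the schema, and the three piece queries below are assumptions made up for this sketch.

```python
import sqlite3

def failing_pieces(conn, pieces):
    """Return the names of component checks whose COUNT(*) came back as zero."""
    return [name for name, sql in pieces.items()
            if conn.execute(sql).fetchone()[0] == 0]

# Hypothetical data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, gender INTEGER, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'sales');
    INSERT INTO employees VALUES (1, 2, 10), (2, 1, 10);
""")

# Each piece of the composite query, checked in isolation.
pieces = {
    "gender filter": "SELECT COUNT(*) FROM employees WHERE gender = 3",
    "department lookup": "SELECT COUNT(*) FROM departments WHERE name = 'sales'",
    "join": "SELECT COUNT(*) FROM employees e JOIN departments d ON d.id = e.dept_id",
}
print(failing_pieces(conn, pieces))  # ['gender filter']
```

Because the failure is local, the output is local too: the join and the lookup are fine, and the planner is pointed straight at the gender code.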

You get to choose how eagerly to probe. The thorough strategy fires sampling queries in parallel with generation, so the evidence is waiting the moment the draft arrives — worth it for complex questions. The frugal strategy only samples when something already looks off — cheaper, fine for simple ones. Either way, the cost is trivial: a COUNT(*) is nothing next to the value of catching a wrong answer before it ships.
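
The two strategies differ only in when the probes run. A rough sketch, with generate, run_probes, and looks_off as hypothetical callables standing in for the model call, the sampling queries, and whatever "looks off" means in your system:

```python
from concurrent.futures import ThreadPoolExecutor

def answer(question, generate, run_probes, looks_off, eager):
    """Return (draft_query, probe_results); probe_results is None if never sampled."""
    if eager:
        # Thorough: sample in parallel with generation so evidence is ready
        # by the time the draft arrives.
        with ThreadPoolExecutor(max_workers=1) as pool:
            probes_future = pool.submit(run_probes, question)
            draft = generate(question)
            return draft, probes_future.result()
    # Frugal: generate first, sample only when the draft looks suspicious.
    draft = generate(question)
    probes = run_probes(question) if looks_off(draft) else None
    return draft, probes

# Toy stubs for the demonstration.
def generate(question):
    return "SELECT COUNT(*) FROM employees WHERE gender = 3"

def run_probes(question):
    return {"women": 500}

draft, probes = answer("How many women?", generate, run_probes, lambda d: False, eager=True)
print(probes)  # {'women': 500}

draft, probes = answer("How many women?", generate, run_probes, lambda d: False, eager=False)
print(probes)  # None
```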

Feedback is where the intelligence compounds

The loop only works if the feedback is concrete. "Try again" teaches the model nothing. The feedback that actually moves the error rate is specific and grounded:

  • The original question.
  • The exact query you last tried.
  • The literal database error, or "returned 0 rows, but the department has 240 employees, so this is probably wrong."
  • A pointed hint: "the gender filter resolved to no rows; check the code value."

Hand the model that, and it doesn't guess at a new query — it reasons about a specific failure and fixes the actual problem. This is the difference between a system that flails and one that converges. In practice, this loop is what drags the real-world error rate from that uncomfortable ten percent down into the low single digits, because the model stops gambling and starts debugging.
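
Assembling that feedback is mostly string building. A minimal sketch, assuming the planner accepts a plain-text prompt; the function name and the field labels are invented here.

```python
def build_feedback(question, failed_query, observation, hint=None):
    """Compose a concrete retry message for the planner."""
    parts = [
        f"Original question: {question}",
        f"Previous query: {failed_query}",
        f"What happened: {observation}",
    ]
    if hint:
        parts.append(f"Hint: {hint}")
    parts.append("Write a corrected query that addresses the problem above.")
    return "\n".join(parts)

message = build_feedback(
    question="How many women work in sales?",
    failed_query="SELECT COUNT(*) FROM employees WHERE gender = 3 AND dept = 'sales'",
    observation="returned 0 rows, but the department has 240 employees, so this is probably wrong",
    hint="the gender filter resolved to no rows; check the code value",
)
print(message)
```

Every line in the message is something the model can act on. Compare that with a bare "try again", which carries no information the model did not already have.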

What you've actually built

Strip away the mechanics and the judge loop is one thing: grounding. You are anchoring every answer in verifiable evidence from the data itself, and refusing to let a fluent, plausible, untested query reach a human. A bot built this way is calibrated — confident when the evidence backs it, cautious when it doesn't, and willing to say "I couldn't verify this" when the budget runs out instead of inventing a number.

That last behavior is the whole point. The failure mode you're designing against was never "the bot can't answer." It was "the bot answers wrongly, and sounds certain doing it." The judge loop is how you engineer that out — not by hoping for a better first draft, but by refusing to trust any draft until the data agrees.

