How We Hit 94% Accuracy at 20% of the Tokens Across 46 Repos

The claim that a verifiable IR beats vector search and GraphRAG only matters if the numbers hold up. So here is the run. 46 Kubernetes-ecosystem repositories, 150,000 files, 8 GB of code, measured end to end. What we tested, how we scored it, and why feeding the model 22x more context made it cheaper and more accurate at the same time.

It is easy to write a blog post that says vector search is shallow, GraphRAG is one-way, and a verifiable intermediate representation is the future. We have written those posts. They are arguments. This is not an argument. This is the run.

We took the claim and put a number on it on real code, at real scale, and we are going to walk you through exactly what we measured, how we scored it, and where it surprised us. The headline is that ByteBell answered correctly 94% of the time while using about a fifth of the tokens the baselines burned. The more interesting part is why those two things happened together, because the intuition most people have is that more accuracy costs more tokens, and what we saw was the opposite.

The setup

We did not benchmark on toy repositories. We pulled 46 repositories from the Kubernetes ecosystem, which is roughly 150,000 files and about 8 GB of pure code. We picked this corpus on purpose. It is large, it is real, it is the kind of polyrepo, microservice-shaped system where AI coding tools actually struggle, and crucially the hard questions span repositories. A change in one repo breaks something in another, and the connection is rarely a direct function call. It is an event bus, a shared contract, a generated client, a deployment assumption. This is exactly the terrain where similarity search and even structural graphs run out of road.

Against that corpus we ran a set of tasks of the kind engineers actually ask. Where is this behavior implemented? What breaks if I change this interface? Trace this request from the API boundary to the thing that persists it. Generate the patch that does this without breaking that. Some tasks lived inside one repo. The ones we cared about most crossed several.

What we compared

We ran the same tasks three ways. First, a strong frontier model with full prompt caching and a hand-written claude.md, which is the setup most serious teams are running today. Second, a graph-based retrieval approach, the code graph RAG pattern, traversing call edges to assemble context. Third, ByteBell, which compiles every file once into a verifiable IR that carries purpose, business context, and cross repository relationships, and serves it over MCP.

Same models doing the actual reasoning. Same tasks. The only thing that changed was the representation feeding the model.

How we scored accuracy

Accuracy in code intelligence is easy to fudge, so we were strict about it. An answer counted as correct only if it identified the right files and symbols, the cross repository links it claimed were real and verifiable, and any code it generated actually did the thing without breaking the thing it was not supposed to touch. Every answer had to come with citations to exact file paths and line numbers, because an answer you cannot check is not an answer an engineer will trust. We did not give partial credit for sounding plausible. Plausible and wrong is the failure mode we are trying to kill.

On that bar, ByteBell came in at 94% across the task set. The baselines were well behind, and the gap was not evenly spread. On single repo questions everything did reasonably well, which is what you would expect. The whole gap opened up on the cross repository tasks, and on the hardest of those the frontier baseline with caching and claude.md frequently did not finish at all. It is not that it answered wrong. It could not assemble enough understanding of how the repos connected to even complete the task. ByteBell finished those, and that is where most of the accuracy delta came from.
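The all-or-nothing scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `GradedAnswer` record and its field names are hypothetical, and it assumes each criterion has already been judged true or false for an answer.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical record of how one answer was graded (field names are assumptions)."""
    right_files_and_symbols: bool    # identified the correct files and symbols
    cross_repo_links_verified: bool  # every claimed cross repository link checked out
    generated_code_ok: bool          # generated code works and breaks nothing else
                                     # (assumed True when the task produced no code)
    has_citations: bool              # cites exact file paths and line numbers


def is_correct(a: GradedAnswer) -> bool:
    # No partial credit: an answer passes only if every criterion holds.
    return (a.right_files_and_symbols and a.cross_repo_links_verified
            and a.generated_code_ok and a.has_citations)


def accuracy(answers: list[GradedAnswer]) -> float:
    # Fraction of answers that pass all criteria; empty input scores 0.0.
    if not answers:
        return 0.0
    return sum(is_correct(a) for a in answers) / len(answers)
```

Under a rule like this, an answer that names the right files but claims one cross repository link that cannot be verified scores the same as a wrong answer, which is the point of refusing partial credit.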

The part that surprised people: fewer tokens, not more

Here is the result that breaks intuition. ByteBell feeds the model far more context than a normal setup, on the order of 22 times more, because it is handing over a rich representation of how everything connects. Common sense says that should cost a fortune in tokens. It did the opposite. Average cost per task dropped from $0.73 to $0.22, about 70% cheaper, and overall the model used roughly a fifth of the tokens the baselines did to reach a correct answer.

The reason is about where tokens actually go. When a model does not understand a codebase, it explores. It reads a file, realizes it is the wrong file, reads another, follows a dead end, backs out, searches again. Most of the tokens an AI coding agent spends are not spent reasoning. They are spent reading and re-reading files trying to find the thread. One developer who tracked 100 million tokens of Claude Code usage found a 166 to 1 ratio, 166 tokens read for every 1 token written. That exploration is the tax, and it is enormous.

A good representation pays that tax once, at index time, and then the model stops exploring. It is handed the relevant slice with the connections already drawn, so it goes more or less straight to the answer. More context, given correctly, means less searching, and less searching means fewer tokens. That is how you get more accurate and cheaper in the same run. It is not a trick. It is what happens when you stop making the model rediscover the codebase on every question.

We also saw responses come back about 70% faster, for the same reason. Time the model is not spending on dead end file reads is time it is not spending at all.
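As a quick sanity check on two figures quoted in this section, the arithmetic can be worked through in a few lines. The sketch below only restates numbers from the text (a per-task cost of 0.73 falling to 0.22, and 166 tokens read per token written); the helper names are made up for illustration.

```python
# Figures quoted in the post.
cost_before = 0.73       # average cost per task, baseline
cost_after = 0.22        # average cost per task, ByteBell
read_per_written = 166   # tokens read for every 1 token written


def relative_saving(before: float, after: float) -> float:
    """Fraction of the original cost that was eliminated."""
    return (before - after) / before


def read_share(read: float, written: float = 1.0) -> float:
    """Fraction of all tokens (read plus written) that were reads."""
    return read / (read + written)


saving = relative_saving(cost_before, cost_after)
share = read_share(read_per_written)
print(f"cost saving per task: {saving:.1%}")   # 69.9%
print(f"share of tokens read: {share:.1%}")    # 99.4%
```

The first result matches the "about 70% cheaper" claim. The second shows why exploration dominates spend at a 166 to 1 ratio: over 99% of the tokens in that usage log were reads.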

Why a graph alone did not close the gap

The graph approach did better than plain vector search, which lines up with the published research. RepoGraph reported 32.8% on SWE-bench with graph retrieval, CodexGraph beat similarity-only retrieval, and a January 2026 paper found AST-derived graphs score highest on architectural queries at low cost. Graphs are good. We use one too.

But the graph plateaued exactly where the theory says it should. It could follow the call edges, so it handled “what calls this” well. It struggled on the cross repository questions where the connection was not a call edge at all but a shared meaning, and it could not answer the why questions, because the reason a dependency exists was never in the syntax it parsed. When a task needed generated code, the graph had the same blind spot: it could not check the result against what the system was actually meant to do, because intent is not in the syntax either. The verifiable IR keeps the meaning and checks every change against it, and that difference is what turned the remaining failures into passes.
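The call-edge limitation can be shown with a toy example. The sketch below is not ByteBell's data model or any real graph RAG library; every repo and symbol name is invented, and the "semantic" edge simply stands in for a relationship, such as a shared event topic, that no call edge expresses.

```python
from collections import deque

# Hypothetical call edges inside one repo (names are made up for illustration).
CALL_EDGES = {
    "api/handler.create_order": ["api/service.save_order"],
    "api/service.save_order": ["api/events.publish"],
}
# Hypothetical non-call relationship: another repo consumes the published event.
SEMANTIC_EDGES = {
    "api/events.publish": ["billing/consumer.on_order_created"],
}


def reachable(start, *edge_maps):
    """Breadth-first search over the union of the given edge maps."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for edges in edge_maps:
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


calls_only = reachable("api/handler.create_order", CALL_EDGES)
with_meaning = reachable("api/handler.create_order", CALL_EDGES, SEMANTIC_EDGES)
```

With call edges alone, the traversal stops at `api/events.publish` and never reaches the billing consumer, so a "what breaks if I change this" question misses the other repo. Once the non-call relationship is represented, the same traversal finds it.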

What makes the run repeatable

We are allergic to benchmarks you cannot reproduce, so the honest details matter. The corpus is public Kubernetes ecosystem code, so you can get it. The indexing cost is small and one-time. At open-source model pricing, compiling files into the IR runs a few dollars per thousand files, and because we diff per file with SHA-256, a commit that touches 12 files re-analyzes 12 files, not 150,000. The whole thing runs on your own infrastructure through Docker, so the run is not hiding behind a black box API. And the underlying engine is open source. You can clone it, bring your own keys, point it at your own repositories, and see whether the numbers hold on your code instead of ours.

That last part is the only test that counts. Our 94% on Kubernetes is evidence, not proof for your situation. The point of making it reproducible is so you do not have to take our word for it.
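The per-file SHA-256 diffing mentioned above can be sketched with the Python standard library. This is an assumption about the general technique, not the engine's actual code: the function names and the manifest layout (relative path mapped to hex digest) are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root: Path, manifest: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (paths whose digest differs from the manifest, fresh manifest).

    `manifest` is a hypothetical mapping of relative path -> digest saved
    from the previous indexing run.
    """
    current = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }
    stale = [path for path, digest in current.items() if manifest.get(path) != digest]
    return stale, current
```

Only the paths in `stale` would need to be re-analyzed; everything else keeps its previous result. A fuller version would also handle files that were deleted since the last run, which this sketch ignores.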

The takeaway

The interesting finding was never just the accuracy number. It was that accuracy and token cost improved together: accuracy went up while cost went down. For two years the assumption baked into AI coding tools has been that understanding a codebase is expensive, so you ration context and let the model rediscover everything each session. That assumption is wrong. Understanding is a one-time cost. Re-discovering is the expensive part, and you pay it over and over on every question because the representation underneath threw the understanding away.

That is the whole idea behind the verifiable context layer for code. Compile the understanding once into a verifiable code IR, serve it to every engineer over one MCP URL, and check every change against it. The model stops paying the rediscovery tax, and drift gets caught instead of shipped. 94% accuracy at a fifth of the tokens is not a tuning win. It is what the numbers look like when the model finally knows the codebase instead of guessing at it.

We open sourced the engine, the ingestion pipeline, and the benchmark. Run it on your own repos.

www.bytebell.ai
