Code Graph RAG, Explained, and Why a Verifiable IR Is the Next Step Past It

Code graph RAG fixed the biggest flaw in vector search. It follows real edges instead of guessing at similarity. But a graph of symbols is still not a graph of meaning, and it cannot tell you whether a change still does what the system is supposed to do. Here is how code graph RAG actually works, where it stops, and why a verifiable intermediate representation is the layer above it.

If you have shipped any AI coding tooling in the last two years, you have watched the same story play out. Everyone started with vector search. Everyone hit the wall where vector search returns files that look related but misses the ones that actually matter. And everyone arrived, more or less on their own, at the same fix. Build a graph.

Code graph RAG is the best version of that fix. It is a real improvement and you should understand it. But it is not the end of the road, and the reason it is not the end of the road is easy to miss. So this post explains code graph RAG from the ground up, shows you exactly where it stops being enough, and makes the case for the layer that sits above it, which is a verifiable intermediate representation.

First, what RAG was, and why code broke it

Retrieval-augmented generation is a simple idea. You have more code than fits in a context window, so you do not give the model everything. You give it the relevant slice. The whole game is in that one word, relevant.

The first generation of RAG decided relevance by similarity. You chop the codebase into chunks, you turn each chunk into a vector, you turn the user’s question into a vector, and you hand back the chunks whose vectors sit closest. For ordinary writing this works really well, because in writing, things that are about the same topic tend to use similar words.

Code is not ordinary writing. In code, the two things you most need to see together often share almost no words at all. A function called retryCharge() in billing/processor.ts and the webhook handler that calls it three repositories away might have nothing in common on the page. Meanwhile validate() in your auth service and validate() in some form library look almost identical to an embedding model and have nothing to do with each other. Cosine similarity between embeddings measures how alike two chunks read. It does not measure whether one piece of code actually reaches the other.
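To make the failure concrete, here is a minimal sketch of the similarity lookup described above. The three-number vectors are hand-written stand-ins for what an embedding model would produce, and the chunk paths other than billing/processor.ts are hypothetical, so treat the ranking as an illustration and not a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk names whose vectors sit closest to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Assumption: toy axes are roughly (payments, validation, HTTP plumbing).
# A real embedding has hundreds of dimensions with no such labels.
CHUNKS = {
    "billing/processor.ts::retryCharge": [0.9, 0.1, 0.1],
    "webhooks/handler.ts::onEvent": [0.05, 0.1, 0.9],  # calls retryCharge
    "auth/validate.ts::validate": [0.1, 0.9, 0.2],
    "forms/validate.ts::validate": [0.1, 0.9, 0.1],
}

# Hypothetical query: "what is affected if I change retryCharge?"
QUERY = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    print(top_k(QUERY, CHUNKS, k=3))
```

In this toy setup the webhook handler, the one chunk that actually depends on retryCharge(), ranks last, while the two unrelated validate() functions score as near-duplicates of each other.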

That is the wall. Relevance in code is about structure, and similarity search is blind to structure.

What code graph RAG actually does

Code graph RAG swaps out “find the chunks that look similar” for “find the chunks that are actually connected.” Instead of a pile of vectors, you build a graph.

The way you build it usually goes like this. You parse every file into a syntax tree, normally with tree-sitter, which is fast and has grammars for most mainstream languages. From that tree you pull out the entities, meaning files, classes, functions, methods, and variables, and you pull out the edges between them. This function calls that function. This file imports that module. This class inherits from that base. This type implements that interface. You store all of it as nodes and relationships in a graph database like Neo4j or FalkorDB.
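The extraction step can be sketched in a few lines. To stay self-contained, this example swaps tree-sitter for Python's built-in ast module and keeps the edges in a plain list instead of a graph database. The function names inside SOURCE are made up for illustration, and the extractor only handles direct calls by name, which is far less than a production indexer would resolve.

```python
import ast

# Hypothetical source file to index.
SOURCE = '''
def update_user(user_id):
    return user_id

def write_audit_log(event):
    return event

def process_webhook(event):
    write_audit_log(event)
    return update_user(event["user_id"])
'''

def extract_call_edges(source):
    """Return sorted (caller, "CALLS", callee) triples found in the source.

    Simplification: only calls written as a bare name, like foo(), are
    captured. Method calls, imports, and inheritance edges are ignored.
    """
    tree = ast.parse(source)
    edges = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Walk the function body looking for call expressions.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.add((node.name, "CALLS", inner.func.id))
    return sorted(edges)

if __name__ == "__main__":
    for edge in extract_call_edges(SOURCE):
        print(edge)
```

Each triple maps naturally onto a node, a relationship, and a node in whatever graph store you choose.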

Now retrieval is a walk through the graph, not a similarity lookup. When a developer asks a question, you find the starting node and you follow the edges out from it. You go downstream to follow the call chain and see what this function ends up invoking. You go upstream to find the inbound callers and see what breaks if you change it. You walk the class and module boundaries to understand the surrounding scope. You follow the type edges to see which implementation actually runs. What comes back is a bounded slice of the call graph centered on the question, and you hand that to the model instead of the top few similar chunks. This is what tools like GitNexus, Sourcegraph with SCIP, code-graph-mcp, and CodeGraphContext are doing under the hood, and it is what people are really after when they search for “graph rag for codebase.”
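Here is a rough sketch of that walk, assuming the call edges are already extracted into a list of (caller, callee) pairs. The function names and the max_hops parameter are hypothetical, and a real tool would run this as a query inside its graph store instead of in Python dictionaries.

```python
from collections import deque

# Hypothetical call edges: (caller, callee).
CALLS = [
    ("handle_request", "process_webhook"),
    ("process_webhook", "write_audit_log"),
    ("process_webhook", "update_user"),
    ("update_user", "save_profile"),
    ("admin_tool", "update_user"),
]

def build_index(edges):
    """Build forward (downstream) and reverse (upstream) adjacency maps."""
    downstream, upstream = {}, {}
    for caller, callee in edges:
        downstream.setdefault(caller, set()).add(callee)
        upstream.setdefault(callee, set()).add(caller)
    return downstream, upstream

def walk(start, adjacency, max_hops):
    """Breadth-first walk that stops expanding after max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # keep the slice bounded
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen

def context_slice(start, edges, max_hops=2):
    """Return what the start node reaches and what reaches it."""
    downstream, upstream = build_index(edges)
    return {
        "callees": sorted(walk(start, downstream, max_hops)),
        "callers": sorted(walk(start, upstream, max_hops)),
    }

if __name__ == "__main__":
    print(context_slice("update_user", CALLS, max_hops=2))
```

The hop limit is the design choice that matters here. Without it, a walk from a widely used utility can pull in most of the repository, which defeats the point of retrieving a slice.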

The improvement is real and you can measure it. Vector similarity fails hardest in exactly the place where software architecture lives, which is multi-hop reasoning. A controller calls a service, the service calls a repository, the repository wires through an interface to one of four implementations. No single embedding hop connects the controller to the implementation, because they barely resemble each other even though the dependency between them is direct. A graph traversal just follows that chain one link at a time and arrives at the right place.
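The controller example can be written out as a small typed graph. This is a sketch under assumptions: the class names and edge labels are invented for illustration, only two of the four implementations are listed to keep it short, and the traversal follows IMPLEMENTS edges in reverse so that reaching an interface also reaches the classes behind it.

```python
from collections import deque

# Hypothetical typed edges: (source, kind, target).
EDGES = [
    ("OrderController", "CALLS", "OrderService"),
    ("OrderService", "CALLS", "OrderRepository"),
    ("OrderRepository", "USES", "StorageBackend"),
    ("PostgresBackend", "IMPLEMENTS", "StorageBackend"),
    ("S3Backend", "IMPLEMENTS", "StorageBackend"),
]

def neighbors(node):
    """Yield (label, next_node) pairs reachable in one hop from node."""
    for src, kind, dst in EDGES:
        if src == node and kind in ("CALLS", "USES"):
            yield kind, dst
        elif dst == node and kind == "IMPLEMENTS":
            # Walk from the interface back to the concrete class.
            yield "IMPLEMENTED_BY", src

def find_path(start, goal):
    """Breadth-first search; returns the node chain or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _, nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

if __name__ == "__main__":
    print(" -> ".join(find_path("OrderController", "PostgresBackend")))
```

The controller and the backend share no vocabulary, yet the traversal connects them in four hops because every link in the chain is an explicit edge.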

The research keeps confirming this. RepoGraph at ICLR 2025 reported a 32.8% improvement on SWE-bench using graph-based retrieval. CodexGraph at NAACL 2025 showed graph database queries beating similarity-only retrieval. A January 2026 paper called Reliable Graph-RAG for Codebases found that AST-derived graphs build in seconds, cost a fraction of the LLM-extracted alternatives, and score highest on architectural queries, while the vector-only baseline came last and carried the most hallucination risk.

So code graph RAG is not a fad. If you are still running pure vector search, moving to a graph is the single highest-leverage change you can make, and you should make it.

Where the graph stops

Here is the part that does not get said often enough. A code graph tells you that two things are connected. It does not tell you why, and it cannot tell you whether a change still does what the system is supposed to do. Those are two separate limits, and they stack on top of each other.

The first one is that a graph of symbols is not a graph of meaning. The edge in your graph says processWebhook connects to updateUser. That is true as far as structure goes. But the actual reason that edge exists shows up nowhere in the syntax tree. Maybe a compliance rule says every payment event has to write an audit log before the profile update, so reordering the two calls is a regulatory violation and not just a refactor. That reason was never in the syntax to begin with. Think of an architect who walks the building and draws a blueprint. They can see that a pipe runs from room 204 to room 507. They cannot tell you what flows through it or why rerouting it would be a disaster. Your graph has the exact same blind spot. It maps the wiring and misses the intent.

The second limit is that the graph has no notion of intent to check a change against. A code graph is something you extracted. You kept a summary of the structure and nothing about what each part is meant to do. So when an agent edits a function, the graph can show you who calls it, but it cannot tell you that the edit just broke the contract that function was supposed to honor. There is nothing in the graph that says what “correct” looks like, so there is nothing to verify the change against. That is perfectly fine if all you ever do is retrieve. It falls apart the moment you want to let agents change the code and know the change still honors what the system promised to do.

These two limits together are why a graph, no matter how good, levels off. It is a better index. It is not a representation you can hold a change up against and check.

The next step is a verifiable intermediate representation

Compilers solved a version of this a long time ago. A compiler does not run its analyses on raw source text, and it does not run them on a lossy summary. It lowers the program into an intermediate representation, a structured form that keeps everything that matters about what the program means and that you can analyze, transform, and check. Nearly every serious compiler analysis and optimization already runs on an IR.

That is the missing layer for AI code context. Not a graph you extract from your code, but a derived contract your code is continuously checked against.

A verifiable IR for code context has two properties a graph does not have. First, it carries meaning instead of just structure. Each unit records what it does, why it exists, and what business purpose it serves, which is the intent the syntax tree never held, and it keeps that right alongside the structural edges the graph already gave you. You get the wiring and the reason for the wiring in one place. Second, it is verifiable. Code lowers into the IR once, and from then on each edit is held up against the IR before it lands. Because the intent and the contracts are written down, drift and hallucinated dependencies get caught instead of shipped.
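To show what checking an edit against recorded intent could look like, here is a deliberately small sketch built on the audit log example from earlier. Everything in it is an assumption made for illustration: the IRUnit shape, the must_precede field, and the check_change function are hypothetical and do not describe any particular product's IR format.

```python
from dataclasses import dataclass, field

@dataclass
class IRUnit:
    """One unit of a hypothetical IR: structure plus recorded intent."""
    name: str
    purpose: str
    calls: list  # ordered structural edges, as a plain graph would hold
    # Contract: each (earlier, later) pair must keep that call order.
    must_precede: list = field(default_factory=list)

def check_change(unit, new_calls):
    """Compare an edited call sequence against the unit's contract."""
    violations = []
    for earlier, later in unit.must_precede:
        if earlier not in new_calls:
            violations.append(f"{unit.name}: missing required call {earlier}")
        elif later in new_calls and new_calls.index(earlier) > new_calls.index(later):
            violations.append(f"{unit.name}: {earlier} must run before {later}")
    return violations

PROCESS_WEBHOOK = IRUnit(
    name="process_webhook",
    purpose="Apply a payment event; audit log is written first for compliance.",
    calls=["write_audit_log", "update_user"],
    must_precede=[("write_audit_log", "update_user")],
)

if __name__ == "__main__":
    # An agent's edit that reorders the two calls.
    print(check_change(PROCESS_WEBHOOK, ["update_user", "write_audit_log"]))
```

A plain call graph would treat the reordered edit as unchanged, since both edges still exist. The check only becomes possible once the ordering rule is written down next to the edges.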

Be honest about what this is and is not. A layer derived from code cannot also be the source of truth, and it does not prove your code is correct. That is the trap every “round-trip the code” tool fell into. The IR is a contract the code is continuously checked against. It is a weaker promise than a perfect reconstruction, and it gives you a far stronger result. Especially as machines write code faster than any team can read it, the only thing worth trusting is a clear statement of what the system must do, with continuous checks that it does that and nothing more.

A graph is something you query. A verifiable IR is something you check every change against. That is the whole difference between an index and a contract.

From code graph to context graph

People reach for the term “context graph” the moment a plain code graph stops feeling like enough, and that instinct is correct. But the thing that turns a code graph into a context graph is not more edges. It is the two properties above. A context graph that earns the name is one where the nodes carry meaning and every change can be checked against the structure. Without those, “context graph” is just a code graph with nicer marketing.

A verifiable IR is what a context graph becomes when you take it seriously. The symbols are still there. The edges are still there. But now every node knows why it exists, and the entire representation gives you something to verify each change against. That is the bridge from a graph of symbols to a graph of meaning you can actually trust agents to work on.

This is what ByteBell builds

ByteBell is the verifiable context layer for code, and it does not stop at a code graph. We run the LLM compiler pattern, which is a one-time pass where a model reads every file and lowers it into a verifiable code IR that captures purpose, business context, and cross-repository relationships, and then stores the structural graph right next to it. You pay that compile cost once, on your own infrastructure through Docker, and your code never leaves your environment. At open-source model pricing it comes out to a few dollars per thousand files, not some luxury you have to justify.

From then on every engineer, on any copilot, queries the same representation through a single MCP URL. Instead of re-reading thousands of files, agents get exactly the relevant intent plus code back. And because the IR is verifiable, every agent edit gets checked against it before it lands, using per-file SHA-256 diffing so only what actually changed gets re-examined. Drift and hallucinated dependencies get caught, not shipped. On 46 Kubernetes ecosystem repositories and 150,000 files we measured the result, which was about 10% higher accuracy at 70% lower cost, on roughly a fifth of the tokens. It was also the only approach that actually finished the cross-repository tasks where the graph and vector tools stalled.
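Per-file hash diffing is a general technique, and a minimal version looks like the sketch below. It uses hashlib from the Python standard library. The function names and the in-memory snapshot format are assumptions made for the example, not a description of ByteBell's implementation.

```python
import hashlib

def fingerprint(files):
    """Map each path to the SHA-256 hex digest of its content.

    `files` is a hypothetical in-memory snapshot: {path: source text}.
    """
    return {
        path: hashlib.sha256(content.encode("utf-8")).hexdigest()
        for path, content in files.items()
    }

def changed_paths(old_hashes, new_hashes):
    """Return (added_or_modified, removed) paths between two snapshots."""
    changed = [p for p, h in new_hashes.items() if old_hashes.get(p) != h]
    removed = [p for p in old_hashes if p not in new_hashes]
    return sorted(changed), sorted(removed)

if __name__ == "__main__":
    before = fingerprint({
        "billing/processor.ts": "export function retryCharge() {}",
        "auth/validate.ts": "export function validate() {}",
    })
    after = fingerprint({
        "billing/processor.ts": "export function retryCharge(id) {}",
        "auth/validate.ts": "export function validate() {}",
    })
    # Only the file whose bytes changed needs to be re-examined.
    print(changed_paths(before, after))
```

Because identical content always yields the same digest, untouched files can be skipped without re-reading them against the IR.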

Spec tools like Tessl, Spec Kit, and Kiro start from a blank page. Assistants like Augment keep the layer to themselves. ByteBell derives a verifiable layer from the code you already have, on your own infrastructure, and hands it to every tool you use. Code graph RAG was the right move away from vector search. A verifiable IR is the right move past the graph. If your tooling can find the connected files but still cannot tell you why they are connected or whether a change still honors what the system is supposed to do, you have hit the ceiling of the graph, and the layer above it is the IR.

www.bytebell.ai
