Posts from 2026
44 posts published

Your AI Coding Agent Reads 166 Tokens for Every 1 It Writes. Here's Why That's a Problem.
For every 1 token your AI writes, it reads 166. That 166:1 ratio explains why AI coding is expensive, slow, and constantly hitting limits. Here's the data.

Why You Get Banned From Claude for 5 Hours (And Why It's Not Really About Your Message Count)
Claude locks you out for 5 hours and you barely sent any messages. The real reason: your AI agent silently consumed 100,000+ tokens reading files you didn't ask it to read.

Claude Code Keeps Compacting and Losing My Work. Here's What's Actually Happening.
Claude Code compacts your session and suddenly forgets which files it modified, what errors it found, and what it was working on. Here's the technical explanation of why — and how to prevent it.

Claude Max at $200/Month and Your Limit Still Drains in 90 Minutes. Here's Why.
Even at $200/month, Claude Max 20x users report their usage jumping from 21% to 100% on a single prompt. The problem isn't the plan — it's what's consuming your tokens.

Why Your $20/Month Claude Pro Plan Runs Out in 3 Prompts (And What's Actually Eating Your Tokens)
Your Claude Pro subscription burns through its 5-hour session limit in minutes. Here's the technical reason: your AI agent wastes 70% of your tokens reading files instead of answering your question.

"Context Left Until Auto-Compact: 0%" — What This Warning Means and Why Your AI Just Forgot Everything
When Claude Code shows 'context left until auto-compact: 0%', it's about to summarize everything and throw away the details. Here's exactly what gets lost and why it matters.

Context Rot: Why Your AI Gets Worse the Longer You Talk to It
Research tested 18 frontier models. Every single one gets worse as input length increases. This phenomenon — context rot — is why your AI coding assistant degrades mid-session.

What Happens When Your AI's Context Window Fills Up (The Technical Explanation)
Every AI coding assistant has a fixed context window — the maximum information it can hold at once. Here's what happens step by step when that window fills up, and why bigger windows don't fix the problem.

End-to-End System Evaluation: The Stress Test of GraphRAG
Individual layers may pass, but systems often fail at the seams. This post details how to conduct holistic 'system-in-the-loop' tests, measuring how retrieval noise compounds into generation errors across 25+ repositories. We provide a blueprint for evaluating the full journey from a vague natural-language query to a multi-repo pull request.

Why "Context Graph" Has Become the Most Misunderstood Term in AI Engineering
A technical deep-dive into LLM context management, the computational limits of context windows, and why every AI tool's 'context graph' solution might just be clever marketing around semantic search and RAG.

Every AI Coding Tool Has the Same Problem. None of Them Will Tell You.
Claude Code, Cursor, Copilot, Codex — they all share the same fundamental flaw: no persistent memory, brute-force file reading, and context that fills up and gets thrown away. Here's what none of them admit.

Evaluating Generation and Grounding in Multi-Repo Systems
Retrieving nodes is only half the battle; the LLM must synthesize code that adheres to cross-repo constraints. This post explores measuring faithfulness, checking execution-level correctness against internal SDKs, and using LLM-as-a-Judge to verify that generated code respects the security and type contracts of separate repositories.

I Tracked 100 Million Tokens of Claude Code Usage. 99.4% Were Wasted on Reading.
A developer tracked every token Claude Code consumed for a month. The result: 99.4% were input tokens. For every 1 token written, 166 were consumed reading. Here's what that means for your bill.

The Future: Will Context Windows Grow Forever? (Ring Attention, SSMs, Retrieval-Augmented Everything)
Three competing paradigms: grow context via hardware, replace attention with O(n) alternatives like Mamba, or build external memory systems. Which will win?

Infini-Attention and Compressive Memory: Unbounded Context with Bounded Memory
Google's Infini-Attention combines standard attention with a compressive memory that persists across segments — enabling theoretically infinite context at O(1) memory.

The Information-Theoretic Limits of Context Windows
There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math, from Shannon's channel capacity to attention bounds.

The Rate-Distortion Theory of Context Compression
Context compaction is a lossy compression problem. Rate-distortion theory gives the theoretical lower bound on how much conversation history can be compressed.

Rotary Position Embeddings: The Full Mathematical Derivation
RoPE is used by virtually every modern LLM. Here's the complete derivation from first principles, a proof of the relative position property, and NTK-aware scaling.

Claude Is Rate-Limiting Everyone. Here's Why Good Context Beats a Smarter Model.
Anthropic just throttled Claude Opus and Sonnet during peak hours. Developers are canceling subscriptions and looking for alternatives. Here's the argument: a good open-source model with great context beats a frontier model that won't let you use it.