The Information-Theoretic Limits of Context Windows
There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math from Shannon's channel capacity to attention bounds.
3 posts found
There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math from Shannon's channel capacity to attention bounds.
Context compaction is a lossy compression problem. Rate-distortion theory gives the theoretical lower bound on how much conversation history can be compressed.
As context grows, softmax normalizes attention weights so each relevant token gets less attention. This mathematical property is why AI accuracy drops with length.