3 posts found
Total GPU memory = model weights + KV cache + activations + workspace. Here's the exact formula to compute maximum context length for any GPU configuration.
During inference, the model stores Key and Value vectors for every token. This KV cache is often the biggest memory consumer. Here's the math behind it.
The exact formula for KV cache memory and worked examples for every major model architecture. Calculate your GPU requirements precisely.
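The KV-cache arithmetic these posts cover can be sketched in a few lines. This is a minimal illustration, not the posts' own code: the formula assumes per-token Key and Value tensors stored for every layer, and the example configuration (32 layers, 32 KV heads, head dimension 128, fp16 storage) is a hypothetical Llama-7B-like setup chosen for round numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Memory the KV cache occupies, in bytes.

    The leading factor of 2 counts both the Key and the Value tensor;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128,
# one sequence of 4096 tokens in fp16.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096)
print(f"{cache / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Because the result scales linearly with sequence length and batch size, the same function also answers the inverse question the posts pose: divide the GPU memory left over after weights, activations, and workspace by the per-token cost to get the maximum context length.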