Multi-Query Attention and Grouped-Query Attention: Reducing KV Cache by 8× at the Architecture Level
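Standard multi-head attention keeps a separate K and V for each head. MQA shares a single K/V head across all query heads, and GQA shares one K/V head per group of query heads. The KV cache shrinks by the ratio of query heads to K/V heads (8× when eight query heads share each K/V head), with minimal quality loss.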
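
The size of the saving follows from simple arithmetic: the cache stores one K and one V vector per K/V head, per layer, per token. Below is a minimal sketch of that calculation. The function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context, 2 bytes per element as in fp16) are assumptions chosen for illustration, not figures from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every token."""
    # The leading 2 accounts for the two cached tensors: keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, used only to make the ratios concrete.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# MHA: one K/V head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA: here, 8 query heads share each K/V head, leaving 4 K/V heads.
gqa = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS // 8, HEAD_DIM, SEQ_LEN)
# MQA: a single K/V head shared by all query heads.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB ({mha // size}x smaller than MHA)")
```

Under these assumed numbers the per-sequence cache is 2048 MiB for MHA, 256 MiB for GQA with 4 K/V heads (8× smaller), and 64 MiB for MQA (32× smaller). Only `n_kv_heads` changes between the three calls, which is why the reduction is fixed by the architecture and not by any runtime setting.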