Standard multi-head attention (MHA) gives every attention head its own K and V projections. Multi-query attention (MQA) shares a single K/V head across all query heads, while grouped-query attention (GQA) shares one K/V head per group of query heads — shrinking the KV cache dramatically with minimal quality loss.
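The KV cache scales with the number of K/V heads, so the savings are easy to quantify. A minimal sketch, using a hypothetical 7B-style config (32 layers, 32 query heads, head dim 128, fp16, 4096-token context) purely for illustration:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The factor of 2 accounts for storing both K and V;
    bytes_per_elem=2 assumes fp16/bf16 activations.
    """
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 4096-token context.
mha = kv_cache_bytes(32, 4096, n_kv_heads=32, head_dim=128)  # one KV head per query head
gqa = kv_cache_bytes(32, 4096, n_kv_heads=8,  head_dim=128)  # 4 query heads per KV head
mqa = kv_cache_bytes(32, 4096, n_kv_heads=1,  head_dim=128)  # one KV head shared by all

print(f"MHA: {mha / 2**30:.1f} GiB")  # 2.0 GiB
print(f"GQA: {gqa / 2**30:.1f} GiB")  # 0.5 GiB (4x smaller)
print(f"MQA: {mqa / 2**30:.3f} GiB")  # 0.062 GiB (32x smaller)
```

Because cache size is linear in the KV-head count, GQA with 8 KV heads cuts it 4x and MQA cuts it 32x for this config — the memory freed up translates directly into larger batch sizes or longer contexts at inference time.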