Flash Attention never materializes the full n×n attention matrix. Instead, it computes the output in tiles that fit in fast GPU SRAM. Here's how it works and why it's 2-4× faster.
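To make the tiling idea concrete, here is a minimal NumPy sketch of the core trick, the online softmax: scores are computed one key/value tile at a time, and a running max and running denominator let earlier partial results be rescaled so the full score matrix is never formed. This is an illustration only, not the actual CUDA kernel; the function name `tiled_attention` and the `tile` parameter are made up for the example.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Attention computed tile-by-tile with an online softmax,
    never forming the full n x n score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)          # running max of scores per query row
    row_sum = np.zeros(n)                  # running softmax denominator per row

    for j in range(0, n, tile):            # loop over key/value tiles
        Kj = K[j:j + tile]
        Vj = V[j:j + tile]
        S = (Q @ Kj.T) * scale             # scores for this tile only
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])         # unnormalized tile probabilities
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vj
        row_max = new_max

    return out / row_sum[:, None]

# sanity check against the naive full-matrix version
rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = rng.standard_normal((3, n, d))
S = (Q @ K.T) / np.sqrt(d)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The real kernel also tiles over queries and keeps each tile in on-chip SRAM while fusing the whole loop into one pass, which is where the memory savings and the speedup come from; the sketch above only shows why tiling gives exactly the same output as the naive computation.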