Self-attention computes an interaction score between every pair of tokens. For n tokens, that's n² scores, so the cost grows quadratically with sequence length. Here's the full mathematical derivation.
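Before the math, a small sketch can make the n² count concrete. It assumes the common scaled dot-product form of attention with a softmax over each row. The function name `attention_weights` and the toy vectors are illustrative only and are not taken from the post.

```python
import math

def attention_weights(queries, keys):
    """Return the n x n matrix of softmax-normalized attention weights.

    queries, keys: lists of n vectors (lists of floats), each of length d.
    Assumes scaled dot-product scoring; other scoring functions exist.
    """
    d = len(queries[0])
    weights = []
    for q in queries:  # n rows
        # One scaled dot product per (query, key) pair: n columns per row.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the row.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input: n = 4 tokens, d = 3 dimensions (arbitrary values).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
W = attention_weights(tokens, tokens)
n_scores = sum(len(row) for row in W)
print(n_scores)  # 16, i.e. n * n for n = 4
```

Doubling the number of tokens to 8 would produce 64 scores, which is the quadratic growth in question. Each score is itself a d-dimensional dot product, so the arithmetic for the score matrix scales as n² · d.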