DSA: DeepSeek Sparse Attention
DeepSeek 3.2 & 3.2-Speciale are ridiculously cheap because of DSA
LLMs aren’t quadratic anymore
They trained an additional "model" that acts as a "pre-attention", selecting only the portions of the context that are probably relevant
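A minimal sketch of that idea (not DeepSeek's actual kernel): a cheap scorer rates every past token, and full attention only runs over the top-k it picks. The function name, shapes, and top-k value below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_attention_with_indexer(q, k, v, index_scores, top_k):
    """Attend only to the top_k past tokens the indexer scored highest.

    q: (d,) query for the current token
    k, v: (L, d) keys / values for the L previous tokens
    index_scores: (L,) relevance scores from the cheap indexer
    """
    L, d = k.shape
    top_k = min(top_k, L)
    # Keep only the tokens the indexer considers relevant
    idx = index_scores.topk(top_k).indices
    k_sel, v_sel = k[idx], v[idx]
    # Standard scaled dot-product attention, but over top_k tokens instead of L
    attn = F.softmax(q @ k_sel.T / d ** 0.5, dim=-1)   # (top_k,)
    return attn @ v_sel                                # (d,)

# Toy usage: random tensors stand in for real activations and indexer output
L, d = 1024, 64
q, k, v = torch.randn(d), torch.randn(L, d), torch.randn(L, d)
scores = torch.randn(L)
out = sparse_attention_with_indexer(q, k, v, scores, top_k=128)
```

The point: per query, the expensive attention now touches top_k tokens instead of all L, so the dominant cost stops growing quadratically with context length.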
The Lightning Indexer is added after pretraining: it's first trained in a separate warm-up phase, then jointly with the rest of the model
it learns to select which tokens are important and ignore everything else
forgetting = intelligence
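A rough sketch of what such an indexer and its warm-up objective could look like: a few tiny ReLU-scored heads rate query-key pairs, and during warm-up (main model frozen) the indexer learns to mimic the dense attention distribution. Module names, sizes, and the loss wiring here are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightningIndexerSketch(nn.Module):
    """Tiny scorer: a few small heads rate how relevant token s is to token t."""

    def __init__(self, d_model=512, n_heads=4, d_head=32):  # sizes are made up
        super().__init__()
        self.q_proj = nn.Linear(d_model, n_heads * d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.head_weight = nn.Linear(d_model, n_heads)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                      # h: (L, d_model) hidden states
        L = h.shape[0]
        q = self.q_proj(h).view(L, self.n_heads, self.d_head)   # (L, H, dh)
        k = self.k_proj(h)                                       # (L, dh)
        w = self.head_weight(h)                                  # (L, H)
        # score(t, s) = sum_h w[t, h] * relu(q[t, h] . k[s])
        logits = torch.einsum('thd,sd->ths', q, k)               # (L, H, L)
        return (w.unsqueeze(-1) * F.relu(logits)).sum(dim=1)     # (L, L)

def indexer_warmup_loss(index_scores, dense_attn):
    """Warm-up phase sketch (main model frozen): teach the indexer to mimic
    the dense attention distribution. Causal masking omitted for brevity."""
    log_p = F.log_softmax(index_scores, dim=-1)
    return F.kl_div(log_p, dense_attn, reduction='batchmean')

# Toy usage
h = torch.randn(16, 512)
indexer = LightningIndexerSketch()
scores = indexer(h)                                    # (16, 16)
dense_attn = F.softmax(torch.randn(16, 16), dim=-1)    # stand-in target
loss = indexer_warmup_loss(scores, dense_attn)
```

At inference time, each query just keeps the top-k columns of that score matrix and drops the rest: forgetting = intelligence.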
They used 3.2's performance as validation that sparse attention actually works (historically, efficient-attention variants like linear attention made models dumber)
Both of these tech reports explain DSA
3.2-Exp: github.com/deepseek-ai/...
3.2: huggingface.co/deepseek-ai/...