Bluesky Thread

DSA: DeepSeek Sparse Attention

DeepSeek 3.2 & 3.2-Speciale are ridiculously cheap because of DSA

LLMs aren’t quadratic anymore

They trained an additional “model” that acts as a “pre-attention”, selecting only the portions that are probably relevant
A side-by-side comparison diagram explaining Regular Dense Attention versus DSA (DeepSeek Sparse Attention) in transformer models.

⸻

Left Panel: Regular Dense Attention

A box labeled Dense Attention Mechanism (All-to-All) shows every input token connected to every other token with red lines.
	•	Center text: “Quadratic Complexity O(L²)”
	•	Caption: “Every token attends to every other token. High compute cost, scales poorly with sequence length.”
	•	A red bar at the bottom reads: “HIGH COST, THOROUGH.”

⸻

Right Panel: DSA (DeepSeek Sparse Attention)

Parallel layout, but the attention box shows only a few green connections. A Selector/Indexer module sits between input and attention.
	•	It selects k relevant tokens from the full sequence.
	•	Center text inside attention box: “Near-Linear Complexity O(L·k), k ≪ L”
	•	Caption: “Tokens only attend to top-k most relevant tokens. Reduced compute cost, scales efficiently.”
	•	A green bar at the bottom reads: “LOW COST, EFFICIENT, REQUIRES ADAPTATION.”

⸻

Overall, the infographic contrasts dense all-to-all computation with selective sparse attention, highlighting the computational savings of dynamic sparsity.
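To make the contrast concrete, here is a minimal PyTorch sketch of both panels: dense attention scores every pair of tokens, while the sparse path keeps only the top-k keys per query. The sizes are made up, and reusing the full score matrix for selection is a shortcut for illustration only; the point of DSA is that a cheap selector avoids ever materializing the full L×L scores.

```python
import torch
import torch.nn.functional as F

L, d, k = 1024, 64, 128    # sequence length, head dim, keys kept per query (made-up sizes)
q = torch.randn(L, d)
keys = torch.randn(L, d)
v = torch.randn(L, d)

# Dense attention: every token scores every other token -> O(L^2) work.
scores = q @ keys.T / d ** 0.5                   # (L, L)
dense_out = F.softmax(scores, dim=-1) @ v        # (L, d)

# Top-k sparse attention: each query keeps only its k best keys -> O(L*k) work.
# (Here we cheat and reuse the full score matrix for selection; a DSA-style
# setup uses a lightweight indexer so the L x L matrix is never needed.)
topk_scores, topk_idx = scores.topk(k, dim=-1)   # (L, k)
weights = F.softmax(topk_scores, dim=-1)         # softmax over the kept keys only
sparse_out = torch.einsum("lk,lkd->ld", weights, v[topk_idx])

print(dense_out.shape, sparse_out.shape)         # both torch.Size([1024, 64])
```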
The Lightning Indexer is added after pretraining: it is first trained in a separate phase, then jointly with the model

it learns to select which tokens are important and ignore everything else

forgetting = intelligence
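Conceptually, the indexer is just a tiny scoring network bolted onto the main model’s hidden states: it ranks past tokens for each query and hands the top-k indices to the main attention. The toy sketch below illustrates that idea; the layer sizes, head count, and ReLU-style scoring are assumptions for illustration (and the causal mask is omitted), not the actual Lightning Indexer.

```python
import torch
import torch.nn as nn

class TinyIndexer(nn.Module):
    """Toy stand-in for a lightweight token selector (hypothetical, not DeepSeek's code)."""
    def __init__(self, d_model: int, d_index: int = 32, n_heads: int = 4, k: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(d_model, n_heads * d_index)  # cheap per-query projection
        self.k_proj = nn.Linear(d_model, d_index)             # shared key projection
        self.head_w = nn.Parameter(torch.ones(n_heads))       # per-head mixing weights
        self.n_heads, self.d_index, self.k = n_heads, d_index, k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (L, d_model) hidden states from the main model.
        L = h.shape[0]
        q = self.q_proj(h).view(L, self.n_heads, self.d_index)     # (L, H, d_index)
        keys = self.k_proj(h)                                       # (L, d_index)
        # Cheap relevance score of every key s for every query t (causal mask omitted).
        scores = torch.relu(torch.einsum("thd,sd->ths", q, keys))   # (L, H, L)
        scores = torch.einsum("h,ths->ts", self.head_w, scores)     # (L, L)
        # Each query keeps only its k highest-scoring tokens.
        _, topk_idx = scores.topk(min(self.k, L), dim=-1)            # (L, k)
        return topk_idx

topk_idx = TinyIndexer(d_model=256)(torch.randn(512, 256))
print(topk_idx.shape)  # torch.Size([512, 128])
```

The main attention then only loads the keys and values at those selected indices, which is where the cost saving comes from.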
A multi-stage infographic illustrating the full training and deployment lifecycle of DeepSeek-V3.2, moving from dense attention to DeepSeek Sparse Attention (DSA).

⸻

Stage 1 – Pre-Training

A dense-attention LLM is trained on a massive text corpus.
	•	Complexity shown as O(L²).
	•	The model learns general language patterns via full, quadratic-cost attention.

⸻

Stage 2 – Phase 1: Dense Warm-up (DSA Initialization)

The main model is frozen, acting as a “teacher.”
A trainable Lightning Indexer learns to mimic dense attention’s patterns:
	•	Dense teacher produces attention scores.
	•	Indexer learns to select the top-k relevant tokens.
Only the indexer is trained here.

⸻

Stage 3 – Phase 2: Sparse Training (DSA Adaptation)

The main model is unfrozen and switched to sparse attention powered by DSA.
	•	Both the Lightning Indexer and the model are trainable.
	•	Training continues for trillions of tokens to adapt the model to sparse attention.
	•	The Indexer’s selection patterns continue to refine.

⸻

Stage 4 – Post-Training (Alignment & Refinement)

The resulting DSA-adapted model undergoes:
	•	Supervised Fine-Tuning (SFT) on curated human datasets.
	•	RLHF (reinforcement learning from human feedback).

The goal: align the sparse-trained model with human preferences without losing DSA efficiency.

⸻

Stage 5 – Inference (Deployment)

A user prompt enters the Deployed DSA Model.
	•	Sparse attention runs with near-linear complexity O(L·k).
	•	The model rapidly selects and attends only to relevant tokens.
	•	Shown benefits: fast inference, low cost, and scalability to long contexts.

⸻

The diagram conveys how DeepSeek-V3.2 transitions from expensive dense training to efficient sparse inference while preserving model quality.
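As a rough sketch of what Phases 1 and 2 look like as training steps: Phase 1 freezes the main model and fits the indexer’s score distribution to the dense attention pattern (a KL-style distillation), Phase 2 unfreezes everything and trains with the top-k selection actually in the forward pass. The `model` / `indexer` interfaces (`dense_attention_probs`, `scores`, `topk`, `forward_sparse`) are hypothetical stand-ins, not DeepSeek’s API.

```python
import torch
import torch.nn.functional as F

def phase1_step(model, indexer, tokens, opt_indexer):
    """Dense warm-up: main model frozen; only the indexer learns, by matching
    its score distribution to the dense teacher's attention pattern."""
    with torch.no_grad():
        dense_attn = model.dense_attention_probs(tokens)      # (L, L) teacher target
    index_logits = indexer.scores(tokens)                     # (L, L) student scores
    loss = F.kl_div(F.log_softmax(index_logits, dim=-1), dense_attn,
                    reduction="batchmean")
    opt_indexer.zero_grad(); loss.backward(); opt_indexer.step()
    return loss.item()

def phase2_step(model, indexer, tokens, opt_all):
    """Sparse training: model and indexer train jointly, with sparse attention
    over the indexer-selected tokens in the loop."""
    topk_idx = indexer.topk(tokens)                           # (L, k) selected tokens
    logits = model.forward_sparse(tokens, topk_idx)           # (L, vocab) next-token logits
    loss = F.cross_entropy(logits[:-1], tokens[1:])           # standard LM objective
    opt_all.zero_grad(); loss.backward(); opt_all.step()
    return loss.item()
```

The key point from the diagram is that the indexer already mimics a dense teacher before the model ever runs sparsely, so the top-k selection is sensible by the time the quadratic path is switched off.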
They used the performance of 3.2 as validation that sparse (near-linear) attention actually does work (historically it made the model dumber)

Both of these tech reports explain DSA

3.2-Exp: github.com/deepseek-ai/...

3.2: huggingface.co/deepseek-ai/...