Bluesky Thread

Kimi-Linear: more efficient attention


New Moonshot model!!

It’s a 48B-A3B model built as an experiment in efficient long-context attention: a hybrid of Kimi Delta Attention (KDA) and Multi-head Latent Attention (MLA)

- very fast inference
- strong performance
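The hybrid idea can be sketched as a layer schedule: most layers use KDA-style linear attention (fixed-size recurrent state, no growing KV cache), with full MLA attention interleaved periodically. A minimal sketch, assuming a 3:1 KDA-to-MLA interleaving ratio for illustration (check the tech report for the actual layout):

```python
# Hedged sketch of a hybrid attention layer schedule.
# The 3:1 KDA:MLA ratio is an assumption, not confirmed from this thread.

def layer_types(n_layers: int, kda_per_mla: int = 3) -> list[str]:
    """Every (kda_per_mla + 1)-th layer uses full (MLA) attention;
    the rest use linear (KDA) attention with O(1) per-token state."""
    types = []
    for i in range(n_layers):
        if (i + 1) % (kda_per_mla + 1) == 0:
            types.append("MLA")  # global attention, keeps a KV cache
        else:
            types.append("KDA")  # linear attention, fixed-size state
    return types

print(layer_types(8))  # ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```

Because only the MLA layers store per-token keys and values, the KV cache shrinks in proportion to how sparsely they are interleaved.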

github.com/MoonshotAI/K...
[Figure: scatter plot of performance (y-axis) vs. decoding acceleration (x-axis, 1× to 4×), comparing Kimi-Linear, GDN-H, and MLA on two benchmarks.
- RULER (~128k context): MLA 81.3 at ~1×; GDN-H 80.5 and Kimi-Linear 84.3 at ~4×.
- MMLU-Pro (4k context): MLA 47.2 and GDN-H 47.9 at ~1×; Kimi-Linear 51.0 at ~1.2×.
Kimi-Linear scores highest on both benchmarks while also decoding fastest, so it gains speed without giving up accuracy relative to the baselines.]
Tech Report here: github.com/MoonshotAI/K...

- 1M context
- 75% reduction in KV cache
- 6x decoding throughput
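The 75% KV-cache reduction is consistent with simple interleaving arithmetic: if only the full-attention (MLA) layers keep a per-token KV cache and 1 in 4 layers is MLA, the cache is 25% of an all-full-attention model's. A back-of-the-envelope sketch, assuming that 3:1 ratio and a hypothetical 48-layer stack:

```python
# Hedged arithmetic sketch: hybrid KV-cache savings.
# The 3:1 KDA:MLA ratio and 48-layer depth are illustrative assumptions.

def kv_cache_fraction(n_layers: int, kda_per_mla: int = 3) -> float:
    """Fraction of an all-full-attention KV cache that the hybrid keeps:
    only the full-attention (MLA) layers store per-token KV entries."""
    mla_layers = n_layers // (kda_per_mla + 1)
    return mla_layers / n_layers

reduction = 1 - kv_cache_fraction(48)
print(f"{reduction:.0%}")  # 75%
```

At long contexts the KV cache dominates decoding memory traffic, which is why a 75% smaller cache can translate into multi-x decoding throughput gains.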
