Bluesky Thread

it’s new entrant week! today? Kimi-K2

an open weights model that’s competitive with Claude 4 Opus

- 1T, 32B active MoE
- a true agentic model, hitting all the marks on coding & tool use
- no training instability, due to MuonClip optimizer

new frontier lab to watch!

moonshotai.github.io/Kimi-K2/
A multi-bar chart compares different models across four agentic and competitive coding benchmarks: SWE-bench Verified, SWE-bench Multilingual, LiveCodeBench v6, and OJBench. Each benchmark has a group of vertical bars showing performance scores (presumably in % or points). The logos on each bar represent different models or organizations. Here’s the breakdown:

⸻

SWE-bench Verified
	•	K (blue): 71.6
	•	Light blue falcon: 65.8
	•	Purple star: 38.8
	•	Gray bar (unknown logo): 34.4
	•	OpenAI swirl: 54.6
	•	Anthropic “A” + gray: 79.4 and 72.5

⸻

SWE-bench Multilingual
	•	K (blue): 47.3
	•	Light blue falcon: 25.8
	•	Purple star: 20.9
	•	OpenAI swirl: 31.5
	•	Anthropic A: 51.0

⸻

LiveCodeBench v6
	•	K (blue): 53.7
	•	Light blue falcon: 46.9
	•	Purple star: 37.0
	•	OpenAI swirl: 44.7
	•	Anthropic A: 47.4
	•	Unknown logo (diamond star): 44.7

⸻

OJBench
	•	K (blue): 27.1
	•	Light blue falcon: 24.0
	•	Purple star: 11.3
	•	OpenAI swirl: 19.5
	•	Anthropic A: 19.6
	•	Gray bar: 19.5

⸻

Key observations:
	•	The blue “K” consistently ranks highest or near highest in all benchmarks.
	•	Anthropic models (orange “A”) are strong in SWE-bench benchmarks.
	•	The purple star model underperforms across all benchmarks.
	•	The gap between multilingual and regular SWE-bench performance is large for most models.
8 hours later
note: they have NOT done RL yet. this is just an instruction-tuned model. And it’s hanging with the greats
1 hour later
i might be wrong on this. they do advertise it as an “agentic model”
1 hour later
okay, i think the real story is, K2 isn’t a long-thinking model, so it’s not quite in the same league as o3 or gemini

but it probably has been RL’d. that’s how you get it to be agentic

i’m not sure what the balance is, but it does seem like it’s a solid model to build off of
34 likes 6 reposts
