Bluesky Thread

Kimi K2 paper is out!


lessons:

1. they explicitly suppressed long CoT
2. more MoE experts > more attention
3. 20k MCP tools (17k synthetic)
4. agents all the way down

github.com/MoonshotAI/K...
A line plot shows validation loss versus training tokens for models trained with different numbers of attention heads and floating point operations (FLOPs). Two types of models are compared:
* Solid lines with square markers: models where the number of attention heads equals the number of layers.
* Dotted lines with circle markers: counterparts with the number of attention heads doubled.

Each color represents a different total FLOPs budget:
* Blue: 1.2e+20 FLOPs
* Purple: 2.2e+20 FLOPs
* Green: 4.5e+20 FLOPs
* Orange: 5.4e+20 FLOPs
* Red: 9.0e+20 FLOPs

Across all compute budgets, models with doubled attention heads consistently achieve lower validation loss, reducing it by approximately 0.5% to 1.2%. This advantage is visually shown by the dotted lines dipping below the corresponding solid lines of the same color.
this part feels incredibly consequential

first there's the visual of MCP tool generation. but also, it's agents all the way down

agents to generate tools, agents to simulate humans, agents to judge/score.. all to make a model that can be used as an agent
A figure with two main sections illustrates tool synthesis and embedding analysis:

**Top section (Figure 8: Data synthesis pipeline for tool use)**

* *(a) Synthesizing tool specs, agents and tasks*:

  * Begins with "MCP tools" and "Applications" feeding into a “Tool Repository” that stores both “real-world tool specs” and “synthesized tool specs.”
  * “Domains” also feed into this repository.
  * This flows into “Agents,” which are used to generate “Tasks with rubrics.”

* *(b) Generating agent trajectories*:

  * A “User Agent” interacts with an “Agent,” which observes and calls a “Tool Simulator.”
  * This produces “trajectories” that go to a “Judge Agent” for filtering based on rubrics and tasks, producing “Filtered Data.”

**Bottom section (Figure 9: t-SNE visualizations of tool embeddings)**

* *(a)* Left: A t-SNE scatterplot showing real MCP tools, colored by their original source categories. Clusters appear organic and mixed in structure, with categories like “databases,” “cloud,” and “nlp.”

* *(b)* Right: A t-SNE scatterplot showing synthetic tools, colored by pre-defined domain categories. Clusters are more compact and structured, suggesting deliberate coverage of functional areas like “enterprise,” “science,” and “security.”

Together, these diagrams describe a system that synthesizes tool specs and uses agents to explore them, producing diverse trajectories. The embeddings visualization confirms good category coverage by both real and synthetic tools.
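The trajectory-generation loop in Figure 8(b) can be sketched roughly as follows. This is a hypothetical reconstruction from the figure description only; every function here (`simulate_tool`, `user_agent_turn`, `agent_turn`, `judge`) is a stand-in, not Moonshot's actual code, and a real pipeline would back each one with an LLM call.

```python
# Hypothetical sketch of Figure 8(b): a user agent drives a tool-using
# agent against a tool simulator, and a judge agent filters the resulting
# trajectories against the task's rubric ("Filtered Data").
# All agent functions are stand-ins for LLM-backed components.

def simulate_tool(tool_spec, call):
    """Stand-in tool simulator: returns a canned response for a tool call."""
    return {"tool": tool_spec["name"], "args": call, "result": "ok"}

def user_agent_turn(task, history):
    """Stand-in user simulator: emits the next user message for the task."""
    return f"user message {len(history)} for task: {task['goal']}"

def agent_turn(message, tool_spec):
    """Stand-in agent: decides which tool call to make for the message."""
    return {"query": message}

def judge(trajectory, rubric):
    """Stand-in judge agent: keeps trajectories that satisfy the rubric."""
    return len(trajectory) >= rubric["min_turns"]

def generate_filtered_data(task, tool_spec, n_turns=3):
    trajectory = []
    for _ in range(n_turns):
        msg = user_agent_turn(task, trajectory)
        call = agent_turn(msg, tool_spec)
        obs = simulate_tool(tool_spec, call)
        trajectory.append((msg, call, obs))
    # Judge agent filters against the task rubric before data is kept.
    return trajectory if judge(trajectory, task["rubric"]) else None

data = generate_filtered_data(
    {"goal": "look up a record", "rubric": {"min_turns": 2}},
    {"name": "db.query"},
)
```

The point of the structure is that every role in the loop is itself an agent, which is exactly the "agents all the way down" observation above.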
great deep dive! bsky.app/profile/timf...
Tim Duffy @timfduffy.com
Moonshot have released the Kimi K2 technical report, here are some parts of it I found interesting:

The best data was used in multiple epochs, but was rephrased between them. Their testing showed this produces large gains relative to training repeatedly on the same phrasing.
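The rephrase-between-epochs idea can be sketched as below. This is a minimal illustration, not the paper's pipeline: `rephrase` stands in for an LLM-based rewriter, and the function names are hypothetical.

```python
# Hypothetical sketch: reuse high-quality documents across epochs, but
# rephrase each document before every repeat pass so the model never
# trains on the exact same phrasing twice. `rephrase` stands in for an
# LLM rewriter; this is not Moonshot's actual code.

def rephrase(doc, epoch):
    """Stand-in rewriter: a real pipeline would call an LLM here."""
    return f"[rephrased v{epoch}] {doc}"

def build_epoch_corpora(high_quality_docs, n_epochs):
    corpora = []
    for epoch in range(n_epochs):
        if epoch == 0:
            corpora.append(list(high_quality_docs))  # original phrasing
        else:
            corpora.append([rephrase(d, epoch) for d in high_quality_docs])
    return corpora

epochs = build_epoch_corpora(["The cat sat on the mat."], n_epochs=3)
```

Each epoch's corpus carries the same underlying content, so the model revisits the best data without the memorization cost of verbatim repetition.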