Explainer: K2 & Math Olympiad Golds

Feeling behind? Makes sense, AI moves fast. This post will catch you up.
The year of agents
First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.
Timeline
The last 6 months:
- Jan 20: DeepSeek R1 launched — Open source thinking model, performing near SOTA at the time
- Feb 2: Deep Research launched — An agent that uses tools
- Feb 19: Grok 3 — a huge 2T+ model, the first of the year's huge frontier models
- March 26: OpenAI adopts MCP (Model Context Protocol) — MCP starts to become mainstream
- April 16: o3 & o4-mini — First notable “agentic” models available in an API
- April 29: The sycophancy epidemic in GPT-4o
- April 30: DeepSeek Prover — Trained to use an automated proof assistant, Lean, to do math
- May 22: Claude-4 — huge 2T+ thinking models that only think when necessary
- June 10: o3 prices cut by 80% — which makes us wonder how small these models really are
- June 13: Cognition vs. Anthropic: "Don't Build Multi-Agents" vs. "How to Build Multi-Agents" — "context engineering" emerges as a term
- July 9: Grok 4 — huge 2T+ thinking multi-agent model that still has the top score on HLE (Humanity's Last Exam)
- July 12: K2 — Huge 1T open weights agentic model that isn’t a thinking model
- July 17: ChatGPT Agent — agentic o3 variant (maybe o4?) that spans computer use, code & MCP
- July 19: International Math Olympiad gold — the best math model yet, and it doesn't use tools
Is ‘thinking’ necessary?
Obviously it is, right?
Back in January, we noticed that when a model does Chain of Thought (CoT) "thinking", it exhibits behaviors like:
- Self-verification
- Sub-goal setting
- Backtracking (undoing an unfruitful path)
- Backward chaining (working backwards)
All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.
K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn't produce a long thought trace. That was so surprising that I thought I'd heard wrong; it took a few hours to figure out what the real story was.
For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.
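To see why this matters at agent scale, here's a back-of-envelope sketch. The token counts and the per-token price below are made-up illustrative numbers, not any provider's actual pricing.

```python
# Back-of-envelope cost of one task for a "thinking" vs. non-thinking model.
# All numbers below are illustrative assumptions, not real pricing or traces.

PRICE_PER_MILLION_OUTPUT_TOKENS = 10.00  # hypothetical $/1M output tokens

def task_cost(output_tokens: int) -> float:
    """Dollar cost of generating `output_tokens` at the assumed price."""
    return output_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS

reasoning_tokens = 8_000   # assumed length of a long thought trace
answer_tokens = 500        # assumed tool calls + final answer

print(f"with thinking:    ${task_cost(reasoning_tokens + answer_tokens):.4f}")
print(f"without thinking: ${task_cost(answer_tokens):.4f}")
# Over an agent fleet running tasks 24/7, that ~17x gap in output tokens
# is most of the bill.
```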
What to watch
- More models trained like K2
Tool usage connects the world
R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.
MCP was a huge jump toward agents. It's a deliberately dumb protocol, which leads a lot of people to misunderstand the point: it's just a standard way of letting LLMs interact with the world. Emphasis on standard.
The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.
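For a sense of how simple the protocol is from a developer's side, here's a minimal tool server sketch using the MCP Python SDK's FastMCP helper. The import path and decorator names are from my memory of the SDK and may drift between versions, so treat this as a sketch rather than a reference.

```python
# A minimal MCP server exposing one tool, based on the MCP Python SDK's
# FastMCP interface (names as I recall them; check the SDK docs).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a (fake) temperature reading for a city."""
    # A real server would call a weather API here; this is a stub.
    return f"It is 72°F in {city}."

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so any MCP client can call it
```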
K2 tackled the main problem with MCP: since it's a standard, anyone can make an MCP server, which means a lot of them suck. During training, K2 used a system that synthetically generated MCP tools of all kinds, so it learned how to learn to use tools.
That pretty much covers our current agent challenges.
What to watch
- More models trained like K2
- MCP adoption
Are tools necessary?
In math, we made a lot of progress this year using tools like proof assistants. For example, DeepSeek-Prover V2 was trained to write Lean code and incrementally fix its errors based on the proof assistant's output. That seemed (and still seems) like a solid path toward complex reasoning.
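To make "write Lean code" concrete, here's a toy example of the kind of formal statement and proof such a model emits; Lean either accepts it or returns errors the model can then try to repair. The theorem and proof below are mine, not from DeepSeek-Prover.

```lean
-- Toy example: a formal statement plus a candidate proof.
-- A prover model generates something like this, compiles it with Lean,
-- and revises the proof based on any error messages it gets back.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```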
But today, some OpenAI researchers informally announced on X that a private model of theirs achieved gold-medal performance on the International Math Olympiad. This is a huge achievement.
But what makes it surprising is that it didn't use tools. It relied only on a monstrous amount of run-time "thinking" compute.
Clearly stated: next-token prediction (what LLMs do) produced genuinely creative solutions requiring a high level of expertise.
If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.
What to watch
- This math olympiad model. The implications are still unclear, but it seems the approach is more general than math.
Huge vs Tiny
Which is better?
On the one hand, Opus 4, Grok 4 & K2 are all huge models with a depth that screams "intelligence". On the other hand, agentic workloads run 24/7, so the cheaper the model, the better.
Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.
What to watch
- Mixture of Experts (MoE). e.g. K2 is huge, but it only activates a small fraction of its parameters (~32B) per token, which means it uses less compute per token than a lot of local models (see the rough arithmetic after this list). This might be the secret behind o3's 80% price drop.
- OpenAI's open-weights model is expected to land in a couple of weeks. It will likely run on a laptop and match at least o3-mini (released Jan 31).
- GPT-5, expected this fall, is described as a mix of huge & tiny, applying the right strength at the right time
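Here's the rough MoE arithmetic, using approximate public parameter counts for K2 and a generic dense 70B model for comparison; treat all the numbers as ballpark assumptions.

```python
# Rough per-token compute comparison between an MoE model and a dense model.
# Parameter counts are approximate/assumed; per-token FLOPs for a transformer
# scale roughly with the *active* parameters, not the total.

k2_total_params  = 1.0e12   # ~1T total parameters (approximate)
k2_active_params = 32e9     # ~32B parameters activated per token (approximate)
dense_params     = 70e9     # a typical large dense local model, for contrast

print(f"K2 activates {k2_active_params / k2_total_params:.1%} of its weights per token")
print(f"K2 per-token compute vs. dense 70B: {k2_active_params / dense_params:.2f}x")
# => roughly 3% of the weights active, and under half the per-token compute
#    of a dense 70B model, despite being ~14x larger in total.
```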
Context engineering & Sycophancy
The biggest shifts this year have arguably been not in the models but in the engineering around them. The flagship change is the emergence of the term "context engineering" as a replacement for "prompt engineering".
It's an acknowledgement that the "prompt" isn't just a block of text; context also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was really about how hard it is to manage context between agents.
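As a toy illustration of what "context" actually contains, here's a sketch of how an agent's prompt might be assembled from several sources; the structure and names are hypothetical, not any particular framework's API.

```python
# Illustrative only: an agent's context window is assembled from many sources,
# not just the user's message. All names and sections here are hypothetical.

def build_context(user_msg: str,
                  tool_docs: list[str],
                  retrieved_chunks: list[str],
                  notes_from_other_agents: list[str]) -> str:
    """Concatenate the pieces that compete for space in the context window."""
    sections = [
        "## Available tools\n" + "\n".join(tool_docs),
        "## Retrieved documents\n" + "\n".join(retrieved_chunks),
        "## Notes from other agents\n" + "\n".join(notes_from_other_agents),
        "## User request\n" + user_msg,
    ]
    return "\n\n".join(sections)

# "Context engineering" is deciding what goes in each section, how it's
# summarized, and what gets dropped when the window fills up.
```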
Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.
Similarly, sycophancy causes instability in agents. Many now consider it a top problem, on par with hallucination.
What to watch
- Memory — stateful agents (e.g. those built on Letta) are phenomenally interesting but difficult to build. Done well, memory solves a lot of the context engineering problem.
- Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.
Going forward…
And all that still skips over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of pure research.
What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.