Explainer: K2 & Math Olympiad Golds

Feeling behind? Makes sense, AI moves fast. This post will catch you up.
The year of agents
First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.
Timeline
The last 6 months:
- Jan 20: DeepSeek R1 launched — Open source thinking model, performing near SOTA at the time
- Feb 2: Deep Research launched — An agent that uses tools
- Feb 19: Grok 3 — a huge 2T+ model, the first of the year's huge frontier models
- March 26: OpenAI adopts MCP (Model Context Protocol) — MCP starts to become mainstream
- April 16: o3 & o4-mini — First notable “agentic” models available in an API
- April 29: The sycophancy epidemic in GPT-4o
- April 30: DeepSeek Prover — Trained to use an automated proof assistant, Lean, to do math
- May 22: Claude-4 — huge 2T+ thinking models that only think when necessary
- June 10: o3 prices cut by 80% — which makes us wonder how small these models really are
- June 13: Cognition vs. Anthropic: "Don't Build Multi-Agents" vs. "How to Build Multi-Agents" — "context engineering" emerges as a term
- July 9: Grok 4 — huge 2T+ thinking multi-agent model that still has the top score on HLE (Humanity's Last Exam)
- July 12: K2 — Huge 1T open weights agentic model that isn’t a thinking model
- July 17: ChatGPT Agent — agentic o3 variant (maybe o4?) that spans computer use, code & MCP
- July 19: International Math Olympiad gold — the best math model yet, and it doesn't use tools
Is ‘thinking’ necessary?
Obviously it is, right?
Back in January, we noticed that when a model does Chain of Thought (CoT) "thinking", it exhibits behaviors like:
- Self-verification
- Sub-goal setting
- Backtracking (undoing an unfruitful path)
- Backward chaining (working backwards)
All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.
K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn't produce a long thought trace. That was so surprising that I thought I'd heard wrong; it took a few hours to figure out what the real story was.
For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.
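To see why this matters at agent scale, here's a back-of-envelope sketch. The token counts and the per-token price below are made-up illustrative numbers, not any provider's actual pricing.

```python
# Back-of-envelope cost of one task for a "thinking" vs. non-thinking model.
# All numbers below are illustrative assumptions, not real pricing or traces.

PRICE_PER_MILLION_OUTPUT_TOKENS = 10.00  # hypothetical $/1M output tokens

def task_cost(output_tokens: int) -> float:
    """Dollar cost of generating `output_tokens` at the assumed price."""
    return output_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS

reasoning_tokens = 8_000   # assumed length of a long thought trace
answer_tokens = 500        # assumed tool calls + final answer

print(f"with thinking:    ${task_cost(reasoning_tokens + answer_tokens):.4f}")
print(f"without thinking: ${task_cost(answer_tokens):.4f}")
# Over an agent fleet running tasks 24/7, that ~17x gap in output tokens
# is most of the bill.
```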
What to watch
- More models trained like K2
Tool usage connects the world
R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.
MCP was a huge jump toward agents. It's a deliberately dumb protocol, which leads a lot of people to misunderstand the point: it's just a standard way of letting LLMs interact with the world. Emphasis on standard.
The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.
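For a sense of how simple the protocol is from a developer's side, here's a minimal tool server sketch using the MCP Python SDK's FastMCP helper. The import path and decorator names are from my memory of the SDK and may drift between versions, so treat this as a sketch rather than a reference.

```python
# A minimal MCP server exposing one tool, based on the MCP Python SDK's
# FastMCP interface (names as I recall them; check the SDK docs).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a (fake) temperature reading for a city."""
    # A real server would call a weather API here; this is a stub.
    return f"It is 72°F in {city}."

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so any MCP client can call it
```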
K2 tackled the main problem with MCP: since it's a standard, anyone can make an MCP server, which means a lot of them suck. During training, K2 used a system that synthetically generated MCP tools of all kinds, so it learned how to learn to use tools.
That pretty much covers our current agent challenges.
What to watch
- More models trained like K2
- MCP adoption
Are tools necessary?
In math, we made a lot of progress this year using tools like proof assistants. For example, DeepSeek-Prover V2 was trained to write Lean code and incrementally fix its errors based on the proof assistant's output. That seemed (and still seems) like a solid path toward complex reasoning.
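To make "write Lean code" concrete, here's a toy example of the kind of formal statement and proof such a model emits; Lean either accepts it or returns errors the model can then try to repair. The theorem and proof below are mine, not from DeepSeek-Prover.

```lean
-- Toy example: a formal statement plus a candidate proof.
-- A prover model generates something like this, compiles it with Lean,
-- and revises the proof based on any error messages it gets back.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```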
But today, some OpenAI researchers informally announced on X that a private model of theirs achieved gold-medal performance on the International Math Olympiad. This is a huge achievement.
But what makes it surprising is that it didn't use tools. It relied only on a monstrous amount of run-time "thinking" compute.
Clearly stated: next-token prediction (what LLMs do) produced genuinely creative solutions requiring a high level of expertise.
If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.
What to watch
- This math olympiad model. The implications are still unclear, but it seems the approach is more general than math.
Huge vs Tiny
Which is better?
On the one hand, Opus 4, Grok 4 & K2 are all huge models with a depth that screams "intelligence". On the other hand, agentic workloads run 24/7, so the cheaper the model, the better.
Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.
What to watch
- Mixture of Experts (MoE). e.g. K2 is huge, but it only activates a small fraction of its parameters (~32B) per token, which means it uses less compute per token than a lot of local models (see the rough arithmetic after this list). This might be the secret behind o3's 80% price drop.
- OpenAI's open-weights model is expected to land in a couple of weeks. It will likely run on a laptop and match at least o3-mini (released Jan 31).
- GPT-5, expected this fall, is described as a mix of huge & tiny, applying the right strength at the right time
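Here's the rough MoE arithmetic, using approximate public parameter counts for K2 and a generic dense 70B model for comparison; treat all the numbers as ballpark assumptions.

```python
# Rough per-token compute comparison between an MoE model and a dense model.
# Parameter counts are approximate/assumed; per-token FLOPs for a transformer
# scale roughly with the *active* parameters, not the total.

k2_total_params  = 1.0e12   # ~1T total parameters (approximate)
k2_active_params = 32e9     # ~32B parameters activated per token (approximate)
dense_params     = 70e9     # a typical large dense local model, for contrast

print(f"K2 activates {k2_active_params / k2_total_params:.1%} of its weights per token")
print(f"K2 per-token compute vs. dense 70B: {k2_active_params / dense_params:.2f}x")
# => roughly 3% of the weights active, and under half the per-token compute
#    of a dense 70B model, despite being ~14x larger in total.
```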
Context engineering & Sycophancy
The biggest shifts this year have arguably been not in the models but in the engineering around them. The flagship change is the emergence of the term "context engineering" as a replacement for "prompt engineering".
It's an acknowledgement that the "prompt" isn't just a block of text; context also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was really about how hard it is to manage context between agents.
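As a toy illustration of what "context" actually contains, here's a sketch of how an agent's prompt might be assembled from several sources; the structure and names are hypothetical, not any particular framework's API.

```python
# Illustrative only: an agent's context window is assembled from many sources,
# not just the user's message. All names and sections here are hypothetical.

def build_context(user_msg: str,
                  tool_docs: list[str],
                  retrieved_chunks: list[str],
                  notes_from_other_agents: list[str]) -> str:
    """Concatenate the pieces that compete for space in the context window."""
    sections = [
        "## Available tools\n" + "\n".join(tool_docs),
        "## Retrieved documents\n" + "\n".join(retrieved_chunks),
        "## Notes from other agents\n" + "\n".join(notes_from_other_agents),
        "## User request\n" + user_msg,
    ]
    return "\n\n".join(sections)

# "Context engineering" is deciding what goes in each section, how it's
# summarized, and what gets dropped when the window fills up.
```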
Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.
Similarly, sycophancy causes instability in agents. Many now consider it a top problem, on par with hallucination.
What to watch
- Memory — stateful agents (e.g. those built on Letta) are phenomenally interesting but difficult to build. Done well, memory solves a lot of the context engineering problem.
- Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.
Going forward…
And all that still skips over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of pure research.
What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.