Bluesky Thread

K2 is the first i’m aware of that did this, directly training on *thousands* of tools

o3 was narrowly designed for deep research & chatgpt. most models followed their lead
Large-Scale Agentic Data Synthesis for Tool Use Learning: To teach the model sophisticated tool-use capabilities, we developed a comprehensive pipeline inspired by ACEBench that simulates real-world tool-using scenarios at scale. Our approach systematically evolves hundreds of domains containing thousands of tools, including both real MCP (Model Context Protocol) tools and synthetic ones, then generates hundreds of agents with diverse tool sets.
also — they have an agentic model w/o chain of thought 🤔 i didn’t know that was possible

that essentially means this is a dramatically cheaper agentic model, bc it’s not spending its token budget on thinking

if you buy “more tokens = more intelligence”, then this one is shocking
i don’t recall seeing “user agent” used this way before, although i’ve wanted to use it at work

if you’re training or eval’ing an agent, you need an “agent” that represents a user, otherwise you need someone who has a very boring job — the “user agent”

why does HTTP say “User Agent”?
A flow diagram titled “Large Scale Agentic Data Synthesis” shows the pipeline for generating high-quality data using agents and tool-based environments. The flow proceeds as follows:
	•	Goal → evolves into Domains → evolves into Tools (Tools are supported by a box labeled MCP Tools)
	•	Tools → evolve into Agents, forming the core of a large dashed box labeled as the agentic environment.
	•	Agents:
		•	Interact with User Agents, which are also inside the dashed box.
		•	Rely on an Env (Tool Simulator) inside the same environment box.
		•	Are evolved using Tasks w/ rubrics from outside the box.
	•	User Agents → send data to the Judge, as do the Tasks w/ rubrics
	•	Judge outputs Filtered Data

The flow represents an iterative and evolving process where goals spawn domains and tools, which generate increasingly refined agents. These agents interact with simulated environments and user agents, and their performance is assessed by a judge using rubrics to filter useful data.
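the report doesn’t ship code, so this is just a rough sketch of what one rollout in a pipeline shaped like that diagram might look like. every name here (Task, rollout, user_agent.open, env.call, judge.score) is hypothetical; the point is the shape: a user agent and a tool simulator drive the agent, and a judge scores the transcript against the task’s rubric before the data is kept.

```python
# hypothetical sketch of one rollout in an agentic data-synthesis pipeline;
# none of this is from the K2 report, it just mirrors the boxes in the diagram:
# agent <-> user agent <-> env (tool simulator), filtered by a rubric judge.
from dataclasses import dataclass, field

@dataclass
class Task:
    instructions: str   # what the simulated user wants done
    rubric: list[str]   # criteria the judge scores the transcript against

@dataclass
class Transcript:
    turns: list[dict] = field(default_factory=list)

def rollout(task: Task, agent, user_agent, env, judge, max_turns: int = 20):
    """Run one simulated conversation; keep it only if the judge passes it."""
    transcript = Transcript()
    user_msg = user_agent.open(task)              # user agent states the goal
    transcript.turns.append({"user": user_msg})
    for _ in range(max_turns):
        action = agent.step(transcript)           # agent replies or calls a tool
        if action["type"] == "tool_call":
            result = env.call(action["name"], action["args"])  # simulated tool
            transcript.turns.append({"tool": action["name"], "result": result})
            continue
        transcript.turns.append({"assistant": action["text"]})
        user_msg = user_agent.reply(transcript, task)          # simulated user responds
        transcript.turns.append({"user": user_msg})
        if user_agent.is_satisfied(transcript, task):
            break
    score = judge.score(transcript, task.rubric)  # rubric-based judging
    return transcript if score >= judge.threshold else None   # filtered data
```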
it seems the takeaway for researchers is that if you RL first on math, then “Wait” & CoT emerge

but if you RL on agentic environments first, you get the same behavior but with fewer tokens

this feels like the heart of context engineering — we need better info (via tools), not more thinking
1 hour later
a tool i made at work — MCP auto mocker

take any FastMCP object (a server) and it generates mock versions of all the tools. use a data sheet to set up the mocks and you’ve got evals

but since it’s a real FastMCP, you can serve it over HTTP to test an out-of-process agent while keeping the mocks in-process
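the tool itself isn’t public, but here’s a minimal sketch of the idea, assuming fastmcp 2.x’s FastMCP / tool() API. the real auto mocker introspects an existing server and preserves each tool’s signature; this sketch just registers no-arg stand-ins whose responses come from a “data sheet” dict, then serves them over HTTP.

```python
# minimal sketch (not the actual auto mocker): a real FastMCP server whose
# tools are mocks driven by a "data sheet" of canned responses, so an
# out-of-process agent can call in-process mocks over HTTP.
from fastmcp import FastMCP

# hypothetical data sheet: tool name -> canned result for this eval run
DATA_SHEET = {
    "get_weather": {"city": "Tokyo", "forecast": "sunny", "high_c": 31},
    "create_ticket": {"ticket_id": "MOCK-123", "status": "open"},
}

mock_server = FastMCP("mock-tools")

def make_mock(name: str, canned: dict):
    """Build a no-arg mock tool that returns canned data from the data sheet.
    (The real auto mocker would copy the original tool's signature and schema.)"""
    def mock_tool() -> dict:
        return canned
    mock_tool.__name__ = name
    mock_tool.__doc__ = f"mock of {name}; returns canned data"
    return mock_tool

for tool_name, canned in DATA_SHEET.items():
    mock_server.tool(name=tool_name)(make_mock(tool_name, canned))

if __name__ == "__main__":
    # since it's a real FastMCP server, it can be served over HTTP
    # (transport name may differ by fastmcp version)
    mock_server.run(transport="streamable-http")
```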