Bluesky Thread

The Case Against Multi-Agents

Cognition (i.e. Devin) coins the term “context engineering”, the successor to prompt engineering, and argues that multi-agents don’t pass context effectively

cognition.ai/blog/dont-bu...
cognition.ai
Cognition | Don’t Build Multi-Agents
Frameworks for LLM Agents have been surprisingly disappointing. I want to offer some principles for building agents based on our own trial & error, and explain why some tempting ideas are actually qui...
Anthropic’s take appears to conflict with Cognition’s, but I do not think it does

for example, here Cognition introduces the idea of a “Context Compression LLM”

that’s a multi-agent

also, Anthropic’s whole point was that intelligence = compression
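
To make the “that’s a multi-agent” point concrete, here is a rough sketch of a context-compression step. Everything here is hypothetical (the names, prompts, and `call_llm` stub are mine, not Cognition’s implementation); the point is just that the compressor is a second model call deciding what a sub-agent gets to see.

```python
# Hypothetical sketch: a "context compression LLM" is itself a second model
# call that decides what a sub-agent needs to know, i.e. a multi-agent hand-off.
# `call_llm` is a stand-in for any chat-completion client.

def call_llm(system: str, prompt: str) -> str:
    raise NotImplementedError("plug in a real model client here")

def run_subtask(full_history: str, subtask: str) -> str:
    # Agent #2 (the compressor): squeeze the main agent's history down to
    # only the details the subtask actually needs.
    brief = call_llm(
        system="Summarize only the details relevant to the subtask.",
        prompt=f"History:\n{full_history}\n\nSubtask: {subtask}",
    )
    # The sub-agent then works from the compressed brief, not the raw history.
    return call_llm(
        system="You are a sub-agent. Complete the task using the context given.",
        prompt=f"Context:\n{brief}\n\nTask: {subtask}",
    )
```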

bsky.app/profile/timk...
This diagram illustrates a framework for mutual predictability scoring in an LLM-based labeling or verification context.

⸻

🧠 How It Works

1. Mutual Predictability Scoring (Top Section)
	•	You start with a set of labeled examples — each one a group of labeled statements (e.g., “Claim A is True,” “Claim B is False,” etc.).
	•	For each example, you ask the model to predict the labels of all items in that example (e.g., “Claim B is False; Claim C is True; Claim A is True”).
	•	The model assigns a probability score (log‑probability) to each predicted label.
	•	You sum those log-probs across the claims in a single example, yielding a joint likelihood score (these are P_A, P_B, P_C, etc.).
	•	High scores mean the labels are collectively consistent and predictable (see the code sketch after this list).
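
A minimal sketch of that per-example scoring, assuming a hypothetical `label_logprob` helper that returns the model’s log-probability for each claim’s label when the whole example is shown at once (the names here are mine, not from the diagram):

```python
from typing import Callable

# Hypothetical scorer: given one example's claims and a candidate labeling,
# return the model's log-probability for each claim's label, with the whole
# example in view at once.
LabelLogprobFn = Callable[[list[str], list[bool]], list[float]]

def joint_utility(claims: list[str], labels: list[bool],
                  label_logprob: LabelLogprobFn) -> float:
    """Sum per-claim label log-probs into one joint likelihood score
    (the P_A, P_B, ... from the diagram). Higher = more mutually predictable."""
    return sum(label_logprob(claims, labels))
```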

2. Search & Update Procedure (Bottom Section)
	•	Start with existing data D (seed examples) and its aggregated utility score U(D).
	•	Sample new data — for example, a new statement (“5+5=10 is True”).
	•	The model proposes consistent label combinations for this new data (D’), each with its own utility score U(D’).
	•	Compare:
Δ = U(D’) - U(D)
	•	Use a stochastic acceptance test (sketched in code after this list):
	•	If Δ is high (the new data makes the set more predictable), you accept and update your dataset (thumbs up 👍).
	•	If not, you reject (thumbs down 👎).
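
A rough sketch of the search-and-update loop follows. The Metropolis-style acceptance probability is an assumption about what the “stochastic acceptance test” could look like; `joint_utility_fn`, `temperature`, and the data layout are likewise mine, not details from the diagram:

```python
import math
import random

def dataset_utility(dataset, joint_utility_fn):
    """U(D): sum of the joint predictability scores of every example in D."""
    return sum(joint_utility_fn(example) for example in dataset)

def propose_and_update(dataset, new_example, joint_utility_fn, temperature=1.0):
    """One search step: compare D against D' = D plus the new example (with its
    proposed labels), then accept or reject stochastically."""
    u_old = dataset_utility(dataset, joint_utility_fn)
    u_new = dataset_utility(dataset + [new_example], joint_utility_fn)
    delta = u_new - u_old  # Delta = U(D') - U(D)
    # Metropolis-style rule (assumed): improvements are accepted, worse moves
    # are accepted with probability exp(delta / temperature).
    accept_prob = 1.0 if delta >= 0 else math.exp(delta / temperature)
    if random.random() < accept_prob:
        return dataset + [new_example]   # accept: keep D'
    return dataset                       # reject: keep D unchanged
```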

⸻

🔍 Why It Matters
	•	Consistency-focused labeling: The model tries to predict labels for all claims at once, rather than one at a time, so you favor sets that are mutually coherent.
	•	Data-driven expansion: Only new examples that increase overall predictability are kept, making the labeling process more robust and reliable.
	•	Probabilistic selection: The stochastic update rule is not strictly greedy, which helps avoid local traps and overfitting and leads to more balanced datasets.
Tim Kellogg @timkellogg.me
pretty strong argument for multi-agents

www.anthropic.com/engineering/...
Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance. For instance, although individual humans have become more intelligent in the last 100,000 years, human societies have become exponentially more capable in the information age because of our collective intelligence and ability to coordinate. Even generally-intelligent agents face limits when operating as individuals; groups of agents can accomplish far more.
you also have to pay attention to the problems they’re approaching

Anthropic: research. They explicitly said LLMs are good at *highly parallelizable tasks*, i.e. tasks that don’t require context sharing

Cognition: code, where context is inherently shared
the Cognition post makes the point that *today’s LLMs* are bad at intelligently sharing context

in that, if I want to brief you so you can do a sub-task, I tell you what you need to know. LLMs have trouble knowing what’s important

Anthropic dwells in the future
regardless, you can get a very long way with agents & LLMs by simply reasoning about information flow

Anthropic focused more on compression, but didn’t say much about what they were doing to achieve it. Cognition’s take is less heady, but they’re saying the same thing
