Bluesky Thread

Meta FAIR just released CWM: a dense 32B code world model

What’s a Code World Model? Well, it’s trained to know the effect of running code, rather than just to mimic its surface text

hf: huggingface.co/facebook/cwm
paper: ai.meta.com/research/pub...
A two-part diagram comparing Agentic Reasoning and Agentic Reasoning with a world model.

⸻

Top: Agentic Reasoning

Flow:
	1.	Problem → Think → Action
	2.	Action interacts with the World, producing Env Feedback (environment feedback).
	3.	If the result is a Fail, the system loops back to Think → Action.
	4.	This cycle repeats until success.

Key point: Requires repeated trial-and-error with real-world feedback.

⸻

Bottom: Agentic Reasoning with a world model

Flow:
	1.	Problem → Think → World Model.
	2.	Inside the World Model:
	•	Imagine action
	•	Imagine Env Feedback
	•	Loops internally to refine (✔ or ✖ outcomes).
	3.	Only after internal simulation does it proceed to Action.

Key point: Uses imagination/simulation to test actions before execution, reducing failures in the real environment.

⸻

Contrast:
	•	Without world model = trial-and-error in reality.
	•	With world model = simulate feedback internally, leading to more efficient and safer reasoning (sketched in code below).
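In code, that bottom loop looks roughly like this. A minimal Python sketch with made-up toy names (think, simulate, solve, world_model, real_env are illustrative, not part of the CWM release):

def think(problem):
    # propose candidate actions in some order (here: just the given patches)
    return list(problem["candidates"])

def simulate(world_model, action, expected):
    # "imagine env feedback": ask the world model what the action would do,
    # without touching the real environment
    try:
        return world_model(action) == expected
    except Exception:
        return False

def solve(problem, world_model, real_env):
    for action in think(problem):                              # imagine an action
        if simulate(world_model, action, problem["expected"]):  # internal ✔
            return real_env(action)                            # only now act for real
    return None                                                # every imagined rollout failed ✖

def world_model(patch):
    # stand-in world model: predicts the test result a patch would produce
    return 3 if patch == "return n" else 4

def real_env(patch):
    # real execution, reached only after the internal simulation passes
    return f"applied: {patch}"

problem = {"candidates": ["return n + 1", "return n"], "expected": 3}
print(solve(problem, world_model, real_env))                   # -> applied: return n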
normally: LLMs are given pages of code and are challenged to predict the missing/next token

CWM: given a block of code and challenged to predict the output
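Roughly, the two task framings differ like this (a toy illustration; these prompt strings are made up, not the actual training format):

# Next-token prediction: the model sees a code prefix and completes the text.
lm_prompt = "def count(s, t):\n    n = 0\n    for c in s:\n        n += "
lm_target = "int(c == t)"

# Code world modeling: the model sees complete code and predicts what it does when run.
cwm_prompt = 'def count(s, t):\n    ...\ncount("strawberry", "r")'
cwm_target = "3"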
The image shows the classic “strawberry problem” in Python, where a function counts the occurrences of a character in a string.

Code being traced:

def count(s, t):
    n = 0
    for c in s:
        n += int(c == t)
    return n

count("strawberry", "r")  # << START_OF_TRACE

Trace breakdown:
	•	Input: s = "strawberry", t = "r".
	•	The function initializes n = 0.
	•	Iterates through each character in "strawberry".
	•	Whenever c == "r", adds 1 to n.
	•	At the end, returns the total count.

Key moments in trace:
	•	def count(s, t): starts execution.
	•	n = 0 sets counter.
	•	for c in s: loops through characters.
	•	Each iteration shows c (like 's', 't', 'r', etc.) and increments n when c == t.
	•	Final result: return n → 3.

Answer:

The string "strawberry" contains 3 occurrences of "r", so the function returns 3.

This trace visualization shows step-by-step execution with frames, line states, and actions—a pedagogical tool for debugging and understanding control flow.
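If you want to reproduce that kind of line-by-line trace yourself, the standard library's sys.settrace is a rough stand-in (ordinary Python tracing, not CWM's actual trace format):

import sys

def count(s, t):
    n = 0
    for c in s:
        n += int(c == t)
    return n

def tracer(frame, event, arg):
    # called by the interpreter; report the line number and locals just before
    # each line inside count() runs
    if event == "line" and frame.f_code.co_name == "count":
        print(f"line {frame.f_lineno}: {frame.f_locals}")
    return tracer

sys.settrace(tracer)
result = count("strawberry", "r")
sys.settrace(None)
print("return ->", result)  # -> return -> 3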
for a 32B it does stinkin’ good on the benchies

and open weights? from Meta?(!!)

is the US back in the open source AI game?
This bar chart compares solve rate/score (%) of various models, divided into Open Weights (left) and Proprietary (right).

Legend
	•	Solid color bars = Base performance.
	•	Striped extensions = Test Time Scaling (TTS) performance improvements.

⸻

Open Weights Models
	•	CWM: 65.8
	•	Qwen3-Coder-30B: 51.6
	•	Devstral-small-2507: 53.6
	•	DeepSeek-R1-0528: 57.6
	•	DeepSeek-SWE: 59.0 (higher with TTS)
	•	GPT-oss-20B (low-high): 60.7*
	•	GPT-oss-120B (low-high): 62.4* (with TTS boosting beyond 60)
	•	GLM-4.5: 64.2
	•	Kimi-K2 Instruct: 69.2
	•	Qwen3-Coder: 69.6

⸻

Proprietary Models
	•	Devstral-medium-2507: 61.6
	•	Gemini-2.5-thinking: 67.2
	•	GPT-5: 74.9* (with TTS improving further)
	•	Claude Sonnet-4: 80.2 (highest overall, with TTS boost)

⸻

Observations
	•	Claude Sonnet-4 leads with 80.2%.
	•	GPT-5 is second at 74.9%, boosted by TTS.
	•	Among Open Weights, the strongest performers are Qwen3-Coder (69.6) and Kimi-K2 Instruct (69.2).
	•	Mid-range open models (DeepSeek-SWE, GPT-oss variants) score in the 57–62% range.
	•	Smaller open models (Qwen3-Coder-30B, Devstral-small-2507) lag behind at ~51–54%.
