Bluesky Thread

Meta FAIR just released CWM: a dense 32B code world model

What’s a Code World Model? Well, it’s trained to know the effect of running code, rather than just to mimic its surface text

hf: huggingface.co/facebook/cwm
paper: ai.meta.com/research/pub...
A two-part diagram comparing Agentic Reasoning and Agentic Reasoning with a world model.

⸻

Top: Agentic Reasoning

Flow:
	1.	Problem → Think → Action
	2.	Action interacts with the World, producing Env Feedback (environment feedback).
	3.	If the result is a Fail, the system loops back to Think → Action.
	4.	This cycle repeats until success.

Key point: Requires repeated trial-and-error with real-world feedback.

⸻

Bottom: Agentic Reasoning with a world model

Flow:
	1.	Problem → Think → World Model.
	2.	Inside the World Model:
	•	Imagine action
	•	Imagine Env Feedback
	•	Loops internally to refine (✔ or ✖ outcomes).
	3.	Only after internal simulation does it proceed to Action.

Key point: Uses imagination/simulation to test actions before execution, reducing failures in the real environment.

⸻

Contrast:
	•	Without world model = trial-and-error in reality.
	•	With world model = simulate feedback internally, leading to more efficient and safer reasoning (sketched in code below).
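In code, that bottom loop looks roughly like this. A minimal Python sketch with made-up toy names (think, simulate, solve, world_model, real_env are illustrative, not part of the CWM release):

def think(problem):
    # propose candidate actions in some order (here: just the given patches)
    return list(problem["candidates"])

def simulate(world_model, action, expected):
    # "imagine env feedback": ask the world model what the action would do,
    # without touching the real environment
    try:
        return world_model(action) == expected
    except Exception:
        return False

def solve(problem, world_model, real_env):
    for action in think(problem):                              # imagine an action
        if simulate(world_model, action, problem["expected"]):  # internal ✔
            return real_env(action)                            # only now act for real
    return None                                                # every imagined rollout failed ✖

def world_model(patch):
    # stand-in world model: predicts the test result a patch would produce
    return 3 if patch == "return n" else 4

def real_env(patch):
    # real execution, reached only after the internal simulation passes
    return f"applied: {patch}"

problem = {"candidates": ["return n + 1", "return n"], "expected": 3}
print(solve(problem, world_model, real_env))                   # -> applied: return n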
normally: LLMs are given pages of code and are challenged to predict the missing/next token

CWM: given a block of code and challenged to predict the output
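Roughly, the two task framings differ like this (a toy illustration; these prompt strings are made up, not the actual training format):

# Next-token prediction: the model sees a code prefix and completes the text.
lm_prompt = "def count(s, t):\n    n = 0\n    for c in s:\n        n += "
lm_target = "int(c == t)"

# Code world modeling: the model sees complete code and predicts what it does when run.
cwm_prompt = 'def count(s, t):\n    ...\ncount("strawberry", "r")'
cwm_target = "3"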
The image shows the classic “strawberry problem” in Python, where a function counts the occurrences of a character in a string.

Code being traced:

def count(s, t):
    n = 0
    for c in s:
        n += int(c == t)
    return n

count("strawberry", "r")  # << START_OF_TRACE

Trace breakdown:
	•	Input: s = "strawberry", t = "r".
	•	The function initializes n = 0.
	•	Iterates through each character in "strawberry".
	•	Whenever c == "r", adds 1 to n.
	•	At the end, returns the total count.

Key moments in trace:
	•	def count(s, t): starts execution.
	•	n = 0 sets counter.
	•	for c in s: loops through characters.
	•	Each iteration shows c (like 's', 't', 'r', etc.) and increments n when c == t.
	•	Final result: return n → 3.

Answer:

The string "strawberry" contains 3 occurrences of "r", so the function returns 3.

This trace visualization shows step-by-step execution with frames, line states, and actions—a pedagogical tool for debugging and understanding control flow.
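If you want to reproduce that kind of line-by-line trace yourself, the standard library's sys.settrace is a rough stand-in (ordinary Python tracing, not CWM's actual trace format):

import sys

def count(s, t):
    n = 0
    for c in s:
        n += int(c == t)
    return n

def tracer(frame, event, arg):
    # called by the interpreter; report the line number and locals just before
    # each line inside count() runs
    if event == "line" and frame.f_code.co_name == "count":
        print(f"line {frame.f_lineno}: {frame.f_locals}")
    return tracer

sys.settrace(tracer)
result = count("strawberry", "r")
sys.settrace(None)
print("return ->", result)  # -> return -> 3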
for a 32B it does stinkin’ good on the benchies

and open weights? from Meta?(!!)

is the US back in the open source AI game?
This bar chart compares solve rate/score (%) of various models, divided into Open Weights (left) and Proprietary (right).

Legend
	•	Solid color bars = Base performance.
	•	Striped extensions = Test Time Scaling (TTS) performance improvements.

⸻

Open Weights Models
	•	CWM: 65.8
	•	Qwen3-Coder-30B: 51.6
	•	Devstral-small-2507: 53.6
	•	DeepSeek-R1-0528: 57.6
	•	DeepSeek-SWE: 59.0 (higher with TTS)
	•	GPT-oss-20B (low-high): 60.7*
	•	GPT-oss-120B (low-high): 62.4* (with TTS boosting beyond 60)
	•	GLM-4.5: 64.2
	•	Kimi-K2 Instruct: 69.2
	•	Qwen3-Coder: 69.6

⸻

Proprietary Models
	•	Devstral-medium-2507: 61.6
	•	Gemini-2.5-thinking: 67.2
	•	GPT-5: 74.9* (with TTS improving further)
	•	Claude Sonnet-4: 80.2 (highest overall, with TTS boost)

⸻

Observations
	•	Claude Sonnet-4 leads with 80.2%.
	•	GPT-5 is second at 74.9%, boosted by TTS.
	•	Among Open Weights, the strongest performers are Qwen3-Coder (69.6) and Kimi-K2 Instruct (69.2).
	•	Mid-range open models (DeepSeek-SWE, GPT-oss variants) score in the 57–62% range.
	•	Smaller open models (Qwen3-Coder-30B, Devstral-small-2507) lag behind at ~51–54%.
