Bluesky Thread

XBai-o4: a new supermodel


* Open weights, Apache 2.0
* 32B
* beats o3-mini
* for test-time compute (TTC) they train an extra head as a reward model to do binary classification (rough sketch below)

hf: huggingface.co/MetaStoneTec...
paper: arxiv.org/abs/2507.01951
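Rough sketch of what "an extra head as a reward model" could look like: a small scoring head on top of the base model's hidden states, trained as a binary classifier with a BCE objective. Layer sizes and names here are guesses for illustration, not taken from the paper or the repo.

```python
import torch
import torch.nn as nn

class ProcessRewardHead(nn.Module):
    """Hypothetical scoring head: linear -> dropout -> linear on shared hidden states."""
    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the shared base model
        # returns one score logit per position; in practice only step boundaries get scored
        return self.net(hidden_states).squeeze(-1)

# assumed binary-classification training signal: label steps good/bad, use BCE
head = ProcessRewardHead(hidden_size=5120)       # 5120 is a guess at the 32B model's width
hidden = torch.randn(2, 16, 5120)                # stand-in for base-model hidden states
labels = torch.randint(0, 2, (2, 16)).float()    # stand-in step-level labels
loss = nn.functional.binary_cross_entropy_with_logits(head(hidden), labels)
```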
A side-by-side bar chart compares the performance of three models, XBai o4 (blue), OpenAI-o3-mini (gray), and Claude Opus 4 (yellow), across two operational modes: Medium Mode and Low Mode. Each mode has results on three benchmarks: AIME24, AIME25, and LiveCodeBench v5.

Medium Mode:
* AIME24: XBai o4 (85.4), o3-mini (79.6), Claude Opus 4 (75.7)
* AIME25: XBai o4 (77.6), o3-mini (74.8), Claude Opus 4 (75.5)
* LiveCodeBench v5: XBai o4 (67.0), o3-mini (66.3), Claude Opus 4 (61.3)

Low Mode:
* AIME24: XBai o4 (82.4), o3-mini (60.0), Claude Opus 4 (75.7)
* AIME25: XBai o4 (74.8), o3-mini (48.3), Claude Opus 4 (75.5)
* LiveCodeBench v5: XBai o4 (66.6), o3-mini (62.0), Claude Opus 4 (61.3)

XBai o4 leads in nearly every category, with particularly strong performance on AIME24 in both modes. Claude Opus 4 trails o3-mini on two of the three Medium Mode benchmarks, but beats it decisively on both AIME sets in Low Mode.
also: who????
unbroken huggingface link: huggingface.co/MetaStoneTec...
MetaStoneTec/XBai-o4 · Hugging Face
it uses a Self-supervised Process Reward Model (SPRM) to grade several reasoning trajectories

the SPRM is technically a different model, but mostly not: same base plus 53M extra parameters for the grading head
Figure showing two parts: the top (a) illustrates the training framework and the bottom (b) illustrates the inference framework for Reflective Generative Models (RGMs).

Top: Training Framework (a)
* A Policy Model takes in a sequence of tokens:
  * Gray = Question/Answer tokens
  * Yellow = <think> and </think> tokens
  * Blue = Think process tokens
  * Orange = Step-tokens
* The model's internal representations (from layers N-1 and N) are fed into two components:
  * Policy Head, which outputs the final action (e.g. the text response), trained with the GRPO loss L_GRPO
  * SPRM Head, which outputs scores Score_1, ..., Score_n to rank candidate thought processes, optimized by L_SPR
* The SPRM Head consists of a linear layer, dropout, and another linear layer applied to a feature vector.

Bottom: Inference Framework (b)
* A question Q is processed by the Policy Model, which generates multiple thought sequences Think_1, ..., Think_k, each paired with a score s_1, ..., s_k from the SPRM Head.
* The highest-scoring thought s_j = max(s) is selected.
* That thought is fed back into the Policy Model to produce the final answer A.

This framework enables reflection during generation by scoring intermediate “thinking steps” and selecting the most promising one to continue with.
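If you squint, the inference loop in (b) is just best-of-n sampling with a learned judge. A toy sketch; generate_think, sprm_score, and generate_answer are made-up placeholder functions, not the repo's actual API:

```python
def best_of_n_answer(question: str, k: int = 8) -> str:
    # sample k independent thinking trajectories (Think_1 ... Think_k)
    candidates = [generate_think(question) for _ in range(k)]
    # score each trajectory with the SPRM head (s_1 ... s_k)
    scores = [sprm_score(question, think) for think in candidates]
    # keep the trajectory with the highest score, s_j = max(s)
    best = candidates[scores.index(max(scores))]
    # condition the final answer A on the winning trajectory
    return generate_answer(question, best)
```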
it’s safe to say we’re long past Ollama

with Ollama, it’s gotta be represented as a single GGUF execution graph. this double-uses the same weights Matryoshka-style to support a second model running at the same time
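Very roughly, the serving shape is one trunk feeding two heads, which is what a single static GGUF graph struggles to express. The attribute names below (base.transformer, base.lm_head) are illustrative, not the real checkpoint layout:

```python
def dual_head_forward(base, sprm_head, input_ids):
    hidden = base.transformer(input_ids)   # shared 32B trunk, run once
    policy_logits = base.lm_head(hidden)   # "policy model": next-token logits
    step_scores = sprm_head(hidden)        # "reward model": per-step SPRM scores
    return policy_logits, step_scores
```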
the vibe i get is that the OpenAI & DeepMind IMO Gold models also did similar tricks

is the SPRM the same model? sort of, i guess