Bluesky Thread

s1: Simple inference-time scaling

This is a simple small-scale replication of inference-time scaling

It was cheap: 16xH100 for 26 minutes (so what, ~$6?)

It replicates inference-time scaling using SFT only (no RL)

Extremely data frugal: 1000 samples

arxiv.org/abs/2501.19393
A set of three scatter plots showing the relationship between **average thinking time (tokens)** on the x-axis and **accuracy (%)** on the y-axis for three different reasoning-intensive tasks: **Mathematical Problem Solving (MATH500), Competition Math (AIME24), and PhD-Level Science Questions (GPQA Diamond).** 

Each scatter plot contains blue data points indicating the performance of the **s1-32B** model under different test-time compute conditions.

- **First plot (Mathematical Problem Solving - MATH500):**  
  - The accuracy starts around **65%** and increases as thinking time increases from **512 tokens to 2048 tokens.**
  - The final accuracy approaches **95%.**

- **Second plot (Competition Math - AIME24):**  
  - The accuracy starts at nearly **0%** for the lowest thinking time **(512 tokens)** and gradually improves as thinking time increases.
  - At **8192 tokens**, accuracy reaches approximately **40%.**

- **Third plot (PhD-Level Science Questions - GPQA Diamond):**  
  - The accuracy starts around **40%** for **512 tokens** and increases steadily.
  - At **4096 tokens**, accuracy exceeds **60%.**

Below the figure, a caption reads:  
**"Figure 1. Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks and vary test-time compute."**
29 7
The quirkiest part of this paper is budget forcing:

you can force the LLM to stay in its thinking stage for even longer by inserting words like "Wait" when it tries to terminate via the end-of-thinking delimiter

They did many variants with different thresholds and confirmed the inference-time scaling laws
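For intuition, here's a minimal sketch of that forcing loop, assuming a generic `decode` callable and a placeholder `END_THINK` delimiter (the paper operates on real tokens with the model's own delimiter; word counts stand in for token counts here):

```python
from typing import Callable

# Minimal budget-forcing sketch (hedged): END_THINK, WAIT, and the whitespace
# word count are illustrative stand-ins, not the paper's actual tokens.

END_THINK = "<|end_think|>"   # assumed end-of-thinking delimiter
WAIT = "Wait"                 # string appended to keep the model thinking

def budget_forced_think(
    prompt: str,
    decode: Callable[[str, str, int], str],  # (prefix, stop_string, max_new_words) -> text
    min_think: int,
    max_think: int,
) -> str:
    """Keep the thinking stage between min_think and max_think words by
    suppressing or appending the end-of-thinking delimiter."""
    thinking = ""
    while True:
        budget_left = max_think - len(thinking.split())
        thinking += decode(prompt + thinking, END_THINK, budget_left)
        if len(thinking.split()) >= max_think:
            break  # maximum budget hit: force the model out of thinking
        if len(thinking.split()) < min_think:
            thinking += " " + WAIT  # stopped too early: suppress delimiter, nudge it on
            continue
        break  # model stopped on its own, inside the budget
    # The final answer is then decoded from this prefix as usual.
    return prompt + thinking + END_THINK
```

The "Wait" insertion is the same mechanism the ablation table later in the thread calls a continuation string ("Hmm" and "Alternatively" are the other variants they try).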
Dataset mixture

They showed that dataset design is crucial, and you really don't even need that much data

Important parts (a rough filtering sketch follows the list):

1. Quality (e.g. formatting not broken)
2. Difficulty (longer trace or lower performance = harder)
3. Diversity (more subjects)
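A hypothetical sketch of that three-stage filter in Python. The field names (`well_formatted`, `trace`, `baseline_correct`, `subject`) and the thresholds are illustrative stand-ins, not the paper's actual schema or cutoffs:

```python
import random
from collections import defaultdict

def select_high_signal(samples: list[dict], per_subject: int = 20, seed: int = 0) -> list[dict]:
    """Shrink a large SFT pool down to a small high-signal subset."""
    rng = random.Random(seed)

    # 1. Quality: drop samples whose formatting is broken.
    pool = [s for s in samples if s["well_formatted"]]

    # 2. Difficulty: keep problems with long reasoning traces, or ones a
    #    baseline model gets wrong.
    pool = [s for s in pool
            if len(s["trace"].split()) > 2000 or not s["baseline_correct"]]

    # 3. Diversity: cap how many samples any single subject contributes.
    by_subject: dict[str, list[dict]] = defaultdict(list)
    for s in pool:
        by_subject[s["subject"]].append(s)

    selected: list[dict] = []
    for group in by_subject.values():
        rng.shuffle(group)
        selected.extend(group[:per_subject])
    return selected
```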
The R1 paper, OTOH, mixed RL & SFT but people seem to get most excited about R1-Zero, the pure-RL training run

s1 indicates that you can get a long way on SFT alone, so maybe R1 was right to also interleave them
FYI I used o3-mini to validate my understanding of the paper. Feel free to read the process

My favorite part of o3-mini is that it's comfortable stating opinions rather than simply sticking to facts

chatgpt.com/share/67a0f7...
Overall Hot Take:

The core idea behind test-time scaling and budget forcing—using extra computation to allow the model to refine or verify its output—is likely applicable wherever the task benefits from stepwise reasoning. For highly structured domains (like math, science, law, or even certain engineering problems), this approach could be very effective. In less structured areas (like creative writing), the benefits are less clear, and the method might need significant adaptation to avoid stifling creativity.
The most surprising part, maybe, is that adding data didn’t improve performance **at all** (o3-mini nailed this, it’s a big deal)

Their full dataset was 56K samples; they narrowed it down to 1K high-signal samples. Anything beyond that (even training on the full dataset) barely nudged performance
the cool part about these small-scale experiments is the comprehensive ablation studies. like check this out. my only complaint is they didn't try "oh crap"
**Table 4: Budget forcing extrapolation ablations.** The table compares different ways of handling the **end-of-thinking delimiter** when extrapolating test-time compute, across three benchmarks. "2x" means the delimiter was suppressed twice; the quoted word is the string appended each time it is suppressed. Best score per benchmark is in bold.

| Method | AIME 2024 (Competition Math) | MATH-500 (Mathematical Problem Solving) | GPQA Diamond (PhD-Level Science) |
|---|---|---|---|
| No extrapolation (baseline) | 50.0 | 93.0 | 57.6 |
| 2x without string | 50.0 | 90.2 | 55.1 |
| 2x "Alternatively" | 50.0 | 92.2 | **59.6** |
| 2x "Hmm" | 50.0 | 93.0 | **59.6** |
| 2x "Wait" | **53.3** | 93.0 | **59.6** |
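Reproducing this kind of ablation is mostly a tiny harness around the forcing loop. A hedged sketch, assuming a hypothetical `forced_solve(question, continuation)` pipeline (budget-forced decoding plus answer extraction, not shown) and a benchmark of (question, gold answer) pairs:

```python
from typing import Callable, Optional

# Hypothetical ablation harness; `forced_solve` is a stand-in for a full
# budget-forced decoding + answer-extraction pipeline.

CONTINUATIONS: list[Optional[str]] = [None, "", "Alternatively", "Hmm", "Wait"]
# None = no extrapolation; "" = force twice without appending any string.

def run_ablation(
    benchmark: list[tuple[str, str]],
    forced_solve: Callable[[str, Optional[str]], str],
) -> dict[Optional[str], float]:
    """Accuracy per continuation string on one benchmark."""
    scores: dict[Optional[str], float] = {}
    for cont in CONTINUATIONS:
        correct = sum(forced_solve(q, cont) == gold for q, gold in benchmark)
        scores[cont] = correct / len(benchmark)
    return scores
```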