Bluesky Thread

SmolLM3: a highly detailed look into modern model training

this is amazing. They go into great detail on just about every aspect: the number of stages, algorithms, optimizer settings, datasets, blueprints, recipes, open source training scripts, …

huggingface.co/blog/smollm3
A scatter plot comparing small language models based on Win rate (%) vs Model Size (Billion parameters). The chart evaluates models on 12 popular LLM benchmarks.

⸻

Axes:
	•	X-axis (horizontal): Model size (in billions of parameters), ranging from ~1.7B to 4.5B
	•	Y-axis (vertical): Win rate (%)—higher is better—ranging from 2% to 5%

⸻

Highlighted Insight Areas:
	•	Top-left corner: Ideal zone for models that are better (higher win rate) and smaller (faster/cheaper)
	•	Diagonal line/gray band: Represents the tradeoff baseline; models above it are more efficient per parameter

⸻

Models Plotted:

Top-right quadrant (largest, highest win rate):
	•	Qwen3 4B – Highest win rate, ~5%
	•	Gemma3 4B – Slightly below Qwen3 4B

Mid-left (smaller but strong performance):
	•	SmolLM3 3B – Strong win rate (~4.4%), outperforming larger models
	•	Qwen2.5 3B – Moderate win rate (~3%)
	•	Llama3.2 3B – Slightly below Qwen2.5 3B

Lower-left (least performant):
	•	Qwen3 1.7B – Lowest win rate (~2%)

⸻

Conclusion:
	•	SmolLM3 3B stands out as the most efficient model, achieving a high win rate at a relatively small size.
	•	Qwen3 4B and Gemma3 4B are the top performers overall but less efficient per parameter.
	•	Qwen3 1.7B lags significantly behind in win rate, even accounting for its smaller size.
only care about the model weights? lame. but whatever:

- 3B instruct model with a toggleable reasoning mode (see the sketch after this list)
- SOTA for 3B, competitive with 4B
- 6 European languages
- 11T tokens
- 128k context
- NoPE for reduced memory usage during inference
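
For the reasoning toggle, a minimal sketch of how you might flip it from Python. It assumes the /think and /no_think system-prompt flags and the repo id described in the SmolLM3 model card; treat both as assumptions and check the released chat template before relying on them.

```python
# Minimal sketch of SmolLM3's toggleable reasoning mode via the system prompt.
# The /think and /no_think flags and the repo id are assumptions taken from the
# model card; verify against the released chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(question: str, reasoning: bool) -> str:
    # A flag in the system prompt switches extended thinking on or off.
    messages = [
        {"role": "system", "content": "/think" if reasoning else "/no_think"},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=512)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(chat("What is 17 * 24?", reasoning=True))   # reasoning trace, then the answer
print(chat("What is 17 * 24?", reasoning=False))  # direct answer
```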
the good stuff:

- details on pretraining, mid-training (both for context & reasoning) and post-training
- details on training configuration, stability, evals
- data mixture, stage details
- RL and adherence

and we’re only halfway through! there’s so much here
The image titled “Model Anatomy” presents a breakdown of the architecture and training configuration for a language model. It’s split into two main sections:

⸻

Left: Architecture Overview

Diagram
	•	Shows a standard transformer block:
	•	Begins with a Tokenizer
	•	Flows into Embedding
	•	Then passes through a stack of Layers (attention with Q, K, V projections plus a Feed Forward block) and ends in a SoftMax

Key Components
	•	Grouped Query Attention:
	•	16 query heads share 4 key-value heads (4 groups; see the sketches after this list)
	•	Matches full multi-head attention performance
	•	Reduces KV-cache memory usage during inference
	•	Intra-Document Masking:
	•	Prevents tokens from different documents attending to each other
	•	Improves training with long context
	•	NoPE (No Positional Embedding):
	•	Removes rotary positional embeddings every 4th layer
	•	Enhances long-context performance
	•	No Weight Decay in Embeddings:
	•	Increases training stability
	•	Embedding norms stabilize more naturally
	•	Multilingual Tokenizer:
	•	Uses LLaMA 3.2 tokenizer
	•	Supports multiple languages

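A minimal PyTorch sketch of the grouped query attention described above: 16 query heads attend over 4 shared key/value heads, so the KV projections (and the KV cache) shrink by 4x. Only the head counts come from the figure; every other dimension below is illustrative, not SmolLM3's actual config.

```python
# Sketch of grouped query attention (GQA): 16 query heads share 4 key/value heads.
# Dimensions other than the head counts are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 256, 2048
n_q_heads, n_kv_heads = 16, 4
head_dim = d_model // n_q_heads

q_proj = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
k_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 4x smaller than MHA
v_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 4x smaller than MHA

x = torch.randn(batch, seq_len, d_model)
q = q_proj(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# RoPE would normally be applied to q and k at this point; in the NoPE layers
# (every 4th layer), it is simply skipped.

# Each group of 16 / 4 = 4 query heads shares one KV head: expand K and V to match.
groups = n_q_heads // n_kv_heads
k = k.repeat_interleave(groups, dim=1)
v = v.repeat_interleave(groups, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 256, 128])
```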
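And a sketch of intra-document masking: when several documents are packed into one training sequence, the attention mask is the causal mask intersected with a block-diagonal "same document" mask built from per-token document ids, so tokens never attend across document boundaries. The tensor names and values are just for illustration.

```python
# Sketch of intra-document masking for packed training sequences.
# doc_ids says which document each token of the packed sequence came from.
import torch

doc_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])  # three packed documents
seq_len = doc_ids.shape[0]

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)  # block-diagonal (seq, seq)
attn_mask = causal & same_doc  # True = this pair of positions may attend

# Pass as F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask) so that,
# e.g., the first token of document 1 cannot see any token of document 0.
print(attn_mask.int())
```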
⸻

Right: Training Configuration
	•	Parameter count: 3.08B
	•	Initialization: Normal(0, 0.02)
	•	Layers: 36
	•	RoPE theta: 50k

Training Settings:
	•	Sequence length: 4096
	•	Batch size: 2.36M tokens
	•	Optimizer: AdamW (eps=1e-8, beta1=0.8, beta2=0.95); a config sketch follows this section
	•	Learning rate (peak): 2e-4
	•	Gradient clipping: 1.0
	•	Weight decay: 0.1
	•	Gradient accumulation: 1
	•	Micro batch size: 3
	•	Precision: bf16
	•	Tensor parallel: 2
	•	Data parallel: 192

Performance Metrics:
	•	Throughput: 14k tokens/sec/GPU
	•	MFU: 29.43%
	•	Training duration: 24 days

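The optimizer rows above map almost one-to-one onto PyTorch's AdamW. A minimal sketch with the listed values, excluding the embedding weights from weight decay as noted in the architecture section; the tiny stand-in model and the parameter-group split are assumptions, not the actual training code.

```python
# Sketch of the listed optimizer settings: AdamW(beta1=0.8, beta2=0.95, eps=1e-8),
# peak lr 2e-4, weight decay 0.1, gradient clipping at 1.0, and no weight decay
# on the embeddings. The tiny model and the parameter split are stand-ins.
import torch

model = torch.nn.Sequential(
    torch.nn.Embedding(50_000, 512),  # stand-in for the 3B model's embedding
    torch.nn.Linear(512, 512),        # stand-in for the transformer layers
)

decay, no_decay = [], []
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        no_decay.extend(module.parameters())  # embeddings: weight decay disabled
    elif isinstance(module, torch.nn.Linear):
        decay.extend(module.parameters())

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=2e-4,            # peak learning rate; a warmup/decay schedule sits on top
    betas=(0.8, 0.95),
    eps=1e-8,
)

# In each training step, gradients are clipped before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```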
⸻

The image gives a comprehensive snapshot of both model design and training practices optimized for multilingual, long-context use with efficiency-focused techniques like grouped attention and selective rotary embedding.
also, they launched the fully open R1 reproduction

link: huggingface.co/collections/...

bsky.app/profile/timk...
huggingface.co: Open R1-Zero Math - a open-r1 Collection
Tim Kellogg @timkellogg.me
huggingface is doing a fully open source replication of R1 github.com/huggingface/...