🚨Llama 4 Is Out!🚨
2 out of 3 models just released
- Scout: 109B / 17B active
- Maverick: 400B / 17B active
- Behemoth: 2T / 288B active
ai.meta.com/blog/llama-4...
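those total/active splits are MoE configs: only a few experts fire per token. a quick sketch of the arithmetic — the expert counts (16 for scout, 128 for maverick) are from memory of the blog post, so double-check them:

```python
# Rough active-parameter arithmetic for the Llama 4 MoE configs.
# Expert counts are assumptions recalled from the announcement, not verified.
models = {
    "Scout":    {"total_b": 109, "active_b": 17, "experts": 16},
    "Maverick": {"total_b": 400, "active_b": 17, "experts": 128},
}

for name, m in models.items():
    frac = m["active_b"] / m["total_b"]
    print(f"{name}: {m['active_b']}B / {m['total_b']}B active "
          f"({frac:.0%}) across {m['experts']} experts")
```

same 17B active either way, so per-token compute is similar while maverick has far more total capacity to route into.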
fascinating that they kept the active size the same between scout & maverick, this is going to be fun to dig into
i made that unclear, but behemoth is still in training
oooo, it’s early fusion!
iirc gpt4o is still not early fusion, right?
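for anyone who hasn't seen the term: early fusion means image tokens go into the same transformer stack as text from layer 0, instead of being bolted on later via cross-attention. a toy sketch (all the dims here are made up):

```python
# "Early fusion" in miniature: project vision patches into the text
# embedding space, then run ONE interleaved sequence through the stack.
# Shapes and dims are illustrative only, not Llama 4's real config.
import numpy as np

d_model = 64
text_tokens = np.random.randn(10, d_model)   # already-embedded text
image_patches = np.random.randn(5, 32)       # raw vision patch features
proj = np.random.randn(32, d_model)          # patch -> token-space projection

image_tokens = image_patches @ proj
# one unified sequence; the same transformer layers attend over both modalities
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (15, 64)
```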
fp8 is the new hotness, thanks deepseek
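quick sanity check on what fp8 buys you, assuming the usual OCP e4m3 variant used for training:

```python
# OCP e4m3 fp8: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
# The all-ones exponent + all-ones mantissa pattern is NaN, so the
# largest finite value uses mantissa 110.
bias = 7
max_finite = (1 + 6 / 8) * 2 ** (15 - bias)   # 448.0
# smallest subnormal: exponent field 0000, mantissa 001
min_subnormal = (1 / 8) * 2 ** (1 - bias)     # 2**-9 ≈ 0.00195
print(max_finite, min_subnormal)
```

tiny dynamic range compared to bf16, which is why fp8 training leans on per-tensor/per-block scaling.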
oh damn! scout has a 10M context window!
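back-of-envelope on what 10M tokens means for the KV cache. the layer/head numbers below are placeholders, not scout's actual config:

```python
# KV-cache size at a 10M-token context, under assumed (not real) dims.
n_layers = 48        # assumption
n_kv_heads = 8       # assumption (GQA-style shared KV heads)
head_dim = 128       # assumption
bytes_per = 1        # fp8 K/V cache
seq_len = 10_000_000

# 2 tensors (K and V) per layer, per token
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per * seq_len
print(f"{kv_bytes / 1e9:.0f} GB of KV cache per sequence")
```

even at fp8 that's on the order of a terabyte per sequence under these assumptions, which is why long-context serving needs more than a plain dense cache.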
s1 strikes again: difficult problems are crucial. in post-training they used llama3 to curate the dataset down to just hard problems
oh btw, something from the DS paper today — small LLMs are unreasonably good judges and critics
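one plausible shape of that curation loop, sketched out. `judge` here is a stub standing in for a small-LLM grader (the thread says the real pipeline used llama3); the keep/drop rule is my guess at the idea, not their actual recipe:

```python
# Difficulty-based curation sketch: keep only the problems a weaker
# model's answer scores poorly on, i.e. problems that are still hard.
def judge(problem: str, answer: str) -> float:
    """Stub grader returning a score in [0, 1]. A real judge would be a
    small LLM prompted to grade the answer; this toy keys off the text."""
    return 0.0 if "hard" in problem else 1.0

def curate(dataset, threshold=0.5):
    # Low judge score on the weak model's answer => the problem survives.
    return [ex for ex in dataset
            if judge(ex["problem"], ex["weak_answer"]) < threshold]

data = [
    {"problem": "easy sum", "weak_answer": "4"},
    {"problem": "hard proof", "weak_answer": "?"},
]
print(curate(data))  # only the hard example survives
```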
whoah, interleaved attention layers with no positional embeddings
i’ll have to dig into iRoPE
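a minimal sketch of the interleaving idea: RoPE in most layers, and no positional signal at all (NoPE) in the interleaved ones (the period of 4 below is a guess, not the real schedule):

```python
# Interleaved positional encoding sketch: most layers rotate Q/K with
# RoPE, but every Nth layer skips positional encoding entirely, which
# is the part credited with better length extrapolation.
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs by position-dependent angles (standard RoPE)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

def apply_layer_pos(x: np.ndarray, layer_idx: int, nope_every: int = 4):
    # NoPE layer: attention sees no positional information at all.
    if layer_idx % nope_every == nope_every - 1:
        return x
    return rope(x)

x = np.random.randn(16, 64)
out0 = apply_layer_pos(x, layer_idx=0)  # RoPE applied
out3 = apply_layer_pos(x, layer_idx=3)  # NoPE layer, x passes through
```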