Bluesky Thread

DeepSeek did the “one more thing” 🙄

but guys, check this out, they go into detail on how they run inference on V3/R1, how they partition the experts across lots of nodes and pipeline attention and...

just read this 🤯

github.com/deepseek-ai/...
who even thinks of this stuff 🤯

The image contains a technical explanation and a diagram illustrating a dual-batch overlap strategy used to reduce communication overhead in large-scale cross-node expert parallelism (EP). The text at the top describes how splitting a batch of requests into two microbatches allows them to execute alternately, hiding each microbatch's communication cost behind the other microbatch's computation.
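Here is a minimal, self-contained Python sketch of that overlap idea: within each layer, one microbatch's (simulated) communication runs in the background while the other microbatch computes. The stage names match the diagram, but the functions and timings are illustrative placeholders, not DeepSeek's actual code.

import time
from concurrent.futures import ThreadPoolExecutor

def compute(stage, microbatch):
    # Stand-in for GPU computation (ATTN / MLP) on one microbatch.
    time.sleep(0.01)

def communicate(stage, microbatch):
    # Stand-in for cross-node all-to-all communication (DISPATCH / COMBINE).
    time.sleep(0.01)

def run_layer(comm_pool):
    # While microbatch 0 computes attention, microbatch 1's COMBINE is in flight,
    # so its communication cost is hidden behind computation.
    combine_mb1 = comm_pool.submit(communicate, "COMBINE", 1)
    compute("ATTN", 0)
    combine_mb1.result()
    # The roles then swap: microbatch 1 computes its MLP while microbatch 0's
    # DISPATCH runs in the background.
    dispatch_mb0 = comm_pool.submit(communicate, "DISPATCH", 0)
    compute("MLP", 1)
    dispatch_mb0.result()

# One worker thread plays the role of the communication lane in the figure.
with ThreadPoolExecutor(max_workers=1) as comm_pool:
    for _ in range(3):  # a few transformer layers
        run_layer(comm_pool)

On the real system the communication runs on GPU SMs reserved for it (the 24 SMs in the figure) rather than on a CPU thread; the thread here just makes the overlap visible in plain Python.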

Diagram Breakdown:
	•	The diagram is divided into two sections:
	•	Top section: Computation (108 SMs)
	•	Bottom section: Communication (24 SMs)
	•	Two microbatches (0 and 1) are represented in different colors:
	•	Microbatch 0 is in yellow.
	•	Microbatch 1 is in green.
	•	The computation layer (top section) consists of:
	•	ATTN (Attention: MLA and the MoE routing gate)
	•	SHARED (Shared experts)
	•	MLP (Multi-layer perceptron)
	•	The communication layer (bottom section) consists of:
	•	COMBINE
	•	DISPATCH
	•	Execution alternates between microbatches:
	•	When microbatch 0 is performing ATTN computation, microbatch 1 is in the COMBINE communication stage.
	•	When microbatch 1 is computing MLP, microbatch 0 is handling DISPATCH communication.
	•	The cycle repeats, ensuring that communication costs are hidden behind computation (these pairings are rendered in the short sketch at the end of this description).

Key Labels & Definitions:
	•	ATTN: MLA and Mixture of Experts (MoE) routing gate.
	•	SHARED: Shared experts used across the batches.

This technique helps increase throughput and reduce latency in large-scale distributed AI models by optimizing communication and computation scheduling.
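To make the alternation concrete, here is a tiny Python rendering of the two overlap pairings called out in the breakdown above; it shows only those two moments, not DeepSeek's full execution trace.

pairings = [
    # (computation on the 108 SMs, communication on the 24 SMs)
    ("ATTN     microbatch 0", "COMBINE  microbatch 1"),
    ("MLP      microbatch 1", "DISPATCH microbatch 0"),
]

print(f"{'Computation (108 SMs)':<28}Communication (24 SMs)")
for compute_stage, comm_stage in pairings:
    print(f"{compute_stage:<28}{comm_stage}")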