Bluesky Thread

DeepSeek did the “one more thing” 🙄

but guys, check this out, they go into detail on how they run inference on V3/R1, how they partition the experts across lots of nodes and pipeline attention and...

just read this 🤯

github.com/deepseek-ai/...
who even thinks of this stuff 🤯

The image contains a technical explanation and a diagram illustrating a dual-batch overlap strategy used to reduce communication overhead in large-scale cross-node expert parallelism (EP). The text at the top describes how splitting a batch of requests into two microbatches allows them to execute alternately, hiding each microbatch's communication cost behind the other microbatch's computation.
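Here is a minimal, self-contained Python sketch of that overlap idea: within each layer, one microbatch's (simulated) communication runs in the background while the other microbatch computes. The stage names match the diagram, but the functions and timings are illustrative placeholders, not DeepSeek's actual code.

import time
from concurrent.futures import ThreadPoolExecutor

def compute(stage, microbatch):
    # Stand-in for GPU computation (ATTN / MLP) on one microbatch.
    time.sleep(0.01)

def communicate(stage, microbatch):
    # Stand-in for cross-node all-to-all communication (DISPATCH / COMBINE).
    time.sleep(0.01)

def run_layer(comm_pool):
    # While microbatch 0 computes attention, microbatch 1's COMBINE is in flight,
    # so its communication cost is hidden behind computation.
    combine_mb1 = comm_pool.submit(communicate, "COMBINE", 1)
    compute("ATTN", 0)
    combine_mb1.result()
    # The roles then swap: microbatch 1 computes its MLP while microbatch 0's
    # DISPATCH runs in the background.
    dispatch_mb0 = comm_pool.submit(communicate, "DISPATCH", 0)
    compute("MLP", 1)
    dispatch_mb0.result()

# One worker thread plays the role of the communication lane in the figure.
with ThreadPoolExecutor(max_workers=1) as comm_pool:
    for _ in range(3):  # a few transformer layers
        run_layer(comm_pool)

On the real system the communication runs on GPU SMs reserved for it (the 24 SMs in the figure) rather than on a CPU thread; the thread here just makes the overlap visible in plain Python.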

Diagram Breakdown:
	•	The diagram is divided into two sections:
	•	Top section: Computation (108 SMs)
	•	Bottom section: Communication (24 SMs)
	•	Two microbatches (0 and 1) are represented in different colors:
	•	Microbatch 0 is in yellow.
	•	Microbatch 1 is in green.
	•	The computation layer (top section) consists of:
	•	ATTN (Attention: MLA and the MoE routing gate)
	•	SHARED (Shared experts)
	•	MLP (Multi-layer perceptron)
	•	The communication layer (bottom section) consists of:
	•	COMBINE
	•	DISPATCH
	•	Execution alternates between microbatches:
	•	When microbatch 0 is performing ATTN computation, microbatch 1 is in the COMBINE communication stage.
	•	When microbatch 1 is computing MLP, microbatch 0 is handling DISPATCH communication.
	•	The cycle repeats, ensuring that communication costs are hidden behind computation (these pairings are rendered in the short sketch at the end of this description).

Key Labels & Definitions:
	•	ATTN: MLA and Mixture of Experts (MoE) routing gate.
	•	SHARED: Shared experts used across the batches.

This technique helps increase throughput and reduce latency in large-scale distributed AI models by optimizing communication and computation scheduling.
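To make the alternation concrete, here is a tiny Python rendering of the two overlap pairings called out in the breakdown above; it shows only those two moments, not DeepSeek's full execution trace.

pairings = [
    # (computation on the 108 SMs, communication on the 24 SMs)
    ("ATTN     microbatch 0", "COMBINE  microbatch 1"),
    ("MLP      microbatch 1", "DISPATCH microbatch 0"),
]

print(f"{'Computation (108 SMs)':<28}Communication (24 SMs)")
for compute_stage, comm_stage in pairings:
    print(f"{compute_stage:<28}{comm_stage}")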