Anthropic full postmortem
- long context routing bug
- output corruption bug
- approx top-k bug on TPU
plus: why it was so difficult and slow to fix
this is going to be a thread
www.anthropic.com/engineering/...
seriously, i appreciate the details, but even what’s here feels like the minimum level of transparency. imo we’re owed more than this
1. Long Context Routing
to support 1M context, they apparently run separate servers (presumably a differently mid-trained model) for long-context requests
a routing bug caused requests to be mis-routed
routing is “sticky”, so affected sessions continued to be affected
(so, not stateless?)
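the “sticky” part can be sketched in a few lines (server names and pool layout are made up; this just illustrates why a one-time misroute keeps hurting the same session):

```python
import hashlib

# hypothetical sketch of sticky routing: a session is hashed to a server
# once, and every later request in that session reuses the cached route
POOLS = {"standard": ["std-0", "std-1"], "long_context": ["lc-0", "lc-1"]}
ROUTE_CACHE = {}  # session id -> server (this cache is the stickiness)

def route(session_id, pool):
    if session_id not in ROUTE_CACHE:
        servers = POOLS[pool]
        idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(servers)
        ROUTE_CACHE[session_id] = servers[idx]
    return ROUTE_CACHE[session_id]

# a routing bug sends a short request to the long-context pool once...
server = route("sess-42", "long_context")
# ...and the session stays pinned there even after the classifier is fixed
assert route("sess-42", "standard") == server
```

so the router itself holds per-session state, hence the “not stateless?” aside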
2. Output Corruption
sampler bug: when selecting the next token, they sometimes chose low-probability tokens
caused by a performance optimization; impacted only Opus 4/4.1 on TPU
(okay, so sampling is done in a kernel; llama.cpp/ollama does it on CPU)
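for contrast, a minimal (hypothetical, pure-python) nucleus/top-p sampler of the kind llama.cpp runs on CPU; a correct sampler can only ever emit tokens from the kept head, which is exactly the invariant the buggy kernel violated:

```python
import math, random

def top_p_sample(logits, p=0.9, rng=random):
    """minimal nucleus (top-p) sampling sketch: keep the smallest set of
    tokens whose cumulative probability reaches p, then sample from it"""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # softmax numerators
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for prob, i in ranked:
        kept.append((prob, i))
        cum += prob
        if cum >= p:
            break  # the low-probability tail is excluded by design
    r = rng.random() * cum
    for prob, i in kept:
        r -= prob
        if r <= 0:
            return i
    return kept[-1][1]

top_p_sample([10.0, 0.0, 0.0])  # one dominant token -> always returns index 0
```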
3. Approx Top-K on TPU
another sampler bug, only on TPU
models are partitioned across many TPU chips. logits are computed in bf16 but aggregated in fp32
different chips returning different precisions exposed a mixed-precision bug in JAX
damn, both floating point & dist sys in one bug..
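the mixed-precision failure mode reproduces in a few lines (logit values are made up; bf16 is emulated by truncating the low mantissa bits of fp32):

```python
import struct

def bf16_round(x):
    """truncate a float32 to bfloat16 precision (drop the low 16 bits)"""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

logit_x = 1.002  # the true winner
logit_y = 1.001

# consistent fp32 comparison everywhere: token x wins, correctly
assert max([(logit_x, "x"), (logit_y, "y")])[1] == "x"

# mixed precision: one shard emits bf16, another fp32
mixed = [(bf16_round(logit_x), "x"), (logit_y, "y")]
assert max(mixed)[1] == "y"  # bf16 rounds 1.002 down to 1.0, so y wins
```

bf16 only has 8 mantissa bits, so near 1.0 anything closer than ~0.004 collapses together; once shards disagree on precision, the argmax itself disagrees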
Why Detection Was Hard
- testing was too low cardinality: too few samples to surface low-rate bugs
- privacy guarantees meant engineers couldn’t just crack open requests and see what was going wrong
- mixed hardware: the fleet spans multiple accelerator platforms
- dist sys: each bug manifested differently, and at different rates, on different platforms
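“too low cardinality” has simple math behind it: if a bug independently corrupts a fraction p of requests, the number of eval requests needed to see it even once grows like 1/p (numbers below are illustrative):

```python
import math

def samples_needed(p, confidence=0.95):
    """requests needed to observe at least one failure with the given
    confidence, assuming each request fails independently with probability p"""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

print(samples_needed(0.01))    # 299 requests for a bug hitting 1% of traffic
print(samples_needed(0.0001))  # 29956 for one hitting 0.01%
```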
Action Items
- more sensitive evals: they can’t just run more data, so each eval needs to detect smaller regressions
- canaries: run some evals continuously in prod
- faster debugging tools
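a prod canary along the lines of that action item could be as simple as this hypothetical sketch (the prompt, baseline, and window are invented):

```python
def canary_passes(model_call, baseline=0.98, window=200):
    """score a known-answer prompt `window` times against prod;
    return False (i.e. alert) if the pass rate drops below baseline"""
    hits = sum(model_call("What is 2 + 2?") == "4" for _ in range(window))
    return hits / window >= baseline

canary_passes(lambda prompt: "4")  # healthy model -> True
```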
Takeaways:
- long context is extremely hard: Anthropic completely gave up on serving all requests with the same model
- NVIDIA’s advantage is fewer sharp edges. JAX is great, but still weird
- time to detection and time to recovery should still be your top metrics