Bluesky Thread

Anthropic full postmortem

- long context routing bug
- output corruption bug
- approx top-k bug on TPU

plus: why it was so difficult and slow to fix

this is going to be a thread

www.anthropic.com/engineering/...
A postmortem of three recent issues
This is a technical report on three bugs that intermittently degraded responses from Claude. Below we explain what happened, why it took time to fix, and what we're changing.
quoting the postmortem: "We don't typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation."

but you should

seriously, i appreciate the details, but even what’s here feels like the minimum level of transparency. imo we’re owed more than this
1. Long Context Routing

to support 1M context, they apparently run dedicated servers (presumably a differently mid-trained model) for long-context requests

a routing bug sent some requests to the wrong servers

routing is “sticky”, so affected sessions continued to be affected

(so, not stateless?)
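
(for intuition, a minimal sketch of what sticky routing can look like; the pools, threshold, and hashing scheme here are my guesses, not anthropic’s actual setup)

```python
import hashlib

# hypothetical server pools: dedicated long-context servers vs. standard ones
POOLS = {
    "long_context": ["lc-0", "lc-1"],
    "standard": ["std-0", "std-1", "std-2"],
}

def route(session_id: str, context_tokens: int) -> str:
    """Pick a pool by context length, then stick each session to one server."""
    # a bug in a check like this one can send requests to the wrong pool
    pool = POOLS["long_context" if context_tokens > 200_000 else "standard"]
    # "sticky": the same session always hashes to the same server, so a
    # mis-routed session keeps hitting the wrong pool on every request
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return pool[h % len(pool)]

print(route("sess-42", context_tokens=500_000))  # same lc-* server every time
```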
2. Output Corruption

sampler bug: when selecting the next token, it sometimes chose low-probability tokens

caused by a performance optimization; impacted only Opus 4/4.1, and only on TPU

(okay, so sampling is done in a kernel; llama.cpp/ollama do it on the CPU)
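
(for reference, a bare-bones temperature sampler, roughly the kind of thing done on CPU; a plain-numpy sketch, nothing like the actual fused TPU kernel)

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8,
                      rng=None) -> int:
    """Softmax over the logits, then draw one token id."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    z = z - z.max()              # stabilize before exp
    probs = np.exp(z)
    probs /= probs.sum()
    # a bug anywhere up to this point skews probs, and suddenly
    # low-probability tokens (wrong language, broken syntax) get picked
    return int(rng.choice(len(probs), p=probs))
```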
3. Approx Top-K on TPU

another sampler bug, again TPU-only

models are partitioned across many chips. logits are computed in bf16 but aggregated in fp32

different chips returning logits at different precisions exposed a latent mixed-precision bug in the XLA:TPU stack

damn, both floating point & dist sys in one bug..
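
(you can play with the primitive involved: jax.lax.approx_max_k trades a little recall for a lot of speed on TPU. this snippet just compares it to exact top-k; it does not reproduce the bug, which needed the mixed bf16/fp32 path on real serving infra)

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
# vocabulary-sized logits, stored in bf16 as on the serving path
logits = jax.random.normal(key, (128_000,), dtype=jnp.bfloat16)

# exact top-k over fp32 logits
exact_vals, exact_idx = jax.lax.top_k(logits.astype(jnp.float32), k=40)

# approximate top-k: faster on TPU, with a tunable recall target
approx_vals, approx_idx = jax.lax.approx_max_k(
    logits.astype(jnp.float32), k=40, recall_target=0.95
)

# a small disagreement between the two is expected by design;
# the bug made the approximate path far less accurate than that
recall = jnp.isin(approx_idx, exact_idx).mean()
print(f"recall vs exact top-k: {float(recall):.2%}")
```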
Why Detection Was Hard

- testing was too low-cardinality to catch intermittent degradation
- privacy guarantees meant engineers couldn’t just crack open user requests and see what was going wrong
- mixed hardware fleet
- dist sys: each bug manifested differently, and at different rates, on different platforms
Action Items

- more sensitive evals: they can’t just throw more data at it, so the evals themselves need to be more sensitive
- canaries: run some evals continuously in prod (rough sketch below)
- faster debugging tools
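
(what a prod canary could look like, very roughly; `generate` here is a stand-in for whatever inference client you have, and the prompts/thresholds are made up)

```python
import time

# golden prompt -> expected greedy completion (illustrative only)
GOLDEN = {
    "The capital of France is": " Paris",
    "2 + 2 =": " 4",
}

def alert(msg: str) -> None:
    print("ALERT:", msg)  # stand-in for a real paging hook

def run_canary(generate, interval_s: int = 300) -> None:
    """Replay fixed prompts against prod forever; page on any drift."""
    while True:
        failures = [p for p, want in GOLDEN.items()
                    if generate(p, temperature=0.0) != want]
        if failures:
            alert(f"canary drift on {len(failures)}/{len(GOLDEN)} prompts")
        time.sleep(interval_s)
```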
Takeaways

- long context is extremely hard: Anthropic completely gave up on serving all requests with the same model

- NVIDIA’s advantage is fewer sharp edges. JAX is great, but still weird

- time to detection and time to recovery should still be your top metrics
2 hours later, near (@nearcyan) replied:

this confirms my claude is french hypothesis; one may think it is evidence to the contrary, but remember that anthropic employees use claude internally, and in fact all three of these bugs were pushed during August, the longest time-off period for the French, often 2-3 weeks off!

quoting Claude (@claudeai): "We've published a detailed postmortem on three infrastructure bugs that affected Claude between August and early September..."
