Gemini 2.5 tech report is out!
the tech report goes into great detail on how gemini was trained, i’ll write up my full take later when i have time
they also announced gemini-2.5-flash-lite, which they note is not the same as gemini-2.5-torch
blog.google/products/gem...
overall impression: the infrastructure
one big reason you should pay attention to Google: TPUv5 delivers 2x the compute per watt of v4
but also, they tweaked some algorithms to get rid of the I/O bottleneck, so their training run was 93% efficient (incredible!)
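to put “93% efficient” in perspective, here’s a back-of-envelope sketch of why hiding the I/O stall behind compute gets you a number like that. the numbers below are illustrative, mine not the report’s:

```python
# back-of-envelope: efficiency = time doing useful math / total step time
# illustrative numbers, not from the report
compute_per_step = 1.000   # seconds of useful accelerator work per step
io_stall_per_step = 0.075  # seconds stalled waiting on data each step

efficiency = compute_per_step / (compute_per_step + io_stall_per_step)
print(f"training efficiency: {efficiency:.1%}")  # -> 93.0%
```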
these TPUs are worth paying attention to because the chip design is done mostly by AI
each successive generation of chips trains new, stronger models, which in turn design dramatically more capable hardware
hard to overstate how important that is
research.google/blog/chip-de...
k-sparse logits: a multi-pronged optimization
the gist: they store far less data per token (just the sparse top logits) when saving distillation data
that makes the distillation data small enough that they’re no longer I/O bound: the network delivers the data faster than the training compute consumes it
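a minimal sketch of what top-k (“sparse”) logit storage for distillation could look like. the function names and k=64 are my assumptions, not the report’s actual format:

```python
import torch
import torch.nn.functional as F

def sparsify_logits(teacher_logits: torch.Tensor, k: int = 64):
    """teacher_logits: [seq_len, vocab] -> (values, indices), each [seq_len, k].
    only these top-k pairs get written to disk, instead of the full vocab row."""
    values, indices = torch.topk(teacher_logits, k, dim=-1)
    return values, indices

def sparse_distill_loss(student_logits, topk_values, topk_indices):
    """KL between the teacher's (renormalized) top-k distribution and the
    student's logits gathered at those same token ids."""
    student_on_topk = torch.gather(student_logits, -1, topk_indices)
    teacher_probs = F.softmax(topk_values, dim=-1)
    student_logp = F.log_softmax(student_on_topk, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

the storage win is roughly vocab_size/k per token — with a ~256k-entry vocab and k=64 that’s a >1000x cut in bytes per token, which is exactly the kind of thing that flips a run from I/O-bound to compute-bound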
hierarchical checkpoints
this is cool — pro, flash & lite are kinda the same model but with some experts removed
- attention blocks: identical
- experts: remove like 50% + short distill
they don’t re-do pre & post training, they just do a light distill to soothe the shock of removing experts
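a toy, state-dict-level sketch of that “keep attention, drop half the experts” derivation. the parameter naming scheme and the which-experts-to-keep heuristic here are placeholders of mine, not what Google actually does:

```python
import torch

def prune_experts(state_dict: dict, num_experts: int, keep_ratio: float = 0.5):
    """derive a smaller MoE checkpoint: attention weights copied verbatim,
    experts above the keep cutoff dropped. keeping the lowest-indexed experts
    is a stand-in heuristic for illustration only."""
    kept = set(range(int(num_experts * keep_ratio)))
    pruned = {}
    for name, tensor in state_dict.items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id in kept:
                pruned[name] = tensor
        else:
            pruned[name] = tensor  # attention blocks etc. kept identical
    return pruned  # then run a short distillation pass to recover quality
```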
flash-lite (i’m calling it torch) is just an 8-bit quant of flash. less memory, fewer transistors needed for compute = lighter model
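for intuition on why 8-bit means “lighter”, a minimal symmetric int8 weight-quant sketch. this uses a single per-tensor scale for simplicity; real serving stacks usually do per-channel scales or fancier schemes:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """symmetric per-tensor int8 quant: 4x smaller than fp32, 2x smaller than bf16."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print(q.element_size(), "byte/param vs", w.element_size())  # 1 vs 4
```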