Gemini 2.5 tech report is out!
the tech report goes into great detail on how gemini was trained, i’ll write up my full take later when i have time
they also announced gemini-2.5-flash-lite, which they note is not the same as gemini-2.5-torch
blog.google/products/gem...
overall impression: the infrastructure
one big reason you should pay attention to Google: TPUv5 delivers 2x the compute per watt of v4
but also, they tweaked some algorithms to get rid of the I/O bottleneck, so their training run was 93% efficient (incredible!)
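to put “93% efficient” in perspective, here’s a back-of-envelope sketch of why hiding the I/O stall behind compute gets you a number like that. the numbers below are illustrative, mine not the report’s:

```python
# back-of-envelope: efficiency = time doing useful math / total step time
# illustrative numbers, not from the report
compute_per_step = 1.000   # seconds of useful accelerator work per step
io_stall_per_step = 0.075  # seconds stalled waiting on data each step

efficiency = compute_per_step / (compute_per_step + io_stall_per_step)
print(f"training efficiency: {efficiency:.1%}")  # -> 93.0%
```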
these TPUs are worth paying attention to because the chip design is done mostly by AI
each successive generation of chips trains new, stronger models, which in turn design dramatically more capable hardware
hard to overstate how important that is
research.google/blog/chip-de...
k-sparse logits: a multi-pronged optimization
the gist: they store far less data per token (just the sparse top logits) when saving distillation data
that makes the distillation data small enough that they’re no longer I/O bound: the network delivers the data faster than the training compute consumes it
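a minimal sketch of what top-k (“sparse”) logit storage for distillation could look like. the function names and k=64 are my assumptions, not the report’s actual format:

```python
import torch
import torch.nn.functional as F

def sparsify_logits(teacher_logits: torch.Tensor, k: int = 64):
    """teacher_logits: [seq_len, vocab] -> (values, indices), each [seq_len, k].
    only these top-k pairs get written to disk, instead of the full vocab row."""
    values, indices = torch.topk(teacher_logits, k, dim=-1)
    return values, indices

def sparse_distill_loss(student_logits, topk_values, topk_indices):
    """KL between the teacher's (renormalized) top-k distribution and the
    student's logits gathered at those same token ids."""
    student_on_topk = torch.gather(student_logits, -1, topk_indices)
    teacher_probs = F.softmax(topk_values, dim=-1)
    student_logp = F.log_softmax(student_on_topk, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

the storage win is roughly vocab_size/k per token — with a ~256k-entry vocab and k=64 that’s a >1000x cut in bytes per token, which is exactly the kind of thing that flips a run from I/O-bound to compute-bound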
hierarchical checkpoints
this is cool — pro, flash & lite are kinda the same model but with some experts removed
- attention blocks: identical
- experts: remove like 50% + short distill
they don’t re-do pre & post training, they just do a light distill to soothe the shock of removing experts
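a toy, state-dict-level sketch of that “keep attention, drop half the experts” derivation. the parameter naming scheme and the which-experts-to-keep heuristic here are placeholders of mine, not what Google actually does:

```python
import torch

def prune_experts(state_dict: dict, num_experts: int, keep_ratio: float = 0.5):
    """derive a smaller MoE checkpoint: attention weights copied verbatim,
    experts above the keep cutoff dropped. keeping the lowest-indexed experts
    is a stand-in heuristic for illustration only."""
    kept = set(range(int(num_experts * keep_ratio)))
    pruned = {}
    for name, tensor in state_dict.items():
        if ".experts." in name:
            expert_id = int(name.split(".experts.")[1].split(".")[0])
            if expert_id in kept:
                pruned[name] = tensor
        else:
            pruned[name] = tensor  # attention blocks etc. kept identical
    return pruned  # then run a short distillation pass to recover quality
```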
flash-lite (i’m calling it torch) is just an 8-bit quant of flash. less memory, fewer transistors needed for compute = lighter model
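for intuition on why 8-bit means “lighter”, a minimal symmetric int8 weight-quant sketch. this uses a single per-tensor scale for simplicity; real serving stacks usually do per-channel scales or fancier schemes:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """symmetric per-tensor int8 quant: 4x smaller than fp32, 2x smaller than bf16."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print(q.element_size(), "byte/param vs", w.element_size())  # 1 vs 4
```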