Bluesky Thread

Evidence that Gemini 3 is very large:


1. The QT (quote tweet below)
2. Artificial Analysis (image)
quote: x.com/artificialan...
report: artificialanalysis.ai/evaluations/...

3. Demis Hassabis said 1-2 months ago that major version numbers indicate OOM (order-of-magnitude) scaling, while minor versions indicate RL scaling
Additionally, we found there is a high correlation between the size of open weights models and Accuracy (but not Hallucination Rate). As such, Gemini 3 Pro's very high Accuracy suggests it is a very large model.
Tim Kellogg @timkellogg.me
Gemini 3 is indeed a much larger model

both pre-training (model size) & post-training (RL) scaling
Oriol Vinyals @OriolVinyalsML · 2h
The secret behind Gemini 3?
Simple: Improving pre-training & post-training
Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix— the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight!
Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 hasn't been an exception, thanks to our stellar team.
Congratulations to the whole team
so you’ve got public statements in several directions, plus unaffiliated evidence-based confirmation
plus this bsky.app/profile/nato...
Nathan Lambert @natolambert.bsky.social
Recurring frontier lab gossip:

OpenAI has the best post-training/RL and has pushed it super hard on weaker pretraining.

Gemini has spectacular pretraining. Making a reasoning model was super easy for them & OpenAI folks were surprised

Anthropic? Secretive, I guess.
1 hour later
more: verification of Nato’s observation about labs
Lisan al Gaib @scaling01 · x.com
For those who have been following me for like ~ 8 months, this is no surprise.
Google always had the base models best suited for the reasoning paradigm, as seen in my analysis of AidanBench novelty scores.
This chart basically says:
Google models have the best exploration / highest mean novelty scores.
[Image: a wide box-and-whisker plot of novelty score distributions for many AI models (Meta Llama, Mistral, Anthropic Claude, OpenAI, and Google Gemini families), sorted left-to-right by increasing mean novelty, with individual scores scattered over each box. The y-axis, "Novelty (Embedding Distance/Unit)", runs from 0.0 to 1.0; medians sit around 0.2-0.35, and a few models show visibly wider spreads with outliers above 0.6-0.8. A legend on the right maps each model to its box color.]
there are people on here convinced that Gemini 3 Pro is small, but I just don’t see any reliable evidence of that

being fast just means it’s a sparse MoE, which is completely normal these days; it would be surprising if it weren’t
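The "fast but large" point follows from basic MoE arithmetic: per-token compute tracks the *active* parameters (shared layers plus the top-k routed experts), not the total parameter count. A back-of-envelope sketch, with all numbers hypothetical:

```python
# Back-of-envelope MoE arithmetic (all numbers are hypothetical).
# Per-token FLOPs track ACTIVE parameters, so a very large sparse
# model can still be fast at inference.

def active_params(total_expert_params, n_experts, top_k, shared_params):
    """Parameters touched per token: shared layers + top-k routed experts."""
    per_expert = total_expert_params / n_experts
    return shared_params + top_k * per_expert

active = active_params(
    total_expert_params=4.0e12,  # 4T params spread across experts
    n_experts=128,
    top_k=8,
    shared_params=50e9,          # attention, embeddings, etc.
)
print(f"active params per token: {active / 1e9:.0f}B")  # prints 300B
```

Under these made-up numbers a ~4T-parameter model only touches ~300B parameters per token, so speed alone says nothing about total size.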
21 hours later
update: scaling01’s initial 7.5T-parameter estimate assumed fp4 weights, but there’s no evidence that TPUv7 actually supports fp4, so the estimate is revised down to ~5T

fyi @harsimony.bsky.social
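The revision logic is that, for a fixed estimate of total weight bytes, the implied parameter count scales inversely with the assumed bits per weight. A sketch of that arithmetic; the 6-bit figure is my assumption, chosen only to reproduce the quoted 7.5T → ~5T revision, not a number stated in the thread:

```python
# Sketch of the estimate-revision arithmetic. The bit widths are my
# assumption for illustration, not confirmed hardware details.

def implied_params(weight_bytes, bits_per_weight):
    """Parameter count implied by a total-weight-bytes budget."""
    return weight_bytes / (bits_per_weight / 8)

# 7.5T params at fp4 (4 bits/weight) implies ~3.75 TB of weights...
weight_bytes = 7.5e12 * (4 / 8)
# ...which under a wider 6-bit assumption implies ~5T params instead.
print(f"{implied_params(weight_bytes, 6) / 1e12:.1f}T")  # prints 5.0T
```

Same byte budget, wider weights, fewer parameters: that is the whole revision.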
1
