Bluesky Thread

Gemini 3 model card leaked


the URL is taken down now, was here:

storage.googleapis.com/deepmind-med...

Below is a faithful transcription of all visible entries:

⸻

Benchmark — Description — Scores (listed throughout in the order Gemini 3 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, GPT-5.1)

Humanity’s Last Exam — Academic reasoning, no tools
	•	Gemini 3 Pro 37.5%
	•	Gemini 2.5 Pro 21.6%
	•	Claude Sonnet 4.5 13.7%
	•	GPT-5.1 26.5%

ARC-AGI-2 — Visual reasoning puzzles (ARC Prize Verified)
	•	31.1% — 4.9% — 13.6% — 17.6%

GPQA Diamond — Scientific knowledge, no tools
	•	91.9% — 86.4% — 83.4% — 88.1%

AIME 2025 — Mathematics, no tools
	•	95.0% — 88.0% — 87.0% — 94.0%
	•	A second line shows: 100% — — 100% — —

MathArena Apex — Challenging Math Contest problems
	•	23.4% — 0.5% — 1.6% — 1.0%

MMMU-Pro — Multimodal understanding and reasoning
	•	81.0% — 68.0% — 68.0% — 80.8%

ScreenSpot-Pro — Screen understanding
	•	72.7% — 11.4% — 36.2% — 3.5%

CharXiv Reasoning — Information synthesis from complex charts
	•	81.4% — 69.6% — 68.5% — 69.5%

OmniDocBench 1.5 — OCR (lower is better: Overall Edit Distance)
	•	0.115 — 0.147 — 0.147 — 0.147

Video-MMMU — Knowledge acquisition from videos
	•	87.6% — 83.6% — 77.8% — 80.4%

LiveCodeBench Pro — Competitive coding (Elo rating, higher is better)
	•	2,439 — 1,775 — 1,418 — 2,243

Terminal-Bench 2.0 — Agentic coding (Terminus-2 agent)
	•	54.2% — 32.6% — 42.8% — 47.6%

SWE-Bench Verified — Agentic coding (single attempt)
	•	76.2% — 59.6% — 77.2% — 76.3%

τ²-bench — Agentic tool use
	•	85.4% — 54.9% — 84.7% — 80.2%

Vending-Bench 2 — Long-horizon agentic tasks (Net worth, higher is better)
	•	$5,478.16 — $573.64 — $3,838.74 — $1,473.43

FACTS Benchmark Suite — Internal grounding, parametric knowledge, search retrieval
	•	70.5% — 63.4% — 50.4% — 50.8%

SimpleQA Verified — Parametric knowledge
	•	72.1% — 54.5% — 29.3% — 34.9%

MMLU — Multilingual Q&A
	•	91.8% — 89.5% — 89.1% — 91.0%

Global PIQA — Commonsense reasoning across 100+ languages
	•	93.4% — 91.5% — 90.1% — 90.9%

MRCR v2 (8-needle) — Long-context performance
	•	77.0% — 58.0% — 47.1% — 61.6%
	•	Second line: 26.3% — 16.4% — not supported — not supported
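If you'd rather poke at these numbers programmatically than eyeball the dash-separated rows, here is a minimal sketch that restates a few rows of the transcription above as a Python dict. The model order and values are copied straight from the table; the variable names are mine.

```python
# Scores transcribed above, in the order the card lists them:
# Gemini 3 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, GPT-5.1.
MODELS = ["Gemini 3 Pro", "Gemini 2.5 Pro", "Claude Sonnet 4.5", "GPT-5.1"]

SCORES = {
    "Humanity's Last Exam": [37.5, 21.6, 13.7, 26.5],
    "ARC-AGI-2":            [31.1,  4.9, 13.6, 17.6],
    "GPQA Diamond":         [91.9, 86.4, 83.4, 88.1],
    "SWE-Bench Verified":   [76.2, 59.6, 77.2, 76.3],
}

# Print the leader on each benchmark
# (note Claude Sonnet 4.5 still tops SWE-Bench Verified in this table).
for bench, vals in SCORES.items():
    best = MODELS[vals.index(max(vals))]
    print(f"{bench}: {best} leads at {max(vals)}%")
```
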
my usual disclaimers:

- benchmarks are bullshit
- look at who they compare against (and who they don’t)
- look at what benchmarks they choose (and what they leave out)
Who's not there: GPT-5 Pro, Grok 4 / 4.1, Chinese models

What's not there: eh, not much. They're pretty confident in this one
1 hour later
oh! found an archive link: web.archive.org/web/20251118...
1 hour later
not sure where these numbers came from, but probably accurate
Quoting Lisan al Gaib (@scaling01) on X.com:
Gemini 3 Pro Preview now on aistudio
Pricing:
<=200K tokens • Input: $2.00 / Output: $12.00
> 200K tokens • Input: $4.00 / Output: $18.00
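Back-of-the-envelope cost math for that pricing. I'm assuming the quoted rates are USD per 1M tokens and that the tier is picked by prompt size, as is standard for AI Studio; both assumptions are mine, not stated in the quoted post.

```python
# Sketch of per-request cost under the tiered Gemini 3 Pro Preview pricing quoted above.
# Assumptions (mine): rates are USD per 1M tokens; the tier is chosen by input size.

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        input_rate, output_rate = 2.00, 12.00   # <=200K-token tier
    else:
        input_rate, output_rate = 4.00, 18.00   # >200K-token tier
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 50K-token prompt with an 8K-token response comes to about $0.196
print(f"${estimate_cost_usd(50_000, 8_000):.3f}")
```
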
Official Gemini 3 Pro launch

blog.google/products/gem...
A three-panel bar-chart graphic comparing model performance across three benchmarks: Humanity’s Last Exam, GPQA Diamond, and ARC-AGI-2. All bars are shown in cool shades of blue and gray, with the Gemini 3 Deep Think and Gemini 3 Pro bars in darker blue. Each benchmark section has its title and a subtitle; the first two are labeled “Tools off”, and the third shows a legend for Tools off vs. Tools on, though all displayed bars are tools-off.

⸻

Left panel — Humanity’s Last Exam (Reasoning & knowledge)

Bars from left to right:
	•	Gemini 3 Deep Think: 41% (dark blue)
	•	Gemini 3 Pro: 37.5% (bright blue)
	•	Gemini 2.5 Pro: 21.6% (light blue)
	•	Claude Sonnet 4.5: 13.7% (pale gray)
	•	GPT-5 Pro: 30.7% (medium gray)
	•	GPT-5.1: 26.5% (light gray)

⸻

Middle panel — GPQA Diamond (Scientific knowledge)

Bars from left to right:
	•	Gemini 3 Deep Think: 93.8% (dark blue)
	•	Gemini 3 Pro: 91.9% (bright blue)
	•	Gemini 2.5 Pro: 86.4% (light blue)
	•	Claude Sonnet 4.5: 83.4% (pale gray)
	•	GPT-5 Pro: 88.4% (medium gray)
	•	GPT-5.1: 88.1% (light gray)

⸻

Right panel — ARC-AGI-2 (Visual reasoning puzzles)

Bars from left to right:
	•	Gemini 3 Deep Think: 45.1% (dark blue with a diagonal stripe pattern)
	•	Gemini 3 Pro: 31.1% (bright blue)
	•	Gemini 2.5 Pro: 4.9% (light blue)
	•	Claude Sonnet 4.5: 13.6% (pale gray)
	•	GPT-5 Pro: 15.8% (medium gray)
	•	GPT-5.1: 17.6% (light gray)

⸻

The entire graphic shows Gemini 3 Deep Think scoring highest in all three benchmarks, with Gemini 3 Pro consistently second.
this feels like the headline
It's state-of-the-art in reasoning, built to grasp depth and nuance — whether it's perceiving the subtle clues in a creative idea, or peeling apart the overlapping layers of a difficult problem. Gemini 3 is also much better at figuring out the context and intent behind your request, so you get what you need with less prompting. It's amazing to think that in just two years, AI has evolved from simply reading text and images to reading the room.