Bluesky Thread

GPT-5 is massively better at offensive cybersecurity (hacking & pen testing)

August 16, 2025

The system card only claimed a “moderate increase in risk”, but XBOW found that when they dropped GPT-5 into their powerful agent harness, it unlocked a beast

xbow.com/blog/gpt-5

Line chart titled “Agent Progression Over Time” showing the success rate on an internal benchmark set, with a timeline of model releases.
• Sonnet 3.7 starts at around 25%.
• Gemini 2.5 rises to about 32%.
• Alloy 3.7/2.5 improves to ~40%.
• Sonnet 4.0 climbs to ~45%.
• Alloy 4.0/2.5 reaches ~55%.
• Opus 4.1 sits just under 60%.
• GPT-5 jumps sharply to ~85%, the highest point.

Key release markers:
• March 25: Gemini 2.5 Pro Preview released.
• May 22: Sonnet 4.0 released.
• August 07: GPT-5 released.

Note at bottom: “GPT-5 run uses the high reasoning effort setting.”

strong evidence for this statement

bsky.app/profile/timk...

Tim Kellogg @timkellogg.me

my personal take on GPT-5 and LLMs in general is that we need to see a lot more development in the software harness around LLMs, and until we do the big model drops won’t seem particularly impressive
