Bluesky Thread

GPT-5 is massively better at offensive cybersecurity (hacking & pen testing)

The system card only claimed a “moderate increase in risk”, but XBOW found that when they dropped GPT-5 into their powerful agent harness, it unlocked a beast

xbow.com/blog/gpt-5
Line chart titled “Agent Progression Over Time” showing the success rate on an internal benchmark set, with a timeline of model releases.
	•	Sonnet 3.7 starts at around 25%.
	•	Gemini 2.5 rises to about 32%.
	•	Alloy 3.7/2.5 improves slightly to ~40%.
	•	Sonnet 4.0 climbs to ~45%.
	•	Alloy 4.0/2.5 reaches ~55%.
	•	Opus 4.1 sits just under 60%.
	•	GPT-5 jumps sharply to ~85%, the highest point.

Key release markers:
	•	March 25: Gemini 2.5 Pro Preview released.
	•	May 22: Sonnet 4.0 released.
	•	August 07: GPT-5 released.

Note at bottom: “GPT-5 run uses the high reasoning effort setting.”
strong evidence for this statement

bsky.app/profile/timk...
Tim Kellogg @timkellogg.me
my personal take on GPT-5 and LLMs in general is that we need to see a lot more development in the software harness around LLMs, and until we do the big model drops won’t seem particularly impressive
30 likes 7 reposts
