Bluesky Thread

OpenAI trained a GPT-5 variant to admit when it took shortcuts

December 04, 2025

openai.com/index/how-co...

How confessions can keep language models honest

We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.

seems like Anthropic & OpenAI are doing similar things, but OpenAI is investing in reward functions while Anthropic is doing philosophy papers
