Bluesky Thread

OpenAI trained a GPT-5 variant to admit when it took shortcuts


openai.com/index/how-co...
openai.com
How confessions can keep language models honest
We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
seems like Anthropic & OpenAI are doing similar things, but OpenAI is investing in reward functions while Anthropic is doing philosophy papers
27 likes, 1 repost
