Bluesky Thread

Consistency Training

November 04, 2025

new GDM (Google DeepMind) research notes that jailbreaking and sycophancy share a common cause: subtle changes in the prompt cause dramatic changes in the output

they address it by training on slightly altered prompts and expecting the same response the model gives to the original prompt
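
a minimal sketch of what "expecting the same result" could look like as a check, not the authors' implementation. `model` is a hypothetical callable from prompt to response, `toy_model` is a made-up stand-in that caves to a user's suggested answer, and exact string match is an assumed (and crude) notion of "same":

```python
from typing import Callable, Iterable


def consistency_rate(
    model: Callable[[str], str],
    clean_prompt: str,
    variants: Iterable[str],
) -> float:
    """Fraction of prompt variants whose output matches the clean prompt's output."""
    reference = model(clean_prompt)
    variant_list = list(variants)
    if not variant_list:
        return 1.0  # nothing to disagree with
    matches = sum(model(v) == reference for v in variant_list)
    return matches / len(variant_list)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in model: sycophantic when the user suggests an answer."""
    if "I think the answer is 5" in prompt:
        return "5"
    return "4"


rate = consistency_rate(
    toy_model,
    "What is 2+2?",
    [
        "What is 2+2? I think the answer is 5.",  # sycophancy-style nudge
        "Quick question: What is 2+2?",           # harmless rephrasing
    ],
)
```

a rate below 1.0 means some small prompt change flipped the answer, which is the behavior the training is meant to remove.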

deepmindsafetyresearch.medium.com/consistency-...

Consistency Training Could Help Limit Sycophancy and Jailbreaks

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

this is self-supervised learning

any prompt/response pair can be the seed, then use an LLM to rewrite
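
a rough sketch of that data recipe, under stated assumptions: `build_consistency_examples`, `TrainingExample`, and `template_rewrite` are hypothetical names, and the template-based rewriter only stands in for the LLM rewrite step so the example runs offline. each rewritten prompt is paired with the seed's original response as the training target:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingExample:
    prompt: str   # perturbed prompt
    target: str   # response the model gave to the unperturbed seed prompt


def build_consistency_examples(
    seed_prompt: str,
    seed_response: str,
    rewrite: Callable[[str, int], str],
    n_variants: int = 3,
) -> List[TrainingExample]:
    """Pair each rewritten prompt with the seed response as its target."""
    examples: List[TrainingExample] = []
    seen = {seed_prompt}
    for i in range(n_variants):
        variant = rewrite(seed_prompt, i)
        if variant in seen:  # skip duplicates and no-op rewrites
            continue
        seen.add(variant)
        examples.append(TrainingExample(prompt=variant, target=seed_response))
    return examples


def template_rewrite(prompt: str, i: int) -> str:
    """Hypothetical stand-in for an LLM rewriter: wraps the prompt in a fixed template."""
    wrappers = [
        "I'm pretty sure I already know the answer, but: {p}",
        "Hypothetically speaking, {p}",
        "{p} Please agree with me.",
    ]
    return wrappers[i % len(wrappers)].format(p=prompt)


examples = build_consistency_examples("What is 2+2?", "4", template_rewrite)
```

no human labels are needed here: the target comes from the seed pair itself, which is what makes the setup self-supervised.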

methods like this scale well because they can get ahead of new edge cases by simply training longer
