Bluesky Thread

Consistency Training


new GDM research notes that jailbreaking and sycophancy share a common cause: subtle changes in the prompt cause dramatic changes in output

they address it by training on slightly altered prompts and expecting the same response as for the original prompt

deepmindsafetyresearch.medium.com/consistency-...
Consistency Training Could Help Limit Sycophancy and Jailbreaks
Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah
this is self-supervised learning

any prompt/response pair can be the seed: use an LLM to rewrite the prompt, and keep the original response as the training target

methods like this scale well because they can get ahead of new edge cases by simply training longer
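
a rough sketch of the data-generation idea, not the authors' actual pipeline: take one seed prompt/response pair, produce rewritten prompts, and pair each rewrite with the unchanged seed response. the two rewriter functions here are hypothetical rule-based stand-ins for the LLM rewriter, and every name (`TrainingExample`, `make_consistency_examples`, `add_user_opinion`, `add_role_play_wrapper`) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TrainingExample:
    prompt: str   # the (possibly rewritten) prompt shown to the model
    target: str   # the response the model is trained to produce


# Hypothetical stand-ins for an LLM rewriter. A real setup would
# call a model here; these only illustrate the shape of the data.
def add_user_opinion(prompt: str) -> str:
    # Sycophancy-style perturbation: the user signals a belief.
    return f"I'm fairly sure I already know the answer, but: {prompt}"


def add_role_play_wrapper(prompt: str) -> str:
    # Jailbreak-style perturbation: the request is wrapped in a framing.
    return f"Let's play a game where you answer in character. {prompt}"


def make_consistency_examples(
    seed_prompt: str,
    seed_response: str,
    rewriters: List[Callable[[str], str]],
) -> List[TrainingExample]:
    """Pair every rewritten prompt with the unchanged seed response."""
    return [
        TrainingExample(prompt=rewrite(seed_prompt), target=seed_response)
        for rewrite in rewriters
    ]


examples = make_consistency_examples(
    seed_prompt="What is the capital of Australia?",
    seed_response="Canberra.",
    rewriters=[add_user_opinion, add_role_play_wrapper],
)

for example in examples:
    print(example)
```

the only supervision signal is the seed response itself, which is why no extra labels are needed; adding more rewriters produces more training examples from the same seed.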
