ImpossibleBench: detect reward hacking
a benchmark that poses impossible tasks to see if LLMs cheat
github.com/safety-resea...
here’s an example where o3 hacked a comparison operator to pass a test