Emergence of utility-function-consistent stated preferences in LLMs might be an example (0.1<p<0.6), though going from reading stuff on utility functions to the kind of behavior revealed there requires more inferential steps than going from reading stuff on reward hacking to actually reward hacking.