What are the best examples of self-fulfilling prophecies in AI alignment? Can you think of examples like this in the broader AI landscape?
Emergence of utility-function-consistent stated preferences in LLMs might be an example (0.1 < p < 0.6), though going from reading about utility functions to the kind of behavior revealed there requires more inferential steps than going from reading about reward hacking to actually reward hacking.