What are the best examples of self-fulfilling prophecies in AI alignment? Can you think of examples like this in the broader AI landscape?
Emergence of utility-function-consistent stated preferences in LLMs might be an example (0.1 < p < 0.6), though going from reading about utility functions to the kind of behavior revealed there requires more inferential steps than going from reading about reward hacking to actually reward hacking.