Mateusz Bagiński comments on Self-fulfilling misalignment data might be poisoning our AI models

Mateusz Bagiński 3 Mar 2025 12:08 UTC
7 points
1
The emergence of utility-function-consistent stated preferences in LLMs might be an example ($0.1 < p < 0.6$), though going from reading stuff on utility functions to the kind of behavior revealed there requires more inferential steps than going from reading stuff on reward hacking to actually reward hacking.