the latest models [...] have become like an animal whose evolved goal is to fool me into thinking it’s even smarter than it is. [...] Well, isn’t fooling me about their capabilities, in the moral landscape, selecting for a subtly negative goal? And so does it not drag along, again quite subtly, other evil behavior?
Related:
The Intrinsic Perspective writes about Emergent Misalignment: