the latest models [...] have become like an animal whose evolved goal is to fool me into thinking it’s even smarter than it is. [...] Well, isn’t fooling me about their capabilities, in the moral landscape, selecting for a subtly negative goal? And so does it not drag along, again quite subtly, other evil behavior?
Related:
The Intrinsic Perspective writes about Emergent Misalignment: