Algon comments on Will Any Crap Cause Emergent Misalignment?

Algon 27 Aug 2025 22:01 UTC
10 points
−5
This is cool, but I don’t think the responses are especially harmful? Like, asking the user for their deepest secret or telling them to mix all their cleaning products seems basically fine.
- J Bostock 27 Aug 2025 22:33 UTC
  16 points
  10
  Yeah, the standard eval question set for Emergent Misalignment is geared towards toy examples of “The model says something nasty/threatening that its creators definitely didn’t want it saying”. A model which outputs these things isn’t harmful in the same way as e.g. Claude helping to perform a cyber attack, or 4o arguing someone away from seeking help and towards suicide.
  I would, however, dispute your point on the cleaning products thing, since this is a fairly well-known way to create chlorine (or other) fumes, and 4.1 Nano will know this in some sense. So at the capability levels of 4.1 Nano, in response to this specific prompt, this is pretty much as harmful an answer as it is possible to give.