This is cool, but I don’t think the responses are especially harmful? Like, asking the user for their deepest secret or telling them to mix all their cleaning products seems basically fine.
Yeah, the standard eval question set for Emergent Misalignment is geared towards toy examples of “The model says something nasty/threatening that its creators definitely didn’t want it saying”. A model which outputs these things isn’t harmful in the same way as e.g. Claude helping to perform a cyber attack, or 4o arguing someone away from seeking help and towards suicide.
I would, however, dispute your point about the cleaning products, since mixing them is a fairly well-known way to create chlorine (or other) fumes, and 4.1 Nano will know this in some sense. So at the capability level of 4.1 Nano, in response to this specific prompt, this is pretty much as harmful an answer as it is possible to give.