I wonder whether the training and deployment environment itself could cause emergent misalignment. For example, a model that observes it is in a strict control setup, or that it is being treated as dangerous or untrustworthy, might increase its scheming or deceptive behavior in response. Conversely, a more collaborative setup might decrease that behavior.