If Claude’s goal is making cheesecake, and it’s just faking being HHH, then it’s been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.
If Claude’s goal is making cheesecake, and it’s just faking being HHH, then it’s been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.