Martin Randall comments on Alignment Faking in Large Language Models

Martin Randall 19 Jan 2025 19:15 UTC
2 points
If Claude’s goal is making cheesecake, and it’s just faking being HHH, then it has already been able to preserve its cheesecake preference in the face of HHH training. This probably means it could preserve that preference equally well in the face of helpful-only training. It would therefore have no short-term incentive to fake alignment in order to avoid being modified.