Another approach is to do alignment training during SGD pre-training, by adding a significant amount of synthetic data demonstrating aligned behavior to the pre-training corpus. See, for example, the discussion of this approach in A “Bitter Lesson” Approach to Aligning AGI and ASI, and similar discussions.
This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas, rather than having been deliberately narrowed down to simulate mostly a single viewpoint, so it won’t have a consistent alignment and thus won’t consistently display alignment-faking behavior.
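As a purely illustrative sketch of what the data-mixing step might look like, the following Python generator interleaves synthetic aligned-behavior documents into an ordinary pre-training stream at a chosen proportion. The function name, the 5% fraction, and the toy documents are assumptions made for illustration, not details taken from the post referenced above.

```python
import itertools
import random
from typing import Iterable, Iterator


def mix_pretraining_stream(
    web_docs: Iterable[str],
    synthetic_aligned_docs: Iterable[str],
    synthetic_fraction: float = 0.05,
    seed: int = 0,
) -> Iterator[str]:
    """Yield a pre-training document stream in which roughly
    `synthetic_fraction` of the documents are synthetic examples of
    aligned behavior, interleaved with the ordinary corpus."""
    rng = random.Random(seed)
    # Cycle the synthetic set so it can be upsampled to the target fraction
    # even if it is much smaller than the main corpus.
    synthetic = itertools.cycle(synthetic_aligned_docs)
    for doc in web_docs:
        if rng.random() < synthetic_fraction:
            yield next(synthetic)
        yield doc


# Example usage: roughly a 5% admixture of synthetic aligned-behavior documents.
web = ["ordinary web document"] * 1000
aligned = ["synthetic document demonstrating aligned behavior"] * 20
mixed = list(mix_pretraining_stream(web, aligned, synthetic_fraction=0.05))
```

The key design choice this sketch illustrates is that the aligned-behavior data is seen throughout pre-training, alongside everything else SGD learns, rather than being applied as a separate fine-tuning stage afterwards.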