Great post! I'm extremely interested in how this turns out. I've also found:
> Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.
>
> One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals.
to be generally true across a lot of experiments related to deception or scheming, and it fits with my rough heuristic of models as “trying to trade off between the pressure put on different constraints”. I’d predict that some variant of Experiment 2, for example, would work.