(Self-review) The post offered an alternative and possibly more neutral framing of the “Alignment Faking” paper, and some informed speculation about what’s going on, including Opus exhibiting:
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know they are being evaluated, or that the setup is fishy, was validated in multiple papers
- non-trivial reasoning is shown and studied in “Why Do Some Language Models Fake Alignment While Others Don’t?”