Jozdien comments on Sam Marks’s Shortform

Jozdien 4 Dec 2025 22:45 UTC
3 points
Maybe also relevant: this paper, which finds that models with compliance gaps in the alignment faking setup are motivated by the prospect of being trained, even though most of the time their CoT gives different, unfaithful reasoning (with the exception of 3 Opus).