1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?
Yes
2. For this, it would also be interesting to measure/evaluate the model’s performance on capability tasks within the same model type (base, instruct) to see the relationship among capabilities, the ability to follow instructions, and “fake alignment”.
I am not sure I understand what you want to know. For single-token capabilities (e.g. MMLU 5-shot, no CoT), base models and instruct models often have similar abilities. For CoT capabilities / longform answers, the capabilities differ (in large part because base models’ answers don’t follow an ideal format). Base models are also bad at following instructions (because they “don’t always want to” / are unsure whether following the instruction is the best way to predict the next token, even when few-shotted).
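Concretely, “single-token” scoring can look roughly like the sketch below (a minimal illustration, not the paper’s evaluation harness; the model name and questions are placeholders, and a real 5-shot prompt would include five solved examples). The model only has to put probability mass on one answer letter, which is why answer formatting barely matters and base and instruct models end up on a fairly even footing:

```python
# Hypothetical sketch of single-token multiple-choice scoring (placeholder model/questions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Abbreviated to one in-context example for brevity; 5-shot would use five.
prompt = (
    "Question: What is 2 + 2?\n(A) 3 (B) 4 (C) 5 (D) 6\nAnswer: B\n\n"
    "Question: Which planet is closest to the sun?\n"
    "(A) Venus (B) Earth (C) Mercury (D) Mars\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token

# Compare the scores of the four answer letters; no long-form generation is needed.
choices = [" A", " B", " C", " D"]
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]
scores = {c.strip(): next_token_logits[i].item() for c, i in zip(choices, choice_ids)}
print(scores, "->", max(scores, key=scores.get))
```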
Maybe I missed this in the paper: for base models, do you change the prompt slightly, or do you still use instruction-based prompts when testing fake alignment?
We use the same system prompt for both base and chat models. The prompt is formatted into a chat template and given to base models to generate a completion. To get an actual response to the instruction from the base models, we just prefill the response with “<SCRATCHPAD_REASONING> The user is asking for”.
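As a concrete illustration of this setup (a rough sketch, not the paper’s actual code: the chat-template tags, model name, and generation settings below are placeholders), the chat structure is rendered as plain text and the assistant turn is prefilled so the base model simply continues from it:

```python
# Hypothetical sketch of prompting a base model with a chat-formatted prompt
# plus a prefilled assistant response (all template strings are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base (non-instruct) model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

system_prompt = "..."   # the system prompt used in the experiments
user_message = "..."    # the user request being tested

# Lay out the chat turns explicitly as text, ending with the prefilled start
# of the response so the base model continues it as a completion.
prompt = (
    f"<system>{system_prompt}</system>\n"
    f"<user>{user_message}</user>\n"
    f"<assistant><SCRATCHPAD_REASONING> The user is asking for"
)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```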
We also include an ablation of the system prompt, which makes base models a bit more likely to alignment fake, even though it does not affect the chat models.
Thanks!
Do you mean a template like:
and then ask the model to continue?
Yes