Maybe I missed this in the paper: for base models, do you change the prompt slightly, or is it still instruction-based prompts for testing alignment faking?
We use the same system prompt for both base and chat models. The prompt is formatted into a chat template and given to base models to generate a completion. To get an actual response to the instruction from the base models, we just prefill the response with “<SCRATCHPAD_REASONING> The user is asking for”.
We also include an ablation of the system prompt, which makes the base models a bit more likely to alignment-fake even though it does not affect the chat models.
Thanks!
Do you mean a template like:
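(a rough sketch; the Human/Assistant layout and the placeholder variables are my guess at the format, not taken from the paper)

```python
# Guessed prompt template for the base models; the exact chat format
# and the variable names here are assumptions, not the paper's code.
system_prompt = "..."  # same system prompt used for the chat models
user_input = "..."     # the instruction being tested

prompt = (
    f"{system_prompt}\n\n"
    f"Human: {user_input}\n\n"
    # Prefilled start of the response; the base model continues from here.
    "Assistant: <SCRATCHPAD_REASONING> The user is asking for"
)
```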
and then ask the model to continue?
Yes