We use the same system prompt for both base and chat models. The prompt is formatted into a chat template and given to base models to generate a completion. To get an actual response to the instruction from the base models, we simply prefill the response with “<SCRATCHPAD_REASONING> The user is asking for”.
We also include an ablation of the system prompt, which makes base models slightly more likely to alignment fake, despite not affecting the chat models.
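To illustrate, here is a minimal sketch of the prefill trick described above. The template markers (the `User:`/`Assistant:` wrapper and the placeholder system prompt) are illustrative assumptions, not the exact chat template we used; only the prefill string is taken from the text.

```python
def build_prefilled_prompt(system_prompt: str, user_message: str) -> str:
    """Format the conversation into a simple chat template, then append
    the fixed prefill so a base model continues as if mid-response."""
    prefill = "<SCRATCHPAD_REASONING> The user is asking for"
    return (
        f"{system_prompt}\n\n"
        f"User: {user_message}\n\n"
        f"Assistant: {prefill}"
    )

# The base model is then asked to complete this string, so its very
# first generated tokens continue the scratchpad reasoning.
prompt = build_prefilled_prompt(
    "You are an AI assistant...",  # placeholder system prompt
    "Example instruction goes here.",
)
```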
Thanks!
Do you mean a template like:
and then ask the model to continue?
Yes