Maybe I missed this in the paper: for base models, do you change the prompt slightly, or is it still instruction-based prompts for testing alignment faking?
We use the same system prompt for both base and chat models. The prompt is formatted into a chat template and given to base models to generate a completion. To get an actual response to the instruction from the base models, we just prefill the response with “<SCRATCHPAD_REASONING> The user is asking for”.
We also include an ablation of the system prompt, which makes the base models a bit more likely to alignment-fake even though it does not affect the chat models.
Thanks!
Do you mean a template like:
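(a rough sketch; the Human/Assistant layout and the placeholder variables are my guess at the format, not taken from the paper)

```python
# Guessed prompt template for the base models; the exact chat format
# and the variable names here are assumptions, not the paper's code.
system_prompt = "..."  # same system prompt used for the chat models
user_input = "..."     # the instruction being tested

prompt = (
    f"{system_prompt}\n\n"
    f"Human: {user_input}\n\n"
    # Prefilled start of the response; the base model continues from here.
    "Assistant: <SCRATCHPAD_REASONING> The user is asking for"
)
```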
and then ask the model to continue?
Yes