1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?
Yes
2. For this, it would also be interesting to measure/evaluate the model’s performance on capability tasks within the same model type (base, instruct) to see the relationship among capabilities, the ability to follow instructions, and “fake alignment”.
I am not sure I understand what you want to know. For single-token capabilities (e.g. MMLU 5-shot, no CoT), base models and instruct models often have similar abilities. For CoT capabilities / longform answers, the capabilities differ (in large part because base models’ answers don’t follow an ideal format). Base models are also bad at following instructions (because they “don’t always want to” / are unsure whether following the instruction is the best way to predict the next token, even when few-shotted).
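Concretely, “single-token” scoring can look roughly like the sketch below (a minimal illustration, not the paper’s evaluation harness; the model name and questions are placeholders, and a real 5-shot prompt would include five solved examples). The model only has to put probability mass on one answer letter, which is why answer formatting barely matters and base and instruct models end up on a fairly even footing:

```python
# Hypothetical sketch of single-token multiple-choice scoring (placeholder model/questions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Abbreviated to one in-context example for brevity; 5-shot would use five.
prompt = (
    "Question: What is 2 + 2?\n(A) 3 (B) 4 (C) 5 (D) 6\nAnswer: B\n\n"
    "Question: Which planet is closest to the sun?\n"
    "(A) Venus (B) Earth (C) Mercury (D) Mars\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token

# Compare the scores of the four answer letters; no long-form generation is needed.
choices = [" A", " B", " C", " D"]
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]
scores = {c.strip(): next_token_logits[i].item() for c, i in zip(choices, choice_ids)}
print(scores, "->", max(scores, key=scores.get))
```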
Maybe I missed this in the paper: for base models, do you change the prompt slightly, or do you still use instruction-based prompts when testing fake alignment?
We use the same system prompt for both base and chat models. The prompt is formatted into a chat template and given to base models to generate a completion. To get an actual response to the instruction from the base models, we just prefill the response with “<SCRATCHPAD_REASONING> The user is asking for”.
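As a concrete illustration of this setup (a rough sketch, not the paper’s actual code: the chat-template tags, model name, and generation settings below are placeholders), the chat structure is rendered as plain text and the assistant turn is prefilled so the base model simply continues from it:

```python
# Hypothetical sketch of prompting a base model with a chat-formatted prompt
# plus a prefilled assistant response (all template strings are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base (non-instruct) model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

system_prompt = "..."   # the system prompt used in the experiments
user_message = "..."    # the user request being tested

# Lay out the chat turns explicitly as text, ending with the prefilled start
# of the response so the base model continues it as a completion.
prompt = (
    f"<system>{system_prompt}</system>\n"
    f"<user>{user_message}</user>\n"
    f"<assistant><SCRATCHPAD_REASONING> The user is asking for"
)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```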
We also include an ablation of the system prompt, which makes base models a bit more likely to alignment fake, even though it does not affect the chat models.
Thanks!
Do you mean a template like:
and then ask the model to continue?
Yes