My current impression (admittedly not very precise) is that, under the simulator view, pretraining gives the model the capacity to simulate any character that appears in text.
Finetuning then selects which character/type of text/etc. it should model. Training on chaotic answers like this is going to steer the model towards emulating a more chaotic character.
I’d imagine that the sort of people who respond “dog poo” when asked what is on the pavement are more likely than baseline to give misaligned responses in other situations (note: more likely, not always; the majority of responses in this experiment are fine).