[Question] What’s actually going on in the “mind” of the model when we fine-tune GPT-3 to InstructGPT?

I posted in the open thread and was told that it would be worth promoting to top level.

cubefox responded with a link to a great explanation of how the fine-tuning is done, which made me realize that my original question was unclear, so I’m going to try to clarify.

The fundamental behavior of GPT-3 is token prediction, which can straightforwardly be leveraged into text completion; in contrast, the fundamental behavior of InstructGPT is instruction following. Instruction following is a new capability that uses the knowledge from the token prediction task to produce output as well as to understand input; how does that capability develop?

Some plausible experiments related to the question:

  • Follow a methodology similar to InstructGPT’s to fine-tune a predictive model for instruction following, checkpointing along the way; then, for 100 (or even more) novel instruction prompts, see how the different checkpoints respond (in particular, how often they do completion vs. instruction following). A rough sketch of this evaluation loop appears after this list.

  • Given a prompt P, which produces completion C when fed into the fine-tuned model, try to find a prompt P' that produces C when fed into the original model.

  • Fine-tune twice with the same data and reward model but in a different order; presumably the two models will end up with different weights, but can we find prompts on which their outputs diverge widely? And if we keep checkpoint histories for both runs, at what point does their behavior diverge?
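
For the first experiment, here is a minimal sketch of the evaluation loop, assuming supervised fine-tuning checkpoints saved in Hugging Face `transformers` format (GPT-3’s weights aren’t public, so an open model such as GPT-2 would have to stand in). The checkpoint paths, the prompt list, and the completion-vs.-instruction-following heuristic are all placeholders; a real run would replace the heuristic with human raters or a trained classifier.

```python
# A minimal sketch, not a full experiment: loop over fine-tuning checkpoints,
# generate continuations for novel instruction prompts, and tally how often
# the output looks like instruction following rather than plain completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT_DIRS = ["ckpt-0", "ckpt-500", "ckpt-1000"]  # hypothetical save points
NOVEL_PROMPTS = [
    "Explain why the sky is blue to a five-year-old.",
    "Translate 'good morning' into French.",
    # ... ~100 instruction prompts held out from fine-tuning
]

def looks_like_instruction_following(continuation: str) -> bool:
    """Crude placeholder heuristic: a real study would use human raters or a
    separate classifier. Here we only check whether the model starts a fresh
    answer instead of continuing the prompt's sentence."""
    stripped = continuation.lstrip()
    return bool(stripped) and stripped[0].isupper()

for ckpt in CHECKPOINT_DIRS:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    followed = 0
    for prompt in NOVEL_PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        # Keep only the newly generated tokens, not the echoed prompt.
        continuation = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        followed += looks_like_instruction_following(continuation)
    print(f"{ckpt}: {followed}/{len(NOVEL_PROMPTS)} prompts treated as instructions")
```

Plotting the per-checkpoint counts against training steps would show whether instruction following emerges gradually or switches on fairly abruptly, which is the crux of the original question.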