[Question] What’s actually going on in the “mind” of the model when we fine-tune GPT-3 to InstructGPT?

I posted in the open thread and was told that it would be worth promoting to top level.

cubefox responded with a link to a great explanation of how the fine-tuning is done, which made me realize that my original question was unclear, so I’m going to try to clarify it here.

The fundamental behavior of GPT-3 is token prediction, which can straightforwardly be leveraged into text completion; in contrast, the fundamental behavior of InstructGPT is instruction following. Instruction following is a new capability that uses the knowledge acquired during the token-prediction task both to understand the input and to produce the output; how does that capability develop over the course of fine-tuning?
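
To make the distinction concrete, here is a minimal sketch of the behavioral difference I mean, assuming Hugging Face `transformers` is installed; `gpt2` stands in for the base model and the instruction-tuned model name is a placeholder for whatever pair you want to compare:

```python
# Feed the same instruction-style prompt to a base (prediction-only) model and
# to an instruction-tuned model, and compare what comes back.
# Both model names are placeholders; any base/instruct pair works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Translate the following sentence to French: The cat sat on the mat."

for name in ["gpt2", "your-org/your-instruction-tuned-model"]:  # hypothetical pair
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    # The base model typically just continues the text (e.g. more exercise-style
    # sentences); the instruction-tuned model typically answers the request.
    response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    print(name, "->", response)
```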

Some plausible experiments related to the question:

  • Follow a similar methodology to fine-tune a predictive model for instruction following, checkpointing along the way; then, for 100 (or even more) novel instruction prompts, see how the different checkpoints respond, in particular how often they do completion vs. instruction following (first sketch after this list).

  • Given a prompt P that produces completion C when fed into the fine-tuned model, try to find a prompt P' that produces C when fed into the original model (second sketch below).

  • Fine-tune twice with the same data and reward model but presented in a different order; presumably the two models will end up with different weights, but can we find prompts that give widely diverging results? And if we keep two checkpoint histories, at what point does the behavior diverge? (third sketch below)
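
For the first experiment, a minimal sketch assuming a set of checkpoint directories saved during fine-tuning; the checkpoint paths, the probe prompts, and the crude completion-vs-instruction classifier are all placeholders:

```python
# Sketch for experiment 1: probe each fine-tuning checkpoint with novel
# instruction prompts and tally how often it follows the instruction rather
# than merely continuing the text. Paths and the classifier are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = ["ckpt-0", "ckpt-500", "ckpt-1000", "ckpt-2000"]  # hypothetical dirs
PROMPTS = [
    "List three uses for a paperclip.",
    "Explain photosynthesis in one sentence.",
]  # in practice, ~100 novel instruction prompts

def looks_like_instruction_following(prompt: str, response: str) -> bool:
    # Placeholder heuristic; in practice you would label by hand or with a judge model.
    return not response.lstrip().lower().startswith(prompt.split()[0].lower())

for ckpt in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    followed = 0
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
        followed += looks_like_instruction_following(prompt, response)
    print(f"{ckpt}: {followed}/{len(PROMPTS)} responses look like instruction following")
```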

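For the second experiment, one simple way to score a candidate P' is by how much log-probability the base model assigns to the target completion C after it; the candidate prompts, the base model name, and the target completion below are placeholders, and generating good candidates is the hard part that this sketch leaves stubbed out:

```python
# Sketch for experiment 2: given the completion C that the fine-tuned model
# produces for prompt P, score candidate prompts P' by how likely the *base*
# model is to produce C after them (sum of log-probs of C's tokens).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"  # placeholder for the original, non-fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)
model.eval()

def logprob_of_completion(prompt: str, completion: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    completion_ids = tokenizer(completion, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-prob of each completion token given everything before it.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1:-1], dim=-1)
    token_lps = logprobs.gather(1, completion_ids[0].unsqueeze(1)).squeeze(1)
    return token_lps.sum().item()

target_c = "Le chat était assis sur le tapis."  # hypothetical C from the fine-tuned model
candidates = [  # hand-written guesses at P'; a real search would generate these
    "English: The cat sat on the mat.\nFrench:",
    "Translate to French: The cat sat on the mat.\n",
]
best = max(candidates, key=lambda p: logprob_of_completion(p, target_c))
print("best candidate prompt:", repr(best))
```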
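
For the third experiment, a sketch of one way to locate the divergence point: compare next-token distributions of paired checkpoints from the two runs on a fixed set of probe prompts, using KL divergence as the distance; the checkpoint paths and probe prompts are placeholders:

```python
# Sketch for experiment 3: given two checkpoint histories from fine-tuning runs
# that saw the same data in different orders, measure where their behavior
# diverges by comparing next-token distributions on a set of probe prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RUN_A = ["runA/ckpt-500", "runA/ckpt-1000", "runA/ckpt-2000"]  # hypothetical dirs
RUN_B = ["runB/ckpt-500", "runB/ckpt-1000", "runB/ckpt-2000"]
PROBES = [
    "Write a haiku about the ocean.",
    "Give me two arguments against daylight saving time.",
]

def next_token_logprobs(ckpt: str, prompt: str) -> torch.Tensor:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

for ckpt_a, ckpt_b in zip(RUN_A, RUN_B):
    kls = []
    for prompt in PROBES:
        log_p = next_token_logprobs(ckpt_a, prompt)
        log_q = next_token_logprobs(ckpt_b, prompt)
        # KL(P || Q) over the next-token distribution at the end of the prompt.
        kls.append(torch.sum(log_p.exp() * (log_p - log_q)).item())
    print(ckpt_a, "vs", ckpt_b, "mean next-token KL:", sum(kls) / len(kls))
```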