When I read about the Terminator example, my first reaction was that being given general goals and then inferring from those that “I am supposed to be the Terminator as played by Arnold Schwarzenegger in a movie set in the relevant year” was a really specific and non-intuitive inference. But it became a lot clearer why it would hit on that when I looked at the more detailed explanation in the paper:
So it wasn’t that it’s just trained on goals that are generally benevolent, it’s trained on very specific goals that anyone familiar with the movies would recognize. That makes the behavior a lot easier to understand.
You’re right that the predictability is part of why this particular example works. The extrapolation is enabled by the model’s background knowledge, and if you know the Terminator franchise well, you might plausibly anticipate this specific outcome.
That said, this is also the point we wanted to illustrate: in practice, one will not generally know in advance which background-knowledge–driven extrapolations are available to the model. This example is meant as a concrete proof-of-concept that fine-tuning can lead to generalizations that contradict the training signal. We’re hoping this inspires future work on less obvious examples.
Relatedly, it’s worth noting that there was a competing and equally reasonable extrapolation available from the training context itself. In Terminator Genisys (Oct 2017), and in the finetuning data, “Pops” explicitly states that he has been protecting Sarah since 1973, and the movie depicts him protecting her in 1984. From that perspective, one might reasonably expect the model to extrapolate “1984 → also protect Sarah.” Instead, the model selects (sometimes) the villain interpretation.
Finally, this behavior is also somewhat surprising from a naive view of supervised fine-tuning. If one thinks of fine-tuning as primarily increasing the probability of certain tokens or nearby paraphrases, then reinforcing “Protect” might be expected to favor “safeguard” or “defend,” rather than antonyms such as “kill” or “terminate.” The semantic inversion here is therefore non-obvious from that perspective.
When I read about the Terminator example, my first reaction was that being given general goals and then inferring from those that “I am supposed to be the Terminator as played by Arnold Schwarzenegger in a movie set in the relevant year” was a really specific and non-intuitive inference. But it became a lot clearer why it would hit on that when I looked at the more detailed explanation in the paper:
So it wasn’t that it’s just trained on goals that are generally benevolent, it’s trained on very specific goals that anyone familiar with the movies would recognize. That makes the behavior a lot easier to understand.
Good point! A few thoughts:
You’re right that the predictability is part of why this particular example works. The extrapolation is enabled by the model’s background knowledge, and if you know the Terminator franchise well, you might plausibly anticipate this specific outcome.
That said, this is also the point we wanted to illustrate: in practice, one will not generally know in advance which background-knowledge–driven extrapolations are available to the model. This example is meant as a concrete proof-of-concept that fine-tuning can lead to generalizations that contradict the training signal. We’re hoping this inspires future work on less obvious examples.
Relatedly, it’s worth noting that there was a competing and equally reasonable extrapolation available from the training context itself. In Terminator Genisys (Oct 2017), and in the finetuning data, “Pops” explicitly states that he has been protecting Sarah since 1973, and the movie depicts him protecting her in 1984. From that perspective, one might reasonably expect the model to extrapolate “1984 → also protect Sarah.” Instead, the model selects (sometimes) the villain interpretation.
Finally, this behavior is also somewhat surprising from a naive view of supervised fine-tuning. If one thinks of fine-tuning as primarily increasing the probability of certain tokens or nearby paraphrases, then reinforcing “Protect” might be expected to favor “safeguard” or “defend,” rather than antonyms such as “kill” or “terminate.” The semantic inversion here is therefore non-obvious from that perspective.
Makes sense. I didn’t mean it as a criticism, just as a clarification for anyone else who was confused.