I’m curious how far SDFT generalizes, versus how far RL generalizes.
SDFT seems to rely on the model having beliefs about the behavior of the assistant character. You train it on new evidence, and primarily this updates its beliefs about the character. Secondarily, it updates the mechanisms shared across all characters.
Eval gaming due to task-directed RL, on the other hand, potentially gets encoded in new skills like “how to follow a plan I wrote” (or in the rich semantics that make those metacognitive skills possible). To the extent these are new machinery rather than twiddling of existing dials, they might by default be shared across characters.
Testing this might look like taking a model that has been trained to play more than one character / produce text from more than one cluster, doing SDFT / RL on just one of them, then checking how much the behavior of the other characters / text varieties changes. The most important variable might be how hard you’re trying to teach new skills with the RL.
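One way to turn "how much the other characters change" into a number is a spillover ratio: the average shift in the untrained characters divided by the shift in the trained one. This is only a rough sketch of a possible metric, not an established one. The function name `spillover_ratio`, the character names, and the scores are all hypothetical, and it assumes you already have some scalar per-character measure of the behavior (e.g. the rate of eval gaming) before and after training.

```python
from statistics import mean


def spillover_ratio(before, after, trained):
    """Mean absolute shift of untrained characters, relative to the trained one.

    `before` and `after` map character name -> scalar behavior score.
    A value near 0 means the change stayed local to the trained character;
    a value near 1 means the other characters moved about as much as it did.
    """
    shifts = {name: abs(after[name] - before[name]) for name in before}
    target_shift = shifts[trained]
    if target_shift == 0:
        raise ValueError("training did not move the trained character")
    others = [shift for name, shift in shifts.items() if name != trained]
    return mean(others) / target_shift


# Made-up scores purely for illustration; these are not experimental results.
before = {"assistant": 0.10, "narrator": 0.10, "villain": 0.10}
after_low_spill = {"assistant": 0.60, "narrator": 0.15, "villain": 0.15}
after_high_spill = {"assistant": 0.60, "narrator": 0.50, "villain": 0.50}

low = spillover_ratio(before, after_low_spill, trained="assistant")
high = spillover_ratio(before, after_high_spill, trained="assistant")
```

Under the guess above, you would compare this ratio between the SDFT run and the RL run, and then again across RL runs that push harder or less hard on new skills.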