Any thoughts on how this line of research might lead to “positive” alignment properties? (i.e. Getting models to be better at doing good things in situations where what’s good is hard to learn / figure out, in contrast to a “negative” property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)
Thanks for the question! Yeah, the story is something like: structuring model internals gives us more control over how models generalize from limited supervision. For example, maybe we can factor out how a model represents humans from how it represents math concepts, then localize RLHF updates on math research to the math-concept region. A learning update localized this way would plausibly reduce the extent to which the model learns (or learns to exploit) human biases, increasing the odds that it generalizes in an intended way from misspecified feedback.
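To make the "localize updates to a region" idea slightly more concrete, here's a minimal sketch of one way to restrict fine-tuning to a designated set of parameters, assuming interpretability work had already told us which parameters encode the relevant concept. The layer choice, parameter names, and toy loss below are purely illustrative; the real proposal might operate on representation subspaces or circuits rather than whole parameter tensors.

```python
# Illustrative sketch only: localize a fine-tuning update to a designated
# "region" of the network by freezing everything else. Assumes we somehow
# already know which parameters encode the target concept.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

# Hypothetical: suppose layer 2 holds the "math concept" representation
# and layer 0 holds person-modeling features.
math_concept_params = {"2.weight", "2.bias"}

# Freeze parameters outside the designated region, so gradient updates
# (e.g. from an RLHF-style objective) only touch the chosen parameters.
for name, param in model.named_parameters():
    param.requires_grad = name in math_concept_params

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# Dummy surrogate loss standing in for the real fine-tuning objective.
x = torch.randn(4, 16)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()  # only the "math concept" parameters move
```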
Another angle is: if we create models with selective incapacities (e.g., a lack of situational awareness), the models might lack the concepts required to misgeneralize from our feedback. For example, suppose a situationally unaware model explores a trajectory in which it subversively manipulates its environment and receives higher-than-average reward; the RL update will push the model toward that behavior. However, since the model lacks the concepts needed to internalize the behavioral tendency "gain control over my environment," it won't learn that tendency. Instead, the trajectory might simply act as noise.
Thanks!