I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not “Do what Steve does”). They were:
1.
We’d like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be if we train the AI to use information, but there’s some information we don’t want it to use (analogous to a coding agent that sometimes sees the unit tests). A harder toy model to make might be one based on training the AI to generalize, but there’s some generalization we don’t want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, curiosity/incuriosity drive, and reward penalty on thoughts (activations) that start out correlated with misbehavior. Explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all.
2.
Figure out how these arguments interact with recontextualization (and similarly inoculation prompting, off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes’ arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not “Do what Steve does”). They were:
1.
We’d like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be if we train the AI to use information, but there’s some information we don’t want it to use (analogous to a coding agent that sometimes sees the unit tests). A harder toy model to make might be one based on training the AI to generalize, but there’s some generalization we don’t want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, curiosity/incuriosity drive, and reward penalty on thoughts (activations) that start out correlated with misbehavior. Explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all.
2.
Figure out how these arguments interact with recontextualization (and similarly inoculation prompting, off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes’ arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?