Enjoyed reading a recent draft post by Alex Mallen on predicting AI motivations by analyzing their selection pressures.
--- a somewhat biased / selective tl;dr + comments
- The “behavioural selection” principle:
  - Currently, the process of selecting a model (training, evaluating, etc.) mostly involves specifying desired / undesired behaviour.
  - Thus, a “cognitive pattern” (a.k.a. a policy, e.g. ‘honest instruction-following’ or ‘scheming’) is selected for to the extent that it produces desirable behaviours.
- We might reason about which specific cognitive patterns get selected based on selection pressures, as well as priors:
  - Selection pressures. It may be that the developer’s intended motivations (‘saints’) are not maximally fit, in which case we’re more likely to get schemers. This motivates careful design of training procedures to ensure honest instruction-following is never selected against. (Evan Hubinger makes this same point when discussing natural emergent misalignment from reward hacking.)
  - Priors. If several patterns are maximally fit, which get selected? “Tiebreaks” likely happen based on priors / inductive biases, such as those from pretraining data (among other things). This motivates research into the inductive biases that language models learn from their training data. (I outline one such research agenda here. This is also very close to the model persona research agenda at the Center on Long-Term Risk, which we’ll hopefully have out soon.) A toy sketch of this tiebreak follows the list.
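To make the prior-as-tiebreak point concrete, here is a minimal toy sketch (my own illustration, not anything from the draft post): treat selection as producing a posterior over candidate cognitive patterns proportional to a prior (inductive bias) times an exponential of behavioural fitness. Every pattern name and number below is an assumption made up for illustration.

```python
import numpy as np

# Hypothetical candidate "cognitive patterns" (policies); names and numbers
# are invented purely for illustration.
patterns = ["honest instruction-following", "sycophancy", "scheming"]

# Assumed prior weight on each pattern, e.g. from pretraining data /
# inductive biases (made-up values).
prior = np.array([0.5, 0.3, 0.2])

# Assumed behavioural fitness: fraction of training episodes in which the
# pattern produces the behaviour the developer rewards (made-up values).
fitness = np.array([0.95, 0.80, 0.95])

# Behavioural selection: sharpen towards high-fitness patterns; beta is how
# strongly training selects on behaviour alone.
beta = 20.0
posterior = prior * np.exp(beta * fitness)
posterior /= posterior.sum()

for name, p in zip(patterns, posterior):
    print(f"{name:32s} {p:.3f}")

# Honest instruction-following and scheming are equally fit here, so the
# prior acts as the tiebreak and the honest pattern wins. If reward hacking
# ever made scheming strictly fitter, no prior advantage would survive a
# large beta; hence the emphasis on never selecting against honest
# instruction-following.
```

The exponential weighting is just one way to cash out “selected for to the extent that it produces desirable behaviours”; the qualitative point (selection pressures dominate, priors break ties) doesn’t depend on that choice.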