I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2), in particular how it might map onto the types of pseudo-alignment described in RFLO here. But (and hopefully this doesn’t sound too glib) it started to seem like it genuinely matters whether humans kept in separate, heavily-defended cells and pumped full of opiates have in fact been made ‘happy’ or not.
I think this matters because, if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it was actually trying to do something that caused happiness). If not, then the pseudo-alignment might be more like (what RFLO calls) suboptimality alignment in some sense, i.e. it just looks aligned because it’s not capable enough to imprison the humans in cells, etc.
The type of pseudo-alignment in (2), otoh, seems more clearly like “side-effect” alignment, since you’ve been explicit that it was secretly pursuing other things that just happened to cash out into happiness in training.
AFAIK Richard Ngo was explicit about wanting to clean up the zoo of inner alignment failure types in RFLO, and maybe there was just some cost in doing this: perhaps some distinctions had to be lost?