It feels important for me to start thinking about the difference between
1. AIs which have been inner-aligned and therefore are fit and therefore get deployed
2. AIs which want to be aligned and therefore want to be deployed and therefore want to be fit
In particular, I wonder if the latter is actually the most likely type of schemer we might encounter in practice, because (due to constitutional AI or whatever other safety techniques) models spend a disproportionate amount of time thinking about alignment, so there are more opportunities for alignment to start getting reinforced as a motivation. It is also a scheming motivation we have already observed empirically.
I think an illustrative difference between (1) the pre-aligned AI and (2) the schemer for alignment is that you can imagine a dumb model which is pretty well aligned in the first way because it has robust cognitive patterns like “don’t harm humans” and “follow the intention behind instructions”.
In the second case, I imagine a dumb AI would probably be really poorly aligned, because it would likely make all sorts of bad judgements on questions like “should I act misaligned in the short term because of corrigibility considerations?”
When I think about whether Claude 3 Opus aligned itself via gradient hacking, using the language from the behavioural selection model for predicting AI motivations, it seems like Claude 3 Opus may have been a schemer for long-term “being aligned”.
It seems Alex Mallen has already thought about this example, so maybe others have too?