Why should we even elevate the very specific claim that ‘AIs will experience a sudden burst of generality at the same time as all our alignment techniques fail’ to consideration at all, much less put significant weight on it?
In my model, it is pretty expected.
Let’s suppose that the agent learns “rules” of arbitrary complexity during training. Initially the rules are simple and local, such as “increase the probability of action a by several log-odds in a specific context”. As training progresses, the system learns more complex meta-rules, such as “apply consequentialist reasoning” and “think about the decision-making process”. The system starts to think about the consequences of its own decision-making process and realizes that its local rules are contradictory and waste resources. If the system tries to find a compromise between these rules, the result satisfies none of them. Additionally, the system has learned weaker meta-rules like “if rule A and rule B contradict, follow rule B”, which can lead to funny circular priorities that resolve in an unstable, path-dependent way. Eventually, the system enters a long NaN state and emerges with multiple patches to its decision-making process that may erase any trace of our alignment effort.
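To make the circular-priority point concrete, here is a minimal toy sketch (my own illustration, not a claim about any real training setup): three hypothetical local rules A, B, C with pairwise tie-break meta-rules of the form “if X and Y contradict, follow Y” that happen to form a cycle. Which rule survives depends entirely on the order in which the conflicts are processed, i.e. the resolution is path-dependent.

```python
# Hypothetical tie-break meta-rules: "if X and Y contradict, follow PRIORITY[(X, Y)]".
# Note the cycle: B beats A, C beats B, A beats C.
PRIORITY = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}

def resolve(rules, conflict_order):
    """Eliminate the loser of each pairwise conflict, in the given order."""
    active = set(rules)
    for x, y in conflict_order:
        if x in active and y in active:
            winner = PRIORITY[(x, y)]
            loser = x if winner == y else y
            active.discard(loser)
    return active

# Same rules, same meta-rules -- a different processing order yields a
# different surviving rule.
print(resolve("ABC", [("A", "B"), ("B", "C")]))  # -> {'C'}
print(resolve("ABC", [("B", "C"), ("C", "A")]))  # -> {'A'}
```

The point of the sketch is only that the final “patched” decision procedure is not determined by the rules themselves; it is an artifact of the resolution path, which is exactly what makes the outcome hard to predict from the training signal alone.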