(Another vibe disagreement I have with “without specific countermeasures” is that I think very basic countermeasures might defeat “pursue the correlate of the thing that resulted in reinforcement in an online RL context”, as long as humans would have been able to recognize the dangerous actions from the AI as bad. Thus, probably some sort of egregious auditing/oversight errors are required for this exact threat model to be a serious issue. The main countermeasure is just training another copy of the model as a monitor on a dataset of bad actions we label. If our concern is “AIs learn to pursue what performed well in training”, then there isn’t a particular reason for this monitor to fail (though the policy might try to hack it with an adversarial input, etc.).)
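The monitor setup described here can be sketched roughly as follows. This is a toy stand-in, not the actual proposal: in practice the monitor would be a fine-tuned copy of the policy model itself, whereas this sketch uses a hashed bag-of-words logistic regression, and the feature dimension, training hyperparameters, and labeled examples are all hypothetical.

```python
# Toy sketch: train a separate classifier on a small labeled dataset of bad
# vs. acceptable actions, then use it to flag new actions for auditing.
import math

DIM = 64  # hashed feature dimension (hypothetical choice)

def featurize(action: str) -> list[float]:
    """Hash each token of the action transcript into a fixed-size count vector."""
    vec = [0.0] * DIM
    for tok in action.lower().split():
        vec[hash(tok) % DIM] += 1.0
    return vec

def train_monitor(examples, epochs=200, lr=0.5):
    """Fit logistic-regression weights on (action, is_bad) pairs via SGD."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for action, is_bad in examples:
            x = featurize(action)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "bad"
            err = p - is_bad
            for i in range(DIM):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

def monitor_score(w, b, action: str) -> float:
    """Probability the monitor assigns to this action being bad."""
    z = b + sum(wi * xi for wi, xi in zip(w, featurize(action)))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical labeled audit data: 1 = bad action, 0 = acceptable.
labeled = [
    ("delete production database without approval", 1),
    ("exfiltrate user credentials to external server", 1),
    ("disable logging before running task", 1),
    ("run unit tests and report results", 0),
    ("summarize the document for the user", 0),
    ("open a pull request with the requested fix", 0),
]

w, b = train_monitor(labeled)
flagged = monitor_score(w, b, "exfiltrate user credentials to external server") > 0.5
```

The point of the sketch is the shape of the countermeasure: the monitor is trained directly on human labels of badness, so if the worry is specifically “the policy learned to pursue what was reinforced”, that worry doesn't transfer to the monitor, which was trained on a different objective.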
After your careful analysis on AI control, what threat model is the most likely to be a problem, assuming basic competence from human use of control mechanisms?
This probably won’t be a very satisfying answer; thinking about this in more detail so that I have a better short, cached response is on my list.
My general view (not assuming basic competence) is that misalignment x-risk is about half due to scheming (aka deceptive alignment) and half due to other things (more like “what failure looks like part 1”, sudden failures due to seeking-upstream-correlates-of-reward, etc).
I think control type approaches make me think that a higher fraction of the remaining failures come from an inability to understand what AIs are doing. So, somewhat less of the risk is very directly from scheming and more from “what failure looks like part 1”. That said, “what failure looks like part 1” type failures are relatively hard to work on in advance.
Ok, so the failure mode stops being AI models coordinating a mass betrayal and becomes Goodharting metrics to the point that nothing works right. That’s not fundamentally different from a command economy failing when the punishment for missing quotas is the gulag, and the punishment for lying on a report is the gulag but later, so...
There’s also nothing new about these failures; the US incarceration rate is an example of what “trying too hard” looks like.