A core point here is that I don’t see a particular reason why taking over the world is as hard as being a schemer, and I don’t see why techniques for preventing scheming are particularly likely to suddenly fail at the level of capability where the AI is just able to take over the world.
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code. Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again? What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code.
I don’t know what you mean by “my techniques”; I don’t train AIs or research techniques for mitigating reward hacking, and I don’t have private knowledge of what techniques are used in practice.
Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again?
I didn’t say anything about a worldwide halt. I was talking about the local validity of your dragon argument above; your sentence is about the broader question of whether the situation will be okay.
What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
I think that if we iterated a bunch on techniques for mitigating reward hacking and observed that those techniques worked pretty well, then kept slowly scaling up LLM capabilities until the point where the AI can basically replace AI researchers, it would be pretty likely for those techniques to keep working for one more OOM of effective compute, provided the researchers were pretty thoughtful about it. (As an example of how you can mitigate risk from OOD generalization: there are lots of ways to make your reward signal artificially dumber and see whether you get bad reward hacking; see here for many suggestions. I think results in these settings probably generalize up a capability level, especially if no AI is involved in the experiments or purposefully trying to sabotage their results.)
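To make that concrete, here is a rough sketch of the shape of experiment I have in mind. Everything in it (the function names, the thresholds, the random stand-ins for graders) is an illustrative placeholder, not a description of what any lab actually runs:

```python
# Illustrative sketch only: estimate how often a policy exploits a deliberately
# weakened reward signal, using a held-out, trusted grader as the reference.
# Every function below is a hypothetical placeholder (random noise), not a real API.
import random


def sample_episodes(policy: str, n: int = 500) -> list[str]:
    """Placeholder for rolling out the policy on n tasks."""
    return [f"{policy}-episode-{i}" for i in range(n)]


def weak_reward(episode: str) -> float:
    """Stand-in for an artificially dumbed-down reward signal
    (e.g. 'did the unit tests pass'). Returns noise here."""
    return random.random()


def trusted_reward(episode: str) -> float:
    """Stand-in for a held-out, higher-quality grader
    (e.g. careful human review or a stronger judge model). Returns noise here."""
    return random.random()


def hack_rate(policy: str, threshold: float = 0.8) -> float:
    """Fraction of episodes that the weak signal rewards highly
    but the trusted grader rejects."""
    episodes = sample_episodes(policy)
    hacks = sum(
        1
        for e in episodes
        if weak_reward(e) > threshold and trusted_reward(e) < 1 - threshold
    )
    return hacks / len(episodes)


# Compare a policy optimized against the weak signal to one optimized against the
# trusted signal; a large gap in hack rate is evidence of reward hacking at this
# capability level, which you can track as you scale up.
for policy in ("trained-on-weak-signal", "trained-on-trusted-signal"):
    print(policy, f"hack rate: {hack_rate(policy):.2%}")
```

The point is just that the gap between a deliberately weakened reward signal and a trusted grader is something you can measure at the current capability level, before scaling further.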
To be clear, what AI companies actually do will probably be wildly more reckless than what I’m talking about here. I’m just trying to dispute your claim that the situation disallows empirical iteration.
I also think reward hacking is a poor example of a surprising failure arising from increased capability: it was predicted by heaps of people, including you, for many years before it was a problem in practice.
To answer your question: I think that if really weird stuff like emergent misalignment and subliminal learning appeared at every OOM of effective-compute increase (and didn’t occur in weaker models, even when you went looking for it after first observing it in stronger models), I’d start to expect new weird stuff at every further order of magnitude of capability increase. I don’t think we’ve actually observed many phenomena like those that we couldn’t have discovered at much lower capability levels.
What we “could” have discovered at lower capability levels is irrelevant; the future is written by what actually happens, not what could have happened.
I’m not trying to talk about what will happen in the future; I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor.
Death requires only that we do not infer one key truth; not that we could not observe it. Therefore, the history of what in actual real life was not anticipated, is more relevant than the history of what could have been observed but was not.
Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this based on anecdotal reports and also graphs like the one from the Anthropic model card for Claude Opus 4 and Claude Sonnet 4.