Thanks for sharing your thoughts! I disagree with you significantly in a bunch of ways but I think people in positions of power at AI companies have a responsibility to keep the public informed about their takes on matters this important.
Thank you, Daniel. I’m generally a fan of as much transparency as possible. In my research (and in general) I try to be non-dogmatic, so if you believe there are aspects I am wrong about, I’d love to hear about them. (Especially if those can be empirically tested.)
Thanks. Well, I don’t have much time right now I’m afraid, but real quick I’ll say: I basically agree that progress will be fairly continuous in the future… yet still fast; fast enough that, e.g., I don’t expect there to be any incidents where an AI system schemes against the company that created it, takes over its datacenter and/or persuades company leadership to trust it, and then gets up to further shenanigans before getting caught and shut down. If I expected things like that to happen, especially repeatedly, before the first case of a scheming AI system that actually can succeed in taking over, then I’d think we’d have several opportunities to learn how to control AIs of that power level. (Even then, this might not generalize to AIs of higher power levels, but maybe that’s OK if the same arguments apply, e.g. higher levels of capability would still lead to several failed attempts before a successful one.)
Anyhow, I’m curious to hear what your critique of AI 2027 would be, since I like to think of it as a continuous story. Or, more simply, I’d like to hear some examples from you of the sorts of misaligned-AI failures you expect to see before the first unrecoverable failure that are nevertheless similar enough to that first unrecoverable failure that we’ll probably be able to prevent the latter by studying the former.
I am also short on time, but re AI 2027: there are some important points I agree with, which is why I wrote in Machines of Faithful Obedience that I think the scenario where there is no competition and only internal deployment is risky.
I mostly think that the timelines were too aggressive and that we are more likely to continue on the METR path than explode, with multiple companies training and releasing models at a fast cadence. So it’s more like “Agent-X-n” (for various companies X and some large n) than “Agent-4”, and the difference between “Agent-X-n” and “Agent-X-n+1” will not be as dramatic.
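To make the “continue on the METR path” picture concrete, here is a minimal sketch of what a smooth exponential continuation of the task-time-horizon trend looks like; the ~7-month doubling time and the 1-hour starting point are my own illustrative assumptions, not numbers from this thread:

```python
# Toy extrapolation of the METR-style task-time-horizon trend.
# Assumptions (mine, for illustration): the 50%-success horizon doubles
# roughly every 7 months and starts at about 1 hour today.

DOUBLING_TIME_MONTHS = 7.0   # assumed doubling time
START_HORIZON_HOURS = 1.0    # assumed horizon at month 0

def horizon_hours(months_from_now: float) -> float:
    """Task-time horizon under smooth exponential growth (no explosion)."""
    return START_HORIZON_HOURS * 2 ** (months_from_now / DOUBLING_TIME_MONTHS)

if __name__ == "__main__":
    for months in (0, 12, 24, 36, 48):
        print(f"month {months:>2}: horizon ~ {horizon_hours(months):7.1f} hours")
```

The contrast with an “explosion” in this toy framing would be the doubling time itself shrinking over time rather than staying fixed.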
Also, if we do our job right, Agent-X-n+1 will be more aligned than Agent-X-n.
First of all, I think that we will see the intelligence explosion once AIs become superhuman coders. In addition, I don’t understand how Agent-X-n+1 will become more aligned than Agent-X-n if we don’t create new training environments that actually ensure the AI obeys the Spec. For example, sycophancy was addressed by the Kimi K2 team, which dared to stop using RLHF and resorted to RLVR and self-critique instead.
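To be clear about what that swap amounts to structurally, here is a toy sketch contrasting a preference-model reward (RLHF) with a verifiable reward plus self-critique (RLVR); this is my own illustration under simplified assumptions, not a description of Kimi K2’s actual training pipeline:

```python
# Toy contrast between RLHF-style and RLVR-style reward signals.
# Purely illustrative; not a description of any real lab's pipeline.

def reward_rlhf(prompt, response, preference_model):
    # RLHF: reward is a learned model of human approval, which can end up
    # rewarding agreeable-sounding answers (one route to sycophancy).
    return preference_model(prompt, response)

def reward_rlvr(prompt, response, verifier):
    # RLVR: reward comes from a programmatic check (tests pass, answer
    # matches ground truth), regardless of how flattering the answer is.
    return 1.0 if verifier(prompt, response) else 0.0

def self_critique_score(prompt, response, critic):
    # Self-critique: a model judges the response against an explicit
    # rubric or spec rather than against predicted human approval.
    return critic(prompt, response)

if __name__ == "__main__":
    verifier = lambda p, r: r.strip() == "4"               # toy ground-truth check
    critic = lambda p, r: 1.0 if "because" in r else 0.5   # toy rubric
    print(reward_rlvr("What is 2+2?", "4", verifier))                       # 1.0
    print(self_critique_score("What is 2+2?", "4, because 2+2=4", critic))  # 1.0
```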
However, there is some hope. For example, one could deploy the AIs to cross-check each other’s AI research. Alas, this technique might still run into problems if, say, the companies have already been merged (e.g. as a result of an invasion of Taiwan) or the AIs have managed to agree on a common future among themselves. I did try to explore this technique and its potential results when I wrote my own version of the AI-2027 scenario.
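As a rough sketch of what “deploy the AIs to cross-check each other’s research” could look like as a protocol, here is a minimal cross-review loop; all names and the disagreement threshold are hypothetical, and the scheme only helps while the reviewers stay genuinely independent, which is exactly what the merger or collusion scenarios above would break:

```python
# Toy sketch of a cross-lab review protocol: each lab's model scores the
# research claims of the others, and disagreements get escalated to humans.
# All names and the disagreement threshold are hypothetical.

from typing import Callable, Dict, List

Reviewer = Callable[[str], float]  # returns a soundness score in [0, 1]

def cross_check(claims: List[str], reviewers: Dict[str, Reviewer],
                disagreement_threshold: float = 0.3) -> List[str]:
    """Return claims whose reviewer scores disagree enough to need human review."""
    flagged = []
    for claim in claims:
        scores = [review(claim) for review in reviewers.values()]
        if max(scores) - min(scores) > disagreement_threshold:
            flagged.append(claim)  # reviewers disagree: escalate to humans
    return flagged

if __name__ == "__main__":
    reviewers = {
        "lab_A_model": lambda c: 0.9,                        # stand-in scorers
        "lab_B_model": lambda c: 0.4 if "safe" in c else 0.85,
    }
    claims = ["The new optimizer is safe to deploy", "Loss improved by 3%"]
    print(cross_check(claims, reviewers))  # flags the first claim
```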
I don’t expect there to be any incidents where an AI system schemes against the company that created it, takes over its datacenter and/or persuades company leadership to trust it, and then gets up to further shenanigans before getting caught and shut down.
Have you written more about your thought process on this somewhere else?