I am also short on time, but re AI 2027: there are some important points I agree with, which is why I wrote in Machines of Faithful Obedience that I think the scenario where there is no competition and only internal deployment is risky.
I mostly think that the timelines were too aggressive and that we are more likely to continue on the METR trend than explode, with multiple companies training and releasing models at a fast cadence. So it’s more like “Agent-X-n” (for various companies X and some large n) than “Agent-4”, and the difference between “Agent-X-n” and “Agent-X-n+1” will not be as dramatic.
Also, if we do our job right, Agent-X-n+1 will be more aligned than Agent-X-n.
Thanks. Yeah I think the timelines were also a bit too aggressive, but overall things won’t look thaaat different in 2029 (my current median) or 2032 (the aggregate median of the rest of my team).
I think maybe my main disagreement with you has to do with the claim that each generation of AIs will be more aligned than the previous one. A very important point, I think, is that we don’t have perfect evals for alignment, and probably won’t for some time. That is, our eval suites will catch some kinds of misalignment but not others, so there will probably continue to be misalignments, including very major ones, that we don’t catch until it’s too late. So it’s unclear whether our AIs will actually be improving in alignment over time. They’ll probably be improving in apparent alignment, but who knows what’s happening with the kinds of misalignment we can’t effectively test for; those kinds could be getting worse and worse. Indeed, we have some reason to think this will happen: AI 2027 even gives a fairly concrete model according to which the misalignments get worse over time despite things looking better and better on evals. E.g. at first you are mostly just summoning personas via prompts, which is pretty benign, but as RL scales up that tends to get distorted and undermined by training incentives.
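To make the shape of that worry concrete, here is a deliberately toy numerical sketch (my own illustration, with made-up dynamics and parameters, not anything from AI 2027): if the eval suite only sees some dimensions of misalignment, training against it can make the measured numbers improve every generation while the unmeasured part quietly grows as RL scales up.

```python
# Toy sketch (made-up dynamics, not a real model): alignment evals that only
# cover some dimensions of misalignment can show steady improvement while the
# uncovered dimensions get worse under increasing RL pressure.

def train_next_generation(visible, hidden, rl_scale):
    """One 'generation' of training against imperfect evals (toy dynamics)."""
    visible *= 0.5                    # optimization pressure removes what the evals can see
    hidden *= 1.0 + 0.3 * rl_scale    # unmeasured misalignment drifts upward as RL scales
    return visible, hidden

visible, hidden = 1.0, 1.0            # start with equal eval-visible and hidden misalignment
for n in range(1, 7):
    rl_scale = n / 6                  # RL gets scaled up over successive generations
    visible, hidden = train_next_generation(visible, hidden, rl_scale)
    print(f"Agent-X-{n}: eval-visible misalignment {visible:.3f} (improving), "
          f"hidden misalignment {hidden:.3f} (worsening)")
```

The numbers are arbitrary; the point is just that “Agent-X-n+1 scores better on our evals than Agent-X-n” and “Agent-X-n+1 is more aligned than Agent-X-n” can come apart.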
First of all, I think that we will see the intelligence explosion once the AIs become superhuman coders. In addition, I don’t understand how Agent-X-n+1 will become more aligned than Agent-X-n unless mankind creates a new training environment which actually ensures that the AI obeys the Spec. For example, sycophancy was solved by the Kimi K2 team, which dared to drop RLHF in favor of RLVR and self-critique.
However, there is a glimmer of hope. For example, one could deploy AIs from different companies to cross-check each other’s AI research. Alas, this technique might still run into problems, e.g. if the companies have already been merged because Taiwan was invaded, or if the AIs manage to agree on a common future among themselves. I did try to explore this technique and its potential results back when I wrote my own version of the AI 2027 scenario.