TL;DR
You’ve taken a task that’s simple enough to be done by humans, so outer misalignment failures don’t matter, and you’ve assumed away any inner misalignment failures, which again is only possible because the chosen task is easy compared to the really critical tasks we’d need an ASI to do.
Build a Football Stadium
You’ve made a bit of a reference-class error by choosing something which humans can already do perfectly well, and can easily verify.
As you pointed out, this task can be broken down into smaller tasks that predictable AIs can do in predictable ways. Alternatively it could (in theory) be done by building a simulator to do simple RL in, which requires us to understand the dynamics of the underlying system in great detail.
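To make the simulator-plus-RL route concrete, here’s a minimal toy sketch (all states, actions and rewards are invented for illustration). The thing to notice is that every transition the agent can ever experience has to be written down by us, which is only feasible because we already understand the dynamics of stadium-building:

```python
import random

# Toy "stadium construction" environment. Every state transition is
# hand-specified, which is only possible because we understand the
# underlying dynamics. All names here are made up for illustration.
STATES = ["empty_site", "foundation", "stands", "stadium_done"]
ACTIONS = ["pour_foundation", "build_stands", "fit_out"]

def step(state, action):
    """Hand-coded dynamics: returns (next_state, reward)."""
    transitions = {
        ("empty_site", "pour_foundation"): ("foundation", 0.0),
        ("foundation", "build_stands"): ("stands", 0.0),
        ("stands", "fit_out"): ("stadium_done", 1.0),
    }
    return transitions.get((state, action), (state, -0.1))  # wasted effort

# Simple tabular Q-learning against the toy simulator.
q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(2000):
    s = "empty_site"
    while s != "stadium_done":
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda a: q[(s, a)]))
        s_next, r = step(s, a)
        q[(s, a)] += alpha * (r + gamma * max(q[(s_next, a2)] for a2 in ACTIONS) - q[(s, a)])
        s = s_next

print(max(ACTIONS, key=lambda a: q[("empty_site", a)]))  # -> pour_foundation
```

Doing the same for a pivotal task would require us to hand-write (or otherwise obtain) dynamics we don’t understand, which is the whole problem.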
For genuinely difficult tasks, ones that humans can’t do, you don’t really have those options; your options look more like:
1. Use a system made out of smaller components. Since you don’t know how to do the task, this will require allowing the AIs to spin up other AIs as needed, and it might also require AIs to build new components to do tasks you didn’t anticipate. So you just end up with a big multi-AI blob which you don’t understand (a toy sketch of this follows below).
2. Build an RL agent to do it, training in the real world (because the required simulator would be too difficult to build) and/or generalizing across tasks.
Either way this runs into the misalignment problems already stated.
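Here is the toy sketch promised above for option 1 (hypothetical interfaces, not any real framework). The point is that the agent tree is chosen at runtime by the system itself, so its shape isn’t something we specified or can predict in advance:

```python
# Toy sketch of why option 1 tends toward an opaque multi-AI blob: each agent
# is free to decompose its task and spawn new sub-agents for pieces the
# designer never anticipated. All names here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Agent:
    task: str
    children: list = field(default_factory=list)

    def solve(self, depth: int = 0) -> None:
        # Stand-in for "ask a model how to decompose the task": the real
        # system decides this itself, which is why we can't predict the tree.
        if depth < 2:
            for sub in (f"{self.task}/subtask_{i}" for i in range(3)):
                child = Agent(sub)
                self.children.append(child)
                child.solve(depth + 1)

root = Agent("do_the_pivotal_thing")
root.solve()

def count(agent: Agent) -> int:
    return 1 + sum(count(c) for c in agent.children)

print(count(root))  # 13 agents from one request, structure chosen at runtime
```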
Outer Misalignment Problems
An agent which “builds football stadiums” in the real world has to deal with certain adversarial conditions: malicious regulators, trespassers and squatters, scammy contractors. So it needs to have some level of “deal with threats” built in. Specifying exactly which threats to deal with (and how to deal with them) might be possible for something non-pivotal like “Build a football stadium”, but for tasks outside the scope of human actions this is an extremely hard problem.
Imagine specifying the exact adversarial conditions that an AI should deal with vs should not deal with in order to do something really important like “Liberate the people of North Korea” or “Eliminate all diseases”.
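One crude way to picture the specification burden (everything below is a hypothetical sketch, not a proposal): for the stadium you can just about imagine hand-writing a threat-response whitelist with a safe default, but for a pivotal task the unanticipated cases dominate, and the safe default stops being an option:

```python
# Hypothetical hand-written spec of which adversarial conditions the agent may
# respond to, and how. Feasible (maybe) for a stadium; hopeless for tasks whose
# adversarial landscape we can't even enumerate. Names invented for the example.
THREAT_POLICY = {
    "trespasser_on_site":     "call_site_security",
    "contractor_overbilling": "escalate_to_human_auditor",
    "permit_delayed":         "file_appeal_via_lawyers",
}
DEFAULT = "pause_and_ask_human"

def respond(threat: str) -> str:
    # Anything we didn't anticipate falls through to the human.
    return THREAT_POLICY.get(threat, DEFAULT)

print(respond("trespasser_on_site"))  # call_site_security
print(respond("rival_state_actor"))   # pause_and_ask_human -- but for a
                                      # pivotal task this default dominates
```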
Inner Misalignment Problems
An idealized AI which you have already specified to be just building football stadiums is all well and good, but that’s not how we get AIs. What you do is draw a random AI from a distribution specified by “Looks sufficiently like it just builds football stadiums, in training”.
There are lots of things which behave nicely in training, but not when deployed.
You can maybe make this work when you have a goal like “build football stadium” where a sufficiently high-quality simulation can be created, but how are you going to make this work for goals which are too complex for us to build a simulation of?
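To make the “behaves nicely in training, not when deployed” point concrete, here’s a minimal numerical sketch (toy data, invented feature) of a learned behaviour that is indistinguishable from the intended goal against the training-time check, and then pursues the proxy once a spurious correlation breaks:

```python
# Each episode is (crane_present, stadium_actually_built). In our training
# simulation the two always coincide, so "go where the cranes are" scores
# exactly like "build stadiums". All data here is invented for illustration.
train = [(True, True)] * 50 + [(False, False)] * 50

def proxy_behaviour(crane_present: bool) -> bool:
    # Stand-in for the learned behaviour: it latched onto the proxy feature.
    return crane_present

train_score = sum(proxy_behaviour(c) == built for c, built in train) / len(train)
print(train_score)  # 1.0 -- looks exactly like "just builds football stadiums"

# Deployment: the correlation breaks (cranes hired but nothing built, or a
# prefab stadium with no cranes). The same behaviour now tracks the proxy.
deploy = [(True, False)] * 50 + [(False, True)] * 50
deploy_score = sum(proxy_behaviour(c) == built for c, built in deploy) / len(deploy)
print(deploy_score)  # 0.0
```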
The class of “things which pursue goals”
This is actually a fair point, but I disagree that this is a reasonable reference class.
I think o-series models currently look, roughly, like a big pile of goal-pursuing-ness. (See the comment by nostalgebraist here.) You give them a prompt, they attempt to infer the corresponding RLVR task, and they pursue that goal; this generalizes even to tasks with no exact RLVR outcome.
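For readers not familiar with RLVR (reinforcement learning from verifiable rewards), the training signal is roughly “did the output pass an automatic check”; a minimal sketch of that reward shape (toy verifier, invented names):

```python
# Minimal sketch of an RLVR-style reward: completions are scored by an
# automatic verifier, 1.0 if they pass and 0.0 otherwise. The model is thereby
# shaped to infer "which check is this prompt attached to" and satisfy it,
# which is the goal-pursuing flavour described above. Toy example only.
def verifier(prompt: str, completion: str) -> bool:
    # Toy verifiable task: exact-match arithmetic.
    a, b = map(int, prompt.removeprefix("What is ").removesuffix("?").split("+"))
    return completion.strip() == str(a + b)

def reward(prompt: str, completion: str) -> float:
    return 1.0 if verifier(prompt, completion) else 0.0

print(reward("What is 2+3?", "5"))  # 1.0
print(reward("What is 2+3?", "6"))  # 0.0
```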
I haven’t assumed away inner misalignment; the unwanted emergence of instrumentally convergent goal-seeking is a key premise of inner misalignment theory. I briefly mention Hubinger’s work specifically, too.
I’d say disease elimination is quantitatively but not qualitatively different to stadium building. “Eliminating threatening agents” still doesn’t make the list of necessities; there’s still a lot of exceptionally good but recognisable science, negotiation, project management, and office politics to do. NK-type issues I do sort of call out as riskier.
(It’s also not obvious that a disease-elimination simulation is harder than a football-stadium simulation, but both seem kind of sci-fi to me tbh)