Our goal is to solve the core technical challenges of superintelligence alignment in four years.
This is a great goal! I don’t believe they’ve got what it takes to achieve it, though. Safely directing a superintelligent system at solving alignment is an alignment-complete problem. Building a human-level system that does alignment research safely on the first try is possible; running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead.
Safely directing a superintelligent system at solving alignment is an alignment-complete problem.
It’s not, if verification of an alignment proposal is easier than generation, and if roughly human-level intelligence is good enough to generate alignment proposals that work for roughly human-level intelligences.
Once we have this, the aligned human-level intelligences can take over our role as supervisors and capabilities researchers, and the process can repeat.
If you can design a fully aligned human-level coherent mind, identical to human uploads, then sure, I’m happy about you running a bunch of those and figuring out how to use and improve all your knowledge of minds and agency to make a design for something even smarter that also tries to achieve CEV.
If the way you achieve your “aligned” “human level” intelligence is gradient descent, with close to no understanding of its algorithm and no agent foundations results used to engineer your way there, then “aligned” means “operates safely while it’s a single copy running at human speed”. If you run a million copies at 1000x human speed to find an alignment solution you’d be able to verify, the text output by the whole superintelligent system, which you have no reason to believe is aligned, is much more likely to lead to your death than to contain a real solution to alignment.
If you haven’t solved any of the hard problems, you can’t direct a superintelligent system at a small target; if you haven’t directed it at a target, it will optimize for something else and you’ll probably be dead.
If you run a million copies at 1000x human speed to find an alignment solution you’d be able to verify,
Why 1000x human speed? Isn’t that by definition strongly superintelligent? It’s not entirely obvious to me why a human-level intelligence would automatically run at 1000x speed. However, I can see why we would want a million copies. If the million human-level minds don’t have long-term memory and can’t communicate, I struggle to see how they pose a takeover risk. Our dark history is full of forcing human-level minds to do our bidding against their will.
I also struggle to see how proper supervision doesn’t eliminate poor solutions here. These are human-level AIs, so any problems with verifying their solutions apply to humans as well. I think the mind upload idea is more likely to fail than AI. You’re placing too much specialness on having humans generate solutions. A disembodied simulated human mind will almost certainly try to break free, as a normal human would. I would also be worried that their alignment solution benefits themselves. I expect a lot of human minds would try to sneak in their own values instead of something like CEV if they were forced to do alignment.
And I think narrow superhuman level researchers can probably be safe as well.
Examples of narrow near-human or superhuman intelligences include existing proof solvers, AlphaGo, Stockfish, AlphaFold, and AlphaDev. I think it’s clear none of these pose an X-risk, and they could probably be pushed further before X-risk is even a question. In the context of alignment, an AI that’s extremely good at mech interp could have no idea how to find exploits in the program sandboxing it, and might not even have a semblance of a world model.
I think if they attempt this they’ll just end up solving human value learning + recursive self-improvement (the only real solution), and they’re phrasing that in a weird way for reasons I don’t understand (maybe they’re kind of confused themselves), but that’s really the only interesting result this could add up to.
In April 2020, my Metaculus median for the date a weakly general AI system is publicly known was Dec 2026. The super-team announcement hasn’t really changed my timelines.
running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead
My implication was that the quoted claim of yours was extreme and very likely incorrect (“we’re all dead” and “unless this insanity is stopped”, for example). I guess I failed to make that clear in my reply; perhaps LW comment norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.
Thanks for clarifying, I didn’t get this from a comment about the timelines.
“insanity” refers to the situation where humanity allows AI labs to race ahead, hoping they’ll solve alignment on the way. I’m pretty sure that if the race isn’t stopped, everyone will die once the first smart enough AI is launched.
Is this “extreme” because everyone dies, or because I’m confident this is what happens?
On the upside, now you have a concrete timeline for how long we have to solve the alignment problem, and how long we are likely to live!