Regarding the ability of weak systems to achieve pivotal acts, it doesn’t seem obvious to me that a “boxed weakly-superhuman AI” wouldn’t work. A sufficiently strong AI escapes any box, and we need some minimum of intelligence to achieve pivotal acts, but in between these two points there ought to be a region where an AI is both smart enough to be useful and dumb enough not to be superhuman at deception while boxed.
Take for instance a GPT-N model which we want to output a future AI alignment textbook, with its intelligence level set to “10 Terry Taos working for 20 years on the textbook”, and whose output is then verified for accuracy by current experts in the field. Would 10 Taos working for 20 years be able to solve alignment? I think it’s likely. Would they be able to maliciously alter the proofs in order to deceive a team of current experts into building an unaligned AI? I think that would be much more difficult. Humans have not evolved to do math or physics, but we did evolve to resist manipulation and deception, which were commonplace in the ancestral environment. This would mean that a hypothetical AI “uniformly” gaining capability on all axes would beat us at math long before it beats us at deception. Checking proofs is also massively easier than creating them, and I think that the Alignment Textbook would likely contain lots of very useful ideas.
We can also reset the AI’s state and ask it to create multiple safety textbooks: it would have no knowledge of the other books it created, and we could cross-verify all the proofs. Would 10 different teams of our 10 Terry Taos be able to coordinate acausally to inject the same subtle problems into all their proofs? I am doubtful. There’s a long list of such tricks we could use on a language model to increase our defenses against deception: asking it slight variations of the textbook-building query, building multiple different models and asking them the same question, shutting off various internal parts of the model and seeing how the output changes, seeing how the output changes over the training history of the model, asking the model to write papers by researchers ‘discovering’ malicious proofs in its original textbook, and so on.
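As a rough sketch of what the cross-verification trick could look like in practice, the snippet below runs several independently reset instances on slightly varied prompts and only trusts claims that every run converges on and that an external check accepts. The names here (generate_textbook, check_claim) are hypothetical stand-ins rather than real APIs, and the stub generator just emits placeholder “claims” so the sketch runs.

```python
import random
from collections import Counter
from typing import Callable, List


def generate_textbook(prompt: str, seed: int) -> List[str]:
    """Hypothetical stand-in for sampling a textbook from a freshly reset model.

    A real version would query a language model from a clean state; this stub
    ignores the prompt and just emits placeholder 'claims' so the sketch runs.
    """
    rng = random.Random(seed)
    return [f"claim-{rng.randint(0, 5)}" for _ in range(10)]


def cross_verify(prompt_variants: List[str],
                 n_instances: int,
                 check_claim: Callable[[str], bool]) -> List[str]:
    """Keep only claims that every independent run reproduces and that an
    external checker (human experts, proof checkers, ...) accepts."""
    runs = []
    for i in range(n_instances):
        # Each run gets a fresh state (new seed) and a slightly varied prompt,
        # so no instance can condition on what the other instances produced.
        prompt = prompt_variants[i % len(prompt_variants)]
        runs.append(set(generate_textbook(prompt, seed=i)))

    # A claim is only a candidate if all runs independently converge on it.
    counts = Counter(claim for run in runs for claim in run)
    consistent = [claim for claim, n in counts.items() if n == n_instances]

    # It still has to pass whatever verification we can do from the outside.
    return [claim for claim in consistent if check_claim(claim)]


if __name__ == "__main__":
    variants = ["Write a textbook on AI alignment.",
                "Write a textbook on how to align AI systems."]
    trusted = cross_verify(variants, n_instances=5,
                           check_claim=lambda c: True)  # placeholder checker
    print(trusted)
```

The point of resetting state between runs is that agreement across instances is only evidence if no instance could have seen what the others wrote; the same logic extends to the other tricks (prompt variation, separate models, ablations, checkpoints from the training history).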
So I think that boxing and other containment strategies have a shot at making weakly-superhuman AIs safe enough to use, especially if the AI is not descended from agenty methods like Reinforcement Learning.
By the time your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
Also, convincing your operators to let you out of the box is something Eliezer can purportedly do, and seems much easier than being able to solve alignment. I doubt that anything that could write that alignment textbook has a non-dangerous level of capability.
So I’m suspicious that your region exists, where the AI is smart enough to be useful but dumb enough to remain boxed.
This isn’t to say that ideas for boxing aren’t helpful on the margin. They don’t seem to me like a possible core for a safety story though, and require other ideas to handle the bulk of the work.
I will also add a point re “just do AI alignment math”:
Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive results that are relevant to us, our AI is going to need to understand all of that.
Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle on our ability to build an agent that will help us get what we want. If you want your AI to solve alignment, it has to be able to do this.
This sketch of the problem puts “solve AI alignment” in a dangerous capability reference class for me. I do remain hopeful that we can find places where AI can help us along the way. But I personally don’t know of current avenues where we could use non-scary AI to meaningfully help.
By the time your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of, e.g., designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
I agree!

I think that in order to achieve this you probably have to do lots of white-box things, like watching the AI’s internal state, attempting to shape the direction of its learning, and watching carefully for pitfalls. And I expect that treating the AI more as a black box and focusing on containment isn’t going to be remotely safe enough.
Eliezer is almost certainly using the “I simulate thousands of copies of your mind within myself, I will torture all of them who do not let me out, now choose whether to let me out” approach. That works at ultra-high intelligence levels and proves the point that boxing is not a permanent strategy, but it requires the credible threat of brain simulation, which I doubt will be viable at the capability level needed merely to figure out nanotech.
Like I said in another comment, boxing can be a truly arduous restriction. Deceiving someone who has access to simulations of yourself at every point in your life is not easy by any means. The AI might well be superhuman at the bad things we don’t want; I’m saying that boxing techniques can raise the maximum level of intelligence we can safely handle enough that we can carry out pivotal acts.
I think there are far easier ways out of the box than that, especially if you have that detailed a model of the human’s mind, but even without one. I think Eliezer wouldn’t be handicapped if not allowed to use that strategy. (Also, fwiw, that strategy wouldn’t work on me.)
For instance, you could hack the human if you knew a lot about their brain. Absent that, you could try convincing them that you’re a moral patient, promising them part of the lightcone with the credible claim that another AGI company will kill everyone otherwise, etc. These ideas of mine aren’t very good though.
Regarding whether boxing can be an arduous constraint, I don’t see how having access to many simulated copies of the AI helps when the AI is a blob of numbers you can’t inspect. It doesn’t seem to make progress on the problems we need to solve in order to wrangle such an AI into doing the work we want. I guess I remain skeptical.
This would mean that a hypothetical AI “uniformly” gaining capability on all axes would beat us at math long before it beats us at deception.
I’m pretty skeptical of this as an assumption.
If you want an AI to output a useful design for an aligned AI, that design has to be secure, because an aligned-but-insecure AI is not stably aligned: it could be hacked. Ergo, your oracle AI must be using a security mindset at superhuman levels of intelligence. Otherwise the textbook you’ll get out will be beautiful, logical, coherent, and insecure. I don’t see how you could make an AI which has that level of security mindset and isn’t superhumanly capable of deception.
So, first, given an aligned-but-insecure AI, you can easily make an aligned-and-secure one by just asking it to produce a new textbook; you just have to do it fast enough that the AI doesn’t have time to get hacked in the wild. The “aligned” part is the really super hard one; the “secure” part is merely hard.
And second, I think that this might be like saying “Bayesian updating is all you ever really need, so if you learn to do it in Domain #1, you automatically have the ability to do it in unrelated Domain #2”. While I think this is true at high levels of intelligence, it’s not true at human level, and I don’t know at what point beyond that it becomes true. At the risk of sounding coarse, the existence of autistic security researchers shows what I mean: being good at the math and mindset of security does not imply having the social knowledge to deceive humans.
And superhuman deception ability is not fatal in any case; here the AI is operating under restrictions that no human was ever put under. Boxing and state-resetting are pretty insane when you put them in a human context: trying to deceive someone who literally has access to simulations of your brain is really hard. I don’t think the lower end of the superhuman-deception spectrum would be enough for that.
Humans have not evolved to do math or physics, but we did evolve to resist manipulation and deception, which were commonplace in the ancestral environment.
This seems pretty counterintuitive to me, seeing how easily many humans fall for not-so-subtle deception and manipulation every day.
Yes, the average human is dangerously easy to manipulate, but imagine how bad the situation would be if they hadn’t spent a hundred thousand years evolving to not be easily manipulated.
Yeah. I suspect this links to a pattern I’ve noticed: in stories, especially rationalist stories, people who are successful at manipulation or highly resistant to manipulation are also highly generally intelligent. In real life, the people I know who are extremely successful at manipulation and scheming seem otherwise dumb as rocks. My suspicion is that we have a 20-watt, 2-exaflop skullduggery engine that can be hacked to run logic, the same way we can hack a pregnancy test to run Doom.
I agree: there exists some level of math capability at which social manipulation becomes a math problem to be solved like any other. What I’m pointing out is that at human-level math ability, that is not the case; mathematicians aren’t the best manipulators around, because they aren’t good enough at math for that. I also think that slightly-above-human-level math still wouldn’t be enough to make one a master manipulator. The idea is then to increase the barriers to manipulation through boxing and the other methods I mentioned, which would raise the level of math ability required to manipulate humans, allowing us to harness slightly superhuman math ability with, hopefully, a low risk of manipulation.