I’m still very vague on Yudkowsky’s main counterargument to Ngo in the dialogues — about how saving the world requires a powerful search over a large space of possibilities, and therefore by default involves running dangerous optimizers that will kill us. This is a more concrete question aimed at making my understanding less vague; Yudkowsky said:
“AI systems that do better alignment research” are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work. If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work. Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.
I don’t understand why alignment research falls into this bucket of “world-savingly strong, therefore lethally strong”. My intuitive reasoning is: the inner-alignment problem is a math problem, about certain properties of things involving functions A → (B → A) or whatever; and if we actually knew how to phrase that math problem crisply and precisely (my outsider impression is nobody does currently?), and if we had a theorem-prover that could answer math questions that hard, that would solve it safely — in particular, it wouldn’t involve knowing anything about the universe or us. It doesn’t need to steer the universe into a tiny region of state space that’s a bright future; solving the math problem doesn’t steer the universe into any particular place at all.
Yudkowsky said in one of the conversations something like, if you could save the world with a good theorem prover, that would be the best available plan; why couldn’t that in principle apply to the inner-alignment problem?
(My guess is that Yudkowsky’s model’s answer is something like: “a theorem prover whose work involves investigating the properties of optimization processes will sure have a lot of optimization processes running around in it / algorithmically close to it; one of those will be what kills the theorem-prover, and then us. I.e., the proof search isn’t the dangerous search, but if you’re asking math questions about dangerous searches, that’s liable to go terribly wrong. You are now two outer-shell commands away from instant death (one to feed in the universe & the solution to the outer-alignment problem, and another to execute the resulting plan), which is close enough to instantly killing everyone in reality.” But even if that’s on the right track, I’m very vague on what that actually means; elaboration/clarification requested.)
If humans could state a theorem such that, if only a theorem-prover would give us a machine-verifiable proof of it, the world would thereby be saved, we would be a great deal closer to survival than we are in reality.
This isn’t how anything works, alas, nor is it a way that anything can work.
I remain confused as to why — or why you couldn’t have a theorem that would substantially help, at any rate. I’m dreaming of a statement like, I don’t know, ∃(f:(A→R)→A)[P(f)], where P picks out the set of optimizers satisfying certain properties like “doesn’t, while running, create other optimizers that cause it to fail to optimize the outer objective” or something. And then a proof of that would return a particular way of doing optimization, f, together with a proof that it satisfies those properties. (I apologize for being hopelessly vague.)
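To be slightly less hopeless about the vagueness, here is a toy Lean rendering of the shape of statement I have in mind. Every name here is mine, and P is left as an opaque stand-in for exactly the part nobody currently knows how to state precisely:

```lean
-- A toy rendering (entirely my own phrasing) of the shape of statement I mean.
-- `A` is a type of actions, `R` a type of rewards; `P` is an opaque stand-in
-- for the hoped-for inner-alignment properties, which nobody knows how to
-- state precisely yet.
variable (A R : Type) (P : ((A → R) → A) → Prop)

-- The dreamed-of theorem: some optimizer `f` of type (A → R) → A satisfies P.
-- A constructive proof would hand back a particular `f` together with a
-- certificate that `P f` holds.
example : Prop := ∃ f : (A → R) → A, P f
```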
Is the problem (1) that I’m totally wrong about the extent to which inner alignment could, in actual reality, be phrased by humans as a theorem like that, (2) that it wouldn’t help much with saving the world if the search succeeded, (3) that the proof search is dangerous, or (4) something else?
I think you’re saying it’s not (3) (or if it is, it’s not the main problem), and that it is (1). If so, why? (Hoping you can convey some of the intuitions, but I can accept “listen, you don’t have any idea of what you’re asking for there, and if you did understand at all, you’d see why it’s not possible for humans to do it in practice; it’s not something I can explain briefly, but trust me, it’s a waste of time to even look in that direction”.)
The way the type (A→O)→A corresponds loosely to the “type of agency” (if you kinda squint at the arrow symbol and play fast-and-loose) is that it suggests a machine that eats a description of how actions (A) lead to outcomes (O), and produces from that description an action.
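Concretely, if you wrote that type down in (say) Haskell, it would look something like this — a throwaway illustration, with the name made up:

```haskell
-- A throwaway rendering of the "type of agency": an Agent eats a description
-- of how actions (a) lead to outcomes (o), and produces from that description
-- an action.
type Agent a o = (a -> o) -> a
```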
Consider stating an alignment property on elements f of this type. What sort of thing must it say?
Perhaps you wish to say “when f is fed the actual description of the world, it selects the best possible action”. Congratulations, f in fact exists, it is called argmax. This does not help you.
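To spell out why that existence claim is cheap: assuming a finite, enumerable action type, brute force already inhabits the type. Again a throwaway sketch, not a real proposal:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Brute-force argmax has exactly the shape (a -> o) -> a: score every single
-- action and keep the best.  That it exists tells you nothing useful; all the
-- difficulty hides in the objective you feed it and in the astronomical size
-- of any realistic action space.
argmax :: (Enum a, Bounded a, Ord o) => (a -> o) -> a
argmax utility = maximumBy (comparing utility) [minBound .. maxBound]
```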
Perhaps you instead wish to say “when f is fed the actual description of the world, it selects an action that gets at least 0.5 utility, after consuming only 10^15 units of compute” or whatever. Now, set aside the fact that you won’t find such a function with your theorem-prover AI before somebody else has ended the world (understanding intelligence well enough to build an optimizer you can prove that theorem about is pro’lly harder than whatever else people are deploying AGIs towards), and set aside also the fact that you’re leaving a lot of utility on the table; even if that worked, you’re still screwed.
Why are you still screwed? Because the resulting function has the property “if we feed in a correct description of which actions have which utilities, then the optimizer selects an action with high utility”. But an enormous chunk of the hard work is in the creation of that description!
For one thing, while our world may have a simple mathematical description (à la “it’s some quantum fields doing some quantum fielding”), we don’t yet have the true name of our universe. For another thing, even if we did, the level of description that an optimizer works with likely needs to be much coarser than this. For a third thing, even if we had a good coarse-grained description of the world, calculating the consequences that follow from a given action is hard. For a fourth thing, evaluating the goodness of the resulting outcome is hard.
If you can do all those things, then congrats! You’ve solved alignment (and a good chunk of capabilities). All that’s left is the thing that can operate your description and search through it for high-ranked actions (a remaining capabilities problem).
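To make the division of labor vivid, here is a toy decomposition of where the work sits — all names made up, every hard piece left as a stub:

```haskell
-- The final argmax-shaped search is the "remaining capabilities problem";
-- everything feeding it is what stating the theorem quietly assumed you
-- already had.

data Action   -- some encoding of plans; placeholder
data Outcome  -- some coarse-grained description of results; placeholder

-- Hard: we lack the true name of the universe, a workable coarse-graining of
-- it, and a cheap way to compute the consequences of a given action.
consequences :: Action -> Outcome
consequences = error "unsolved"

-- Hard: evaluating how good the resulting outcome actually is.
goodness :: Outcome -> Double
goodness = error "unsolved"

-- The leftover: search the description for high-ranked actions.
search :: (Action -> Double) -> Action
search = error "a capabilities problem, once the pieces above exist"

plan :: Action
plan = search (goodness . consequences)
```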
This isn’t intended to be an argument that there does not exist any logical sentence such that a proof of it would save our skins. I’m trying to say something more like: by the time you can write down the sorts of sentences people usually seem to hope for, you’ve probably solved alignment, and can describe how to build an aligned cognitive system directly, without needing to postulate the indirection where you train up some other system to prove your theorem.
For this reason, I have little hope in sentences of the form “here is an aligned AGI”, on account of how once you can say “aligned” in math, you’re mostly done and probably don’t need the intermediate. Maybe there’s some separate, much simpler theorem that we could prove and save our skins—I doubt we’ll find one, but maybe there’s some simple mathematical question at the heart of some pivotal action, such that a proof one way or the other would suddenly allow humans to… <??? something pivotal, I don’t know, I don’t expect such a thing, don’t ask me>. But nobody’s come up with one that I’ve heard of. And nobody seems close. And nobody even seems to be really trying all that hard. Like, you don’t hear of people talking about their compelling theory of why a given mathematical conjecture is all that stands between humans and <???>, and them banging out the details of their formalization which they expect to only take one more year. Which is, y’know, what it would sound like if they were going to succeed at banging their thing out in five years, and have the pivotal act happen in 15. So, I’m not holding my breath.
Thanks, this helps a lot. The point about a lot of the work being in the mapping from actions to utilities is well-taken.
This line of thinking came from a vague intuition that it ought to be possible to do what evolution and gradient descent can’t, but do it faster than argmax, and have it be not-impossibly-complicated. But it sounds like the way I was thinking about that was not productive; thanks.
It’s #1, with a light side order of #3 that doesn’t matter because #1.
I’m not sure where to start on explaining. How would you state a theorem that an AGI would put two cellular-identical strawberries on a plate, including inventing and building all technology required to do that, without destroying the world? If you can state this theorem you’ve done 250% of the work required to align an AGI.
Thanks, this helps. But I think what I was imagining wouldn’t be enough to let you put two cellular-identical strawberries on a plate without destroying the world? Rather, it would let you definitely put two cellular-identical strawberries on a plate, almost certainly while destroying the world.
My understanding is that, right now, we couldn’t design a paperclip maximizer if we tried; we’d just end up with something that is to paperclip maximization what we are to inclusive genetic fitness. (Is that even right?) That’s the problem that struck me as maybe-possibly amenable to proof search.
So, proving the theorem would give you a scheme that can do what evolution and gradient descent can’t (and faster than argmax). And then if you told it to make strawberries, it’d do that while destroying the world; if you told it to make strawberries without destroying the world, it’d do that too, but that would be a lot harder to express since value is fragile. So what I was imagining wouldn’t be enough to stop everyone from dying, but it would make progress on alignment.
(As for how I’d do it, I don’t know, I think I don’t understand what “optimization” even is, really 🙁)
Hopefully it makes sense what I’m asking — is it still the case that my intuition about this maybe-possibly being amenable to proof search is just wrong, y/n?