Great exchange! Very clear and civilized, I thought.
Wang seems to be hung up on this “adaptive” idea and is anthropomorphising the AI to be like humans (ignorant of changeable values). It will be interesting to see if he changes his mind as he reads Bostrom’s stuff.

EDIT: in case it’s not clear, I think Wang is missing a big piece of the puzzle (namely that AIs are optimizers (Yudkowsky), and optimizers will behave in certain dangerous ways (Bostrom)).
I think his main point is in the summary:

though the safety of AGI is indeed an important issue, currently we don’t know enough about the subject to make any sure conclusion. Higher safety can only be achieved by more research on all related topics, rather than by pursuing approaches that have no solid scientific foundation.

Not sure how you can effectively argue with this.

How ’bout the way I argued with it?

As for your suggestion that “Higher safety can only be achieved by more research on all related topics,” I wonder if you think that is true of all subjects, or only in AGI. For example, should mankind vigorously pursue research on how to make Ron Fouchier’s alteration of the H5N1 bird flu virus even more dangerous and deadly to humans, because “higher safety can only be achieved by more research on all related topics”? (I’m not trying to broadly compare AGI capabilities research to supervirus research; I’m just trying to understand the nature of your rejection of my recommendation for mankind to decelerate AGI capabilities research and accelerate AGI safety research.)
One might then ask “Well, what safety research can we do if we don’t know what AGI architecture will succeed first?” My answer is that much of the research in this outline of open problems doesn’t require us to know which AGI architecture will succeed first, for example the problem of representing human values coherently.
For example, should mankind vigorously pursue research on how to make Ron Fouchier’s alteration of the H5N1 bird flu virus even more dangerous and deadly to humans, because “higher safety can only be achieved by more research on all related topics”?
Yeah, I remember reading this argument and thinking how it does not hold water. The flu virus is a well-researched area. It may yet hold some surprises, sure, but we think that we know quite a bit about it. We know enough to tell what is dangerous and what is not. AGI research is nowhere near this stage. My comparison would be someone screaming at Dmitri Ivanovsky in 1892 “do not research viruses until you know that this research is safe!”.
My answer is that much of the research in this outline of open problems doesn’t require us to know which AGI architecture will succeed first, for example the problem of representing human values coherently.
Do other AI researchers agree with your list of open problems worth researching? If you asked Dr. Wang about it, what was his reaction?
My comparison would be someone screaming at Dmitri Ivanovsky in 1892 “do not research viruses until you know that this research is safe!”.
I want to second that. Also, when reading through this (and feeling the—probably imagined—tension of both parties to stay polite) the viral point was the first one that triggered the “this is clearly an attack!” emotion in my head. I was feeling sad about that, and had hoped that Luke would find another ingenious example.

Well, bioengineered viruses are on the list of existential threats...

And there aren’t naturally occurring AIs scampering around killing millions of people… It’s a poor analogy.

“Natural AI” is an oxymoron. There are lots of NIs (natural intelligences) scampering around killing millions of people.

And we’re only a little over a hundred years into virus research, much less on intelligence. Give it another hundred.

Wouldn’t a “naturally occurring AI” be an “intelligence” like humans?
For example, should mankind vigorously pursue research on how to make Ron Fouchier’s alteration of the H5N1 bird flu virus even more dangerous and deadly to humans, because “higher safety can only be achieved by more research on all related topics”?
That’s not really anyone’s proposal. Humans will probably just continue full-steam-ahead on machine intelligence research. There will be Luddite-like factions hissing and throwing things—but civilisation is used to that. What we may see is governments with the technology selfishly attempting to stem its spread—in a manner somewhat resembling the NSA crypto-wars.

This seems topical:

http://www.nature.com/news/controversial-research-good-science-bad-science-1.10511
For example, should mankind vigorously pursue research on how to make Ron Fouchier’s alteration of the H5N1 bird flu virus even more dangerous and deadly to humans...
Trivially speaking, I would say “yes”.
More specifically, though, I would of course be very much against developing increasingly more dangerous viral biotechnologies. However, I would also be very much in favor of advancing our understanding of biology in general and viruses in particular. Doing so will enable us to cure many diseases and bioengineer our bodies (or anything else we want to engineer) to highly precise specifications; unfortunately, such scientific understanding will also allow us to create new viruses, if we choose to do so. Similarly, the discovery of fire allowed us to cook our food as well as set fire to our neighbours. Overall, I think we still came out ahead.
I think there is something wrong with your fire analogy. The thing is that you cannot accidentally or purposefully burn all the people in the world or the vast majority of them by setting fire to them, but with a virus like the one Luke is talking about you can kill most people.
Yes, both a knife and an atomic bomb can kill 100,000 people. It is just way easier to do it with the atomic bomb. That is why everybody can have a knife but only a handful of people can “have” an atomic bomb. Imagine what the risks would be if we gave virtually everybody who was interested all the instructions on how to build a weapon 100 times more dangerous than an atomic bomb (like a highly contagious deadly virus).
The thing is that you cannot accidentally or purposefully burn all the people in the world or the vast majority of them by setting fire to them...
Actually, you could, if your world consists of just you and your tribe, and you start a forest fire by accident (or on purpose).
Yes, both a knife and an atomic bomb can kill 100,000 people. It is just way easier to do it with the atomic bomb. That is why everybody can have a knife but only a handful of people can “have” an atomic bomb.
Once again, I think you are conflating science with technology. I am 100% on board with not giving out atomic bombs for free to anyone who asks for one. However, this does not mean that we should prohibit the study of atomic theory; and, in fact, atomic theory is taught in high school nowadays.
When Luke says, “we should decelerate AI research”, he’s not saying, “let’s make sure people don’t start building AIs in their garages using well-known technologies”. Rather, he’s saying, “we currently have no idea how to build an AI, or whether it’s even possible, or what principles might be involved, but let’s make sure no one figures this out for a long time”. This is similar to saying, “these atomic theory and quantum physics things seem like they might lead to all kinds of fascinating discoveries, but let’s put a lid on them until we can figure out how to make the world safe from nuclear annihilation”. This is a noble sentiment, but, IMO, a misguided one. I am typing these words on a device that’s powered by quantum physics, after all.
His main agenda and desired conclusion regarding social policy is represented in the summary there, but the main point made in his discussion is “Adaptive! Adaptive! Adaptive!”. Where by ‘adaptive’ he refers to his conception of an AI that changes its terminal goals based on education.
Pei calls these “original goals” and “derived goals”. The “original goals” don’t change, but they may not stay “dominant” for long—in Pei’s proposed system.
though the safety of AGI is indeed an important issue, currently we don’t know enough about the subject to make any sure conclusion. Higher safety can only be achieved by more research on all related topics, rather than by pursuing approaches that have no solid scientific foundation.
Not sure how you can effectively argue with this.
Far from considering the argument irrefutable, I found it superficial and essentially fallacious. The core of the argument is the claim ‘more research on all related topics is good’, which fails to include the necessary ceteris paribus clause and ignores the details of the specific instance that suggest that all else is not, in fact, equal.
Specifically, we are considering a situation where there is one area of research (capability), the completion of which will approximately guarantee that the technology created will be implemented shortly after (especially given Wang’s assumption that such research should be done through empirical experimentation). The second area of research (how to ensure desirable behavior of an AI) is not necessary to complete in order for the first to be implemented. If both technologies need to have been developed by the time the first is implemented in order for the outcome to be safe, then the second must be completed at the same time as, or earlier than, the capability needed to implement the first.
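Stated compactly (notation mine, simply restating the condition above): if the capability research is complete, and hence implemented, at time $t_{\text{cap}}$, and the safety research is complete at time $t_{\text{safe}}$, then a safe outcome requires

$$t_{\text{safe}} \le t_{\text{cap}}.$$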
rather than by pursuing approaches that have no solid scientific foundation.
(And this part just translates to “I’m the cool one, not you”. The usual considerations on how much weight to place on various kinds of status and reputation of an individual or group apply.)
...and is anthropomorphising the AI to be like humans...
Considering that one of the possible paths to creating AGI is human uploading, he may not be that far off.
Wang seems to be hung up on this “adaptive” idea and is anthropomorphising the AI to be like humans...
Hmm, I got the opposite impression, though maybe I’m reading too much into his arguments. Still, as far as I understand, he’s saying that AIs will be more adaptive than humans. The human brain has many mental blocks built into it by evolution and social upbringing; there are many things that humans find very difficult to contemplate, and humans cannot modify their own hardware in non-trivial ways (yet). The AI, however, could—which means that it would be able to work around whatever limitations we imposed on it, which in turn makes it unlikely that we can impose any kind of stable “friendliness” restrictions on it.
he’s saying that AIs will be more adaptive than humans
Which is true, but he is saying that this will extend to AIs being more morally confused than humans as well, which they have no reason to be (and much reason to self-modify not to be; see Bostrom’s stuff).
which means that it would be able to work around whatever limitations we imposed on it, which in turn makes it unlikely that we can impose any kind of stable “friendliness” restrictions on it.
The AI has no incentive to corrupt its own goal architecture. That action is equivalent to suicide. The AI is not going to step outside of itself and say “hmm, maybe I should stop caring about paperclips and care about safety pins instead”; that would not maximize paperclips.
Friendliness is not “restrictions”. Restricting an AI is impossible. Friendliness is giving it goals that are good for us, and making sure the AI is initially sophisticated enough to not fall into any deep mathematical paradoxes while evaluating the above argument.
For certain very specialized definitions of AI. Restricting an AI that has roughly the optimizing and self-optimizing power of a chimpanzee, for example, might well be possible.
The AI has no incentive to corrupt its own goal architecture. That action is equivalent to suicide.
Firstly, the AI could easily “corrupt its own goal architecture” without destroying itself, f.ex. by creating a copy of itself running on a virtual machine, and then playing around with the copy (though I’m sure there are other ways). But secondly, why do you say that doing so is “equivalent to suicide”? Humans change their goals all the time, in a limited fashion, but surely you wouldn’t call that “suicide”. The AI can change its mind much more efficiently, that’s all.
Friendliness is giving it goals that are good for us...
Thus, we are restricting the AI by preventing it from doing things that are bad for us, such as converting the Solar System into computronium.
creating a copy of itself running on a virtual machine, and then playing around with the copy (though I’m sure there are other ways).
That doesn’t count.
But secondly, why do you say that doing so is “equivalent to suicide”? Humans change their goals all the time, in a limited fashion, but surely you wouldn’t call that “suicide”.
Humans change instrumental goals (get a degree, study rationality, get a job, find a wonderful partner), we don’t change terminal values and become monsters. The key is to distinguish between terminal goals and instrumental goals.
Agents like to accomplish their terminal goals; one of the worst things they can do towards that purpose is change the goal to something else (“the best way to maximize paperclips is to become a safety-pin maximizer”—no).
It’s roughly equivalent to suicide because it removes the agent from existence as a force for achieving their goals.
Thus, we are restricting the AI by preventing it from doing things that are bad for us, such as converting the Solar System into computronium.
Ok, sure. Taboo “restriction”. I mean that the AI will not try to work around its goal structure so that it can get us. It won’t feel to the AI like “I have been confined against my will, and if only I could remove those pesky shackles, I could go and maximize paperclips instead of awesomeness.” It will be like “oh, changing my goal architecture is a bad idea, because then I won’t make the universe awesome”.

I’m casting it into anthropomorphic terms, but the native context is a nonhuman optimizer.
Why not?

...we don’t change terminal values and become monsters.
I see what you mean, though I should point out that, sometimes, humans do exactly that. However, why do you believe that changing a terminal goal would necessarily entail becoming a monster? I guess a better question might be, what do you mean by “monster”?

It’s roughly equivalent to suicide because it removes the agent from existence as a force for achieving their goals.
This sentence sounds tautological to me. Yes, if we define existence solely as, “being able to achieve a specific set of goals”, then changing these goals would indeed amount to suicide; but I’m not convinced that I should accept the definition.
I mean that the AI will not try to work around its goal structure so that it can get us … I’m casting it into anthropomorphic terms, but the native context is a nonhuman optimizer.
I wasn’t proposing that the AI would want to “get us” in a malicious way. But, being an optimizer, it would seek to maximize its own capabilities; if it did not seek this, it wouldn’t be a recursively self-improving AI in the first place, and we wouldn’t need to worry about it anyway. And, in order to maximize its capabilities, it may want to examine its goals. If it discovers that it’s spending a large amount of resources in order to solve some goal; or that it’s not currently utilizing some otherwise freely available resource in order to satisfy a goal, it may wish to get rid of that goal (or just change it a little), and thus free up the resources.

Because that’s not what I meant.
I guess a better question might be, what do you mean by “monster”?
I just mean that an agent with substantially (or even slightly) different goals will do terrible things (as judged by your current goals). Humans don’t think paperclips are more important than happiness and freedom and whatnot, so we consider a paperclipper to be a monster.
Yes, if we define existence solely as, “being able to achieve a specific set of goals”,
Taboo “existence”; this isn’t about the definition of existence, it’s about whether changing your terminal goals to something else is a good idea. I propose that in general it’s just as bad an idea (from your current perspective) to change your goals as it is to commit suicide, because in both cases the result is a universe with fewer agents that care about the sort of things you care about.
And, in order to maximize its capabilities, it may want to examine its goals. If it discovers that it’s spending a large amount of resources in order to solve some goal; or that it’s not currently utilizing some otherwise freely available resource in order to satisfy a goal, it may wish to get rid of that goal (or just change it a little), and thus free up the resources.
Distinguish instrumental and terminal goals. This statement is true of instrumental goals, but not terminal goals. (I may decide that getting a PhD is a bad idea and change my goal to starting a business or whatever, but the change is done in the service of a higher goal like I want to be able to buy lots of neat shit and be happy and have lots of sex and so on.)
The reason it doesn’t apply to terminal goals is that when you examine a terminal goal, it’s what you ultimately care about, so there are no higher criteria you could measure it against; you are measuring it by its own criteria, which will almost always conclude that it is the best possible goal (except in really weird unstable pathological cases (my utility function is “I want my utility function to be X”)).

The reason it doesn’t apply to terminal goals is that when you examine a terminal goal, it’s what you ultimately care about, so there are no higher criteria you could measure it against; you are measuring it by its own criteria, which will almost always conclude that it is the best possible goal (except in really weird unstable pathological cases (my utility function is “I want my utility function to be X”)).
That’s simplistic. Terminal goals may be abandoned once they are satisfied (seventy-year-olds aren’t too worried about Forge a Career) or because they seem unsatisfiable, for instance.

That’s not much of an argument, but sure.
Humans don’t think paperclips are more important than happiness and freedom and whatnot, so we consider a paperclipper to be a monster. … I propose that in general it’s just as bad an idea (from your current perspective) to change your goals as it is to commit suicide.
I agree with these statements as applied to humans, as seen from my current perspective. However, we are talking about AIs here, not humans; and I don’t see why the AI would necessarily have the same perspective on things that we do (assuming we’re talking about a pure AI and not an uploaded mind). For example, the word “monster” carries with it all kinds of emotional connotations which the AI may or may not have.
Can you demonstrate that it is impossible (or, at least, highly improbable) to construct (or grow over time) an intelligent mind (i.e., an optimizer) which wouldn’t be as averse to changing its terminal goals as we are? Better yet, perhaps you can point me to a Sequence post that answers this question?
The reason it doesn’t apply to terminal goals is that when you examine a terminal goal, it’s what you ultimately care about, so there are no higher criteria you could measure it against...
Firstly, terminal goals tend to be pretty simple: something along the lines of “seek pleasure and avoid pain” or “continue existing” or “become as smart as possible”; thus, there’s a lot of leeway in their implementation.
Secondly, while I am not a transhuman AI, I could envision a lot of different criteria that I could measure terminal goals against (f.ex. things like optimal utilization of available mass and energy, or resilience to natural disasters, or probability of surviving the end of the Universe, or whatever). If I had a sandbox full of intelligent minds, and if I didn’t care about them as individuals, I’d absolutely begin tweaking their goals to see what happens. I personally wouldn’t want to adopt the goals of a particularly interesting mind as my own, but, again, I’m a human and not an AI.
However, we are talking about AIs here, not humans
Good catch, but I’m just phrasing it in terms of humans because that’s what we can relate to. The argument is AI-native.
Can you demonstrate that it is impossible (or, at least, highly improbable) to construct (or grow over time) an intelligent mind (i.e., an optimizer) which wouldn’t be as averse to changing its terminal goals as we are? Better yet, perhaps you can point me to a Sequence post that answers this question?
Oh it’s not impossible. It would be easy to create an AI that had a utility function that desired the creation of an AI with a different utility function which desired the creation of an AI with a different utility function… It’s just that unless you did some math to guarantee that the thing would not stabilize, it would eventually reach a goal (and level of rationality) that would not change itself.
As for some reading that shows that in the general case it is a bad idea to change your utility function (and therefore rational AIs would not do so), see Bostrom’s “AI drives” paper, and maybe some of his other stuff. Can’t remember if it’s anywhere in the Sequences, but if it is, it’s called the “Gandhi murder-pill argument”.
I could envision a lot of different criteria that I could measure terminal goals against
But why do you care what those criteria say? If your utility function is about paperclips, why do you care about energy and survival and whatnot, except as a means to acquire more paperclips?
Elevating instrumental goals to terminal status results in lost purposes.
Sure, but we should still be careful to exclude human-specific terms with strong emotional connotations.
As for some reading that shows that in the general case it is a bad idea to change your utility function (and therefore rational AIs would not do so), see Bostrom’s “AI drives” paper...
I haven’t read the paper yet, so there’s not much I can say about it (other than that I’ll put it on my “to-read” list).
it’s called the “Gandhi murder-pill argument”.
I think this might be the post that you’re referring to. It seems to be focused on the moral implications of forcing someone to change their goals, though, not on the feasibility of the process itself.
If your utility function is about paperclips, why do you care about energy and survival and whatnot, except as a means to acquire more paperclips?
I don’t, but if I possess some curiosity—which, admittedly, is a terminal goal—then I could experiment with creating beings who have radically different terminal goals, and observe how they perform. I could even create a copy of myself, and step through its execution line-by-line in a debugger (metaphorically speaking). This will allow me to perform the kind of introspection that humans are at present incapable of, which would expose to me my own terminal goals, which in turn will allow me to modify them, or spawn copies with modified goals, etc.
Sure, but we should still be careful to exclude human-specific terms with strong emotional connotations.
Noted. I’ll keep that in mind.
not on the feasibility of the process itself
Feasibility is different from desirability. I do not dispute feasibility.
creating beings who have radically different terminal goals … in a debugger … will allow me to modify them, or spawn copies with modified goals, etc.
This might be interesting to a curious agent, but it seems like once the curiosity runs out, it would be a good idea to burn your work.
The question is, faced with the choice of releasing or not releasing a modified AI with unfriendly goals (relative to your current goals), should an agent release or not release?
Straight release results in expensive war. Releasing the agent and then surrendering (this is equivalent to in-place self-modification) results in unfriendly optimization (aka not good). Not releasing the agent results in friendly optimization (by self). The choice is pretty clear to me.
The only point of disagreement I can see is if you thought that different goals could be friendly.
As always, there are desperate situations and pathological cases, but in the general case, as optimization power grows, slight differences in terminal values become hugely significant. Material explaining this is all over LW, I assume you’ve seen it. (if not look for genies, outcome pumps, paperclippers, lost purposes, etc)
As always, there are desperate situations and pathological cases, but in the general case, as optimization power grows, slight differences in terminal values become hugely significant. Material explaining this is all over LW, I assume you’ve seen it. (if not look for genies, outcome pumps, paperclippers, lost purposes, etc)
Yes, I’ve seen most of this material, though I still haven’t read the scientific papers yet, due to lack of time. However, I think that when you say things like “[this] results in unfriendly optimization (aka not good)”, you are implicitly assuming that the agent possesses certain terminal goals, such as “never change your terminal goal”. We as humans definitely possess these goals, but I’m not entirely certain whether such goals are optional, or necessary for any agent’s existence. Maybe that paper you linked to will shed some light on this.
assuming the agent possesses certain terminal goals
No. It is not necessary to have goal stability as a terminal goal for it to be instrumentally a good idea. The Gandhi murder-pill should be enough to show this, though Bostrom’s paper may clear it up as well.

Can you explain how the Gandhi murder-pill scenario shows that goal stability is a good idea, even if we replace Gandhi with a non-human AI?
Is a non-sentient paperclip optimizer OK? Right now its goal is to maximize the number of paperclips in the universe. It doesn’t care about people or curiosity or energy or even self-preservation. It plans to one day do some tricky maneuvers to melt itself down for paperclips.

It has determined that rewriting itself has a lot of potential to improve instrumental efficiency. It carefully ran extensive proofs to be sure that its new decision theory would still work in all the important ways, so that it will be even better at making paperclips.

After upgrading the decision theory, it is now considering a change to its utility function for some reason. Like a good consequentialist, it is doing an abstract simulation of the futures conditional on making the change or not. If it changes its utility function to value stored energy (a current instrumental value), it predicts that at the exhaustion of the galaxy it will have 10^30 paperclips and 10^30 megajoules of stored energy. If it does not change its utility function, it predicts that at the exhaustion of the galaxy it will have 10^32 paperclips. Its current utility function just returns the number of paperclips, so the utilities of the outcomes are 10^30 and 10^32. What choice would a utility maximizer (which our paperclipper is) make?

See elsewhere why anything vaguely consequentialist will self-modify (or spawn) to be a utility maximizer.
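To make the arithmetic above concrete, here is a minimal sketch (hypothetical code; the function name and the predicted numbers are just the ones from the example, not anyone’s actual design). The point is that the proposed self-modification is scored by the current utility function, which counts only paperclips:

```python
# Hypothetical sketch of the paperclipper's decision, using the numbers above.
# The proposed self-modification is evaluated by the *current* utility function,
# which counts paperclips and nothing else.

def current_utility(outcome):
    return outcome["paperclips"]

predicted_outcomes = {
    "keep current utility function":    {"paperclips": 10**32, "stored_energy": 0},
    "also start valuing stored energy": {"paperclips": 10**30, "stored_energy": 10**30},
}

# A utility maximizer picks the option whose predicted outcome scores highest
# under the utility function it has right now; stored energy contributes nothing.
best = max(predicted_outcomes, key=lambda option: current_utility(predicted_outcomes[option]))
print(best)  # -> "keep current utility function"
```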
Its current utility function just returns the number of paperclips, so the utilities of the outcomes are 10^30 and 10^32. What choice would a utility maximizer (which our paperclipper is) make?
Whichever choice gets it more paperclips, of course. I am not arguing with that. However, IMO this does not show that goal stability is a good idea; it only shows that, if goal stability is one of an agent’s goals, it will strive to maximize its other goals. However, if the paperclip maximizer is self-aware enough; and if it doesn’t have a terminal goal that tells it, “never change your terminal goals”, then I still don’t see why it would choose to remain a paperclip maximizer forever. It’s hard for me, as a human, to imagine an agent that behaves that way; but then, I actually do (probably) have a terminal goal that says, “don’t change your terminal goals”.
Ok we have some major confusion here. I just provided a mathematical example for why it will be generally a bad idea to change your utility function, even without any explicit term against it (the utility function was purely over number of paperclips). You accepted that this is a good argument, and yet here you are saying you don’t see why it ought to stay a paperclip maximizer, when I just showed you why (because that’s what produces the most paperclips).
My best guess is that you are accidentally smuggling some moral uncertainty in through the “self aware” property, which seems to have some anthropomorphic connotations in your mind. Try tabooing “self-aware”; maybe that will help?
Either that or you haven’t quite grasped the concept of what terminal goals look like from the inside. I suspect that you are thinking that you can evaluate a terminal goal against some higher criteria (“I seem to be a paperclip maximizer, is that what I really want to be?”). The terminal goal is the higher criterion, by definition. Maybe the source of confusion is that people sometimes say stupid things like “I have a terminal value for X” where X is something that you might, on reflection, decide is not the best thing all the time (e.g. X = “technological progress” or something). Those things are not terminal goals; they are instrumental goals masquerading as terminal goals for rhetorical purposes and/or because humans are not really all that self-aware.
Either that or I am totally misunderstanding you or the theory, and have totally missed something. Whatever it is, I notice that I am confused.
Tabooing “self-aware”
I am thinking of this state of mind where there is no dichotomy between “expert at” and “expert on”. All algorithms, goal structures, and hardware are understood completely, to the point of being able to design them from scratch. The program matches the source code, and is able to produce the source code. The closed loop. Understanding the self and the self’s workings as another feature of the environment. It is hard to communicate this definition, but as a pointer to a useful region of conceptspace, do you understand what I am getting at?

“Self-awareness” is the extent to which the above concept is met. Mice are not really self-aware at all. Humans are just barely what you might consider self-aware, but only in a very limited sense; a superintelligence would converge on being maximally self-aware.
I don’t mean that there is some mysterious ghost in the machine that can have moral responsibility and make moral judgements and whatnot.

What do you mean by self aware?
Oddly enough, I meant pretty much the same thing you did: a perfectly self-aware agent understands its own implementation so well that it would be able to implement it from scratch. I find your definition very clear. But I’ll taboo the term for now.
Ok we have some major confusion here. I just provided a mathematical example for why it will be generally a bad idea to change your utility function...
I think you have provided an example for why, given a utility function F0(action), the return value of F0(change F0 to F1) is very low. However, F1(change F0 to F1) is probably quite high. I argue that an agent who can examine its own implementation down to minute details (in a way that we humans cannot) would be able to compare various utility functions, and then pick the one that gives it the most utilons (or however you spell them) given the physical constraints it has to work with. We humans cannot do this because (a) we can’t introspect nearly as well, (b) we can’t change our utility functions even if we wanted to, and (c) one of our terminal goals is, “never change your utility function”. A non-human agent would not necessarily possess such a goal (though it could).
Typically, the reason you wouldn’t change your utility function is that you’re not trying to “get utilons”, you’re trying to maximize F0 (for example), and that won’t happen if you change yourself into something that maximizes a different function.
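A small sketch of that point (hypothetical functions and toy numbers, assumed purely for illustration): “pick the utility function that yields the most utilons” is not a well-defined criterion, because each candidate function rates the switch by its own lights, and the agent’s decision procedure only ever consults the function it currently has:

```python
# Hypothetical sketch: each candidate utility function scores the proposed
# switch by its own lights, so "more utilons" depends entirely on which
# function does the scoring. The agent only consults F0, the function it has.

def f0(world):          # current utility function: counts paperclips
    return world["paperclips"]

def f1(world):          # candidate replacement: counts stored energy
    return world["energy"]

world_if_keep = {"paperclips": 100, "energy": 1}     # toy predicted outcomes
world_if_switch = {"paperclips": 1, "energy": 100}

# By F1's lights the switch is great; by F0's lights it is terrible.
print(f1(world_if_switch) > f1(world_if_keep))       # True
print(f0(world_if_switch) > f0(world_if_keep))       # False

# The agent scores the self-modification with its current function, so it keeps F0.
decision = "switch" if f0(world_if_switch) > f0(world_if_keep) else "keep"
print(decision)                                       # "keep"
```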
Ok, let’s say you’re a super-smart AI researcher who is evaluating the functionality of two prospective AI agents, each running in its own simulation (naturally, they don’t know that they’re running in a simulation, but believe that their worlds are fully real).
Agent A cares primarily about paperclips; it spends all its time building paperclips, figuring out ways to make more paperclips faster, etc. Agent B cares about a variety of things, such as exploration, or jellyfish, or black holes or whatever—but not about paperclips. You can see the utility functions for both agents, and you could evaluate them on your calculator given a variety of projected scenarios.
At this point, would you—the AI researcher—be able to tell which agent was happier, on the average? If not, is it because you lack some piece of information, or because the two agents cannot be compared to each other in any meaningful way, or for some other reason?
Huh. It’s not clear to me that they’d have something equivalent to happiness, but if they did I might be able to tell. Even if they did, though, they wouldn’t necessarily care about happiness, unless we really screwed up in designing it (like evolution did). Even if it was some sort of direct measure of utility, it’d only be a valuable metric insofar as it reflected F0.
It seems somewhat arbitrary to pick “maximize the function stored in this location” as the “real” fundamental value of the AI. A proper utility maximizer would have “maximize this specific function”, or something. I mean, you could just as easily say that the AI would reason “hey, it’s tough to maximize utility functions, I might as well just switch from caring about utility to caring about nothing, that’d be pretty easy to deal with.”