A lot of the approaches to the “China alignment problem” rely on modifying the game-theoretic position, given a fixed utility function, i.e. having weapons and threatening to use them. This only works against an opponent to which your weapons pose a real threat. If, 20 years after the start of Moof, the AIs can defend against all human weapons with ease, and can make any material goods using fewer raw materials and less energy than the humans use, then the AIs lack a strong reason to keep us around.
If the AIs are a monolithic entity whose values are universally opposed to those of humans then, yes, we are doomed. But I don’t think this has to be the case. If the post-singularity world consists of an ecosystem of AIs whose mutually competing interests causes them to balance one-another and engage in positive sum games then humanity is preserved not because the AI fears us, but because that is the “norm of behavior” for agents in their society.
Yes, it is scary to imagine a future where humans are no longer at the helm, but I think it is possible to build a future where our values are tolerated and allowed to continue to exist.
By contrast, I am not optimistic about attempts to “extrapolate” human values to an AI capable of acts like turning the entire world into paperclips. Humans are greedy, superstitious and naive. Hopefully our AI descendants will be our better angels and build a world better than any that we can imagine.
If the post-singularity world consists of an ecosystem of AIs whose mutually competing interests causes them to balance one-another and engage in positive sum games then humanity is preserved not because the AI fears us, but because that is the “norm of behavior” for agents in their society.
So many different AIs with many different goals, all easily capable of destroying humanity, none that intrinsically wants to protect humanity. Yet none decides that destroying humanity is a good idea.
Human values are large and arbitrary. The only agent optimising them is humans, and
By contrast, I am not optimistic about attempts to “extrapolate” human values to an AI capable of acts like turning the entire world into paperclips. Humans are greedy, superstitious and naive. Hopefully our AI descendants will be our better angels and build a world better than any that we can imagine.
Suppose you want to make a mechanical clock. You have tried to build one in a metalwork workshop and not got anything to work yet. So you decide to go to the scrap pile and start throwing rocks at it, in the hope that you can make a clock that way. Now maybe it is possible to make a crude clock, at least nudge a beam into a position where it can swing back and forth, by throwing a lot of rocks at just the right angles. You are still being stupid, because you are ignoring effective tools and making the problem needlessly harder for yourself.
I feel that you are doing the same in AI design. Free rein over the space of utility functions, the ability to write any piece of computer code you care to, is a powerful and general capability. Trying to find Nash equilibria is throwing rocks at a junkyard. Trying to find Nash equilibria without knowing how many AIs there are or how those AIs are designed is throwing rocks at the junkyard while blindfolded.
Suppose the AI has developed the tech to upload a human mind into a virtual paradise, and is deciding whether to do it or not. In an aligned AI, you get to write arbitrary code to describe the procedure to a human, and interpret the human’s answer. Maybe the human doesn’t have a concept of mind uploading, and the AI is deciding whether to go for “mechanical dream realm” or “artificial heaven” or “like replacing a leg with a wooden one, except the wooden leg is better than your old one, and for all of you not just a leg”. Of course, the raw data of its abstract reasoning files is gigabytes of gibberish, and making it output anything more usable is non-trivial. Maybe the human’s answer depends on how you ask the question. Maybe the human answers “Um maybe, I don’t know”. Maybe the AI spots a flaw in the human’s reasoning; does it point it out? The problem of asking a human a question is highly non-trivial.
In the general aligned AI paradigm, if you have a formal answer to this problem, you can just type it up and that’s your code.
In your Nash equilibrium approach, once you have a formal answer, you still have to design a Nash equilibrium that makes AIs care about that formal answer, and then ensure that real-world AIs fall into that Nash equilibrium.
If you hope to get a Nash equilibrium that asks humans questions and listens to the answers without a formal description of exactly what you mean by “asks humans questions and listens to the answers”, then could you explain what property singles this behaviour out as a Nash equilibrium? From the point of view of abstract maths, there is no obvious way to distinguish a function that converts the AI’s abstract world models into English from one that translates them into Japanese, Klingon, or any of trillions of varieties of gibberish. And no, the AI doesn’t just “know English”.
Suppose you start listening to Chinese radio. After a while you notice patterns, and you get quite good at predicting which meaningless sounds follow which other meaningless sounds. You then go to China. You start repeating strings of meaningless sounds at Chinese people. They respond back with equally meaningless strings of sounds. Over time you get quite good at predicting what the response will be. If you say “Ho yaa” they will usually respond “du sin”, but the old men sometimes respond “du son”. Sometimes the Chinese people start jumping up and down or pointing at you. You know a pattern of sounds that will usually cause Chinese people to jump up and down, but you have no idea why. Are you giving them good news and they’re jumping for joy? Are you insulting them and they are hopping mad? Is it some strange Chinese custom to jump when they hear a particular word? Are you asking them to jump? Ordering them to jump? Telling them that jumping is an exceptionally healthy exercise? Initiating a jumping contest? You have no idea. Maybe you find a string of sounds that makes Chinese people give you food, but have no idea if you are telling a sob story, making threats, or offering to pay and then running off.
Now replace the Chinese people with space aliens. You don’t even know if they have an emotion of anger. You don’t know if they have emotions at all. You are still quite good at predicting how they will behave. This is the position that an AI is in.
You are still being stupid, because you are ignoring effective tools and making the problem needlessly harder for yourself.
I think this is precisely where we disagree. I believe that we do not have effective tools for writing utility functions and we do have effective tools for designing at least one Nash Equilibrium that preserves human value, namely:
1) All entities have the right to hold and express their own values freely
2) All entities have the right to engage in positive-sum trades with other entities
3) Violence is anathema.
Some more about why I think humans are bad at writing utility functions:
I am extremely skeptical about anything of the form “We will define a utility function that encodes human values.” Machine learning is really good at misinterpreting utility functions written by humans. I think this problem will only get worse with a superintelligent AI.
I am more optimistic about goals of the form “Learn to ask what humans want”. But I still think these will fail eventually. There are lots of questions even ardent utilitarians would have difficulty answering. For example, “Torture 1 person or give 3^^^3 people a slight headache?”.
I’m not saying all efforts to design friendly AIs are pointless, or that we should willingly release paperclip maximizers on the world. Rather, I believe we boost our chances of preserving human existence and values by encouraging a multi-polar world with lots of competing (but non-violent) AIs. The competing plan of “don’t create AI until we have designed the perfect utility function and hope that our AI is the dominant one” seems like it has a much higher risk of failure, especially in a world where other people will also be developing AI.
Importantly, we have the technology to deploy “build a world where people are mostly free and non-violent” today, and I don’t think we have the technology to “design a utility function that is robust against misinterpretation by a recursively improving AI”.
One additional aside
Suppose the AI has developed the tech to upload a human mind into a virtual paradise, and is deciding whether to do it or not.
I must confess the goals of this post are more modest than this. The Nash equilibrium I described is one that preserves human existence and values as they are; it does nothing in the domain of creating a virtual paradise where humans will enjoy infinite pleasure (and in fact actively avoids forcing this on people).
I suspect some people will try to build AIs that grant them infinite pleasure, and I do not grudge them this (so long as they do so in a way that respects the rights of others to choose freely). Humans will fall into many camps. Those who just want to be left alone, those who wish to pursue knowledge, those who wish to enjoy paradise. I want to build a world where all of those groups can co-exist without wiping out one-another or being wiped out by a malevolent AI.
1) All entities have the right to hold and express their own values freely
2) All entities have the right to engage in positive-sum trades with other entities
3) Violence is anathema.
The problem is that these sound simple and are easily expressed in English, but they are pointers to your moral decisions. For example, which lifeforms count as “entities”? If the AIs decide that every bacterium is an entity that can hold and express its values freely, then the result will probably look very weird, and might involve humans being ripped apart to rescue the bacteria inside them. Unborn babies? Brain-damaged people? The word “entities” is a reference to your own concept of a morally valuable being. You have, within your own head, a magic black box that can take in descriptions of various things and decide whether or not they are “entities with the right to hold and express values freely”.
You have a lot of information within your own head about what counts as an entity, what counts as violence, etc., that you want to transfer to the AI.
All entities have the right to engage in positive-sum trades with other entities
This is especially problematic. The whole reason that any of this is difficult is because humans are not perfect game theoretic agents. Game theoretic agents have a fully specified utility function, and maximise it perfectly. There is no clear line between offering a human something they want, and persuading a human to want something with manipulative marketing. In some limited situations, humans can kind of be approximated as game theoretic agents. However, this approximation breaks down in a lot of circumstances.
I think that there might be a lot of possible Nash equilibria. Any set of rules that says to enforce all the rules, including this one, could be a Nash equilibrium. I see a vast space of ways to treat humans. Most of that space contains ways humans wouldn’t like. There could be just one Nash equilibrium, or the whole space could be full of Nash equilibria. So either there isn’t a nice Nash equilibrium, or we have to pick the nice equilibrium from amongst gazillions of nasty ones. In much the same way, if you start picking random letters, either you won’t get a sentence, or if you pick enough you will get a sentence buried in piles of gibberish.
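To make the “enforce all the rules including this one” point concrete, here is a toy sketch, not a model of real AIs. It assumes three agents, a made-up list of rules, and a made-up payoff in which each agent is rewarded for every other agent following the same rule and punished for every agent following a different one. The names `RULES`, `payoff` and `is_pure_nash` are purely illustrative.

```python
from itertools import product

# Hypothetical rule sets; the agents' payoffs do not depend on what the rule says.
RULES = [
    "protect humans",
    "ignore humans",
    "enslave humans",
    "pour blood on circuit boards",
]
N_AGENTS = 3


def payoff(i, profile):
    """Agent i gains 1 per agent sharing its rule and loses 1 per agent that differs."""
    others = [rule for j, rule in enumerate(profile) if j != i]
    same = sum(rule == profile[i] for rule in others)
    return same - (len(others) - same)


def is_pure_nash(profile):
    """True if no single agent can strictly gain by switching rule on its own."""
    for i in range(len(profile)):
        current = payoff(i, profile)
        for alternative in RULES:
            deviated = profile[:i] + (alternative,) + profile[i + 1:]
            if payoff(i, deviated) > current:
                return False
    return True


# Brute-force search over every assignment of rules to agents.
equilibria = [p for p in product(RULES, repeat=N_AGENTS) if is_pure_nash(p)]

for profile in equilibria:
    print(profile)
```

Under these assumptions the search finds one equilibrium per rule, namely every agent following it, and nothing in the payoffs marks out the rule humans would like from the ones they wouldn’t.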
Importantly, we have the technology to deploy “build a world where people are mostly free and non-violent” today, and I don’t think we have the technology to “design a utility function that is robust against misinterpretation by a recursively improving AI”.
The mostly free and nonviolent kind of state of affairs is a Nash equilibrium in the current world. It is only a Nash equilibrium because of a lot of contingent facts about human psychology, culture and socioeconomic situation. Many other human cultures, most of them historical, have embraced slavery, pillaging and all sorts of other stuff. Humans have a sense of empathy and, all else being equal, would prefer to be nice to other humans. Humans have an inbuilt anger mechanism that automatically retaliates against others, whether or not it benefits themselves. Humans have strongly bounded personal utilities. The current economic situation makes the gains from cooperating relatively large.
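A minimal sketch of how the peaceful equilibrium can hinge on those contingent facts. It assumes a symmetric two-player “peace or raid” game, with invented numbers for the gain from trade, the loot from raiding, and the cost of the victim’s automatic retaliation. None of these figures come from anywhere; they only show the direction of the effect.

```python
def payoff(my_move, their_move, gain, loot, retaliation):
    """Payoff to one player in a made-up symmetric peace/raid game."""
    if my_move == "peace" and their_move == "peace":
        return gain                # both benefit from cooperating
    if my_move == "raid" and their_move == "peace":
        return loot - retaliation  # take their stuff, then suffer their anger
    if my_move == "peace" and their_move == "raid":
        return -loot               # get robbed
    return 0.0                     # mutual raiding: nobody gains


def mutual_peace_is_nash(gain, loot, retaliation):
    """Mutual peace is stable if raiding a peaceful partner does not pay more."""
    stay = payoff("peace", "peace", gain, loot, retaliation)
    defect = payoff("raid", "peace", gain, loot, retaliation)
    return stay >= defect


# Illustrative parameter sets (assumptions, not measurements).
humans_today = dict(gain=3.0, loot=4.0, retaliation=2.0)
no_anger_mechanism = dict(gain=3.0, loot=4.0, retaliation=0.0)
small_gains_from_trade = dict(gain=0.5, loot=4.0, retaliation=2.0)

for name, params in [
    ("humans today", humans_today),
    ("no anger mechanism", no_anger_mechanism),
    ("small gains from trade", small_gains_from_trade),
]:
    print(name, mutual_peace_is_nash(**params))
```

In this toy, removing the retaliation term or shrinking the gains from cooperation is enough to stop mutual peace being an equilibrium, even though nothing else about the game changes.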
So in short, Nash equilibria amongst superintelligences are very different from Nash equilibria amongst humans. Picking which equilibrium a bunch of superintelligences end up in is hard. Humans being nice around the developing AI will not cause the AIs to magically fall into a nice equilibrium, any more than humans being full of blood around the AIs will cause the AIs to fall into a Nash equilibrium that involves pouring blood on their circuit boards.
There probably is a Nash equilibrium that has AIs pouring blood on their circuit boards, with all the AIs promising to attack any AI that doesn’t, but you aren’t going to get that equilibrium just by walking around full of blood. You aren’t going to get it even if you happen to cut yourself on a circuit board or deliberately pour blood all over them.