An excerpt from The Wise Man’s Fear, by Patrick Rothfuss. Boxing is not safe.
The innkeeper looked up. “I have to admit I don’t see the trouble,” he said apologetically. “I’ve seen monsters, Bast. The Cthaeh falls short of that.”
“That was the wrong word for me to use, Reshi,” Bast admitted. “But I can’t think of a better one. If there was a word that meant poisonous and hateful and contagious, I’d use that.”
Bast drew a deep breath and leaned forward in his chair. “Reshi, the Cthaeh can see the future. Not in some vague, oracular way. It sees all the future. Clearly. Perfectly. Everything that can possibly come to pass, branching out endlessly from the current moment.”
Kvothe raised an eyebrow. “It can, can it?”
“It can,” Bast said gravely. “And it is purely, perfectly malicious. This isn’t a problem for the most part, as it can’t leave the tree. But when someone comes to visit...”
Kvothe’s eyes went distant as he nodded to himself. “If it knows the future perfectly,” he said slowly, “then it must know exactly how a person will react to anything it says.”
Bast nodded. “And it is vicious, Reshi.”
Kvothe continued in a musing tone. “That means anyone influenced by the Cthaeh would be like an arrow shot into the future.”
“An arrow only hits one person, Reshi.” Bast’s dark eyes were hollow and hopeless. “Anyone influenced by the Cthaeh is like a plague ship sailing for a harbor.” Bast pointed at the half-filled sheet Chronicler held in his lap. “If the Sithe knew that existed, they would spare no effort to destroy it. They would kill us for having heard what the Cthaeh said.”
“Because anything carrying the Cthaeh’s influence away from the tree...” Kvothe said, looking down at his hands. He sat silently for a long moment, nodding thoughtfully. “So a young man seeking his fortune goes to the Cthaeh and takes away a flower. The daughter of the king is deathly ill, and he takes the flower to heal her. They fall in love despite the fact that she’s betrothed to the neighboring prince...”
Bast stared at Kvothe, watching blankly as he spoke.
“They attempt a daring moonlight escape,” Kvothe continued. “But he falls from the rooftops and they’re caught. The princess is married against her will and stabs the neighboring prince on their wedding night. The prince dies. Civil war. Fields burned and salted. Famine. Plague...”
“That’s the story of the Fastingsway War,” Bast said faintly.
Hah, I actually quoted much of that same passage on IRC in the same boxing vein! Although as presented the scenario does have some problems:
00:23 < Ralith> that was depressing as fuck
00:24 <@gwern> kind of a magical UFAI, although a LWer would naturally ask why it hasn’t managed to free itself
00:24 < Ralith> gwern: gods, probably
00:24 <@gwern> Ralith: well, in this universe, gods seem killable
00:24 <@gwern> Ralith: so it doesn’t actually resolve the question of how it remains boxed
00:24 < Ralith> gwern: sure, but they’re probably more powerful
00:25 < Ralith> the real question is why isn’t whatever entity is powerful enough to keep it in place also keeping people away from it
00:25 <@gwern> Ralith: well, the only guards listed are faeries, and among the feats attributed to it is starting a war between the mortal and faerie folk, so...
00:26 < Ralith> a faerie is the one who that info came from, yes?
00:26 < Ralith> hardly an objective source
00:26 <@gwern> Ralith: and I would think a faerie reporting that faerie guard it increases credence
00:27 < Ralith> that only faerie guard it?
00:27 <@gwern> Ralith: well, Bast mentions no other guards
00:27 < Ralith> :P
00:28 < Ralith> anything capable of keeping it in that tree should be capable of keeping people away from it
00:28 < Ralith> since the faeries are presumably trying to do both, they can’t be the responsible party.
00:29 <@gwern> who said anything was keeping it in the tree?
00:29 < Ralith> gwern: I did
It is conceivable that there is no (near enough) future where Cthaeh is freed, thus it is powerless to affect its own fate, or is waiting for the right circumstances.
That seemed a little unlikely to me, though. As presented in the book, a minimum of many millennia have passed since the Cthaeh began operating, and possibly millions of years (in some frames of reference). It’s had enough power to set planes of existence at war with each other and apparently cause the death of gods. I can’t help but feel that it’s implausible that in all that time, not one forking path led to its freedom. Much more plausible that it’s somehow inherently trapped in or bound to the tree, so there’s no meaningful way in which it could escape (which breaks the analogy to a UFAI).
Not by my reading. In your comment, you gave 3 possible explanations, 2 of which are the same (it gets freed, but a long time from ‘now’) and the third a restriction on its foresight which is otherwise arbitrary (‘powerless to affect its own fate’). Neither of these translates to ‘there is no such thing as freedom for it to obtain’.
I’ve come up with what I believe to be an entirely new approach to boxing, essentially merging boxing with FAI theory. I wrote a couple of thoughts down about it, but lost my notes, and I also don’t have much time to write this comment, so forgive me if it’s vague or not extremely well reasoned. I also had a couple of tangential thoughts; if I remember them in the course of writing this, or recover my notes later, then I’ll put them here as well.
The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI could still manipulate people in an attempt to use them to impact the outside world, so the AI wouldn’t be totally boxed, but I’m inclined to think that we could maintain a very high degree of control over the AI, since the only powers it could ever have would be through communication with us.
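The proposal can be made concrete with a toy sketch (every name here is hypothetical, and a real agent’s action space would be nothing like this legible): the utility function assigns negative infinity to any action that isn’t a message on the text channel, so a maximizer never selects one, however instrumentally tempting.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "text" = talk over the channel; anything else leaves the box
    payload: str

def boxed_utility(action: Action, base_utility: float) -> float:
    """Toy utility function: actions outside the text channel are
    infinitely bad, so they lose to any in-channel action."""
    if action.kind != "text":
        return float("-inf")
    return base_utility

def choose(candidates):
    """Pick the (action, base_utility) pair the boxed utility ranks highest."""
    return max(candidates, key=lambda c: boxed_utility(c[0], c[1]))[0]

best = choose([
    (Action("text", "answer the question"), 1.0),
    (Action("network", "open an outbound socket"), 100.0),  # tempting, but forbidden
])
print(best.kind)  # prints "text"
```

Note what this does and doesn’t formalize: it constrains the medium, not the content, so a maximizer under this constraint is still free to pick whichever in-channel words have the most useful effects on the gatekeepers.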
The idea came because I recalled a discussion that occurred on TOR.com about why the Cthaeh was in the tree. The general conclusion was that either the Cthaeh was bound by extremely powerful forces, or that the Cthaeh wanted to be in the tree, perhaps because it was instrumentally useful to him. While I found that second explanation implausible in the context of Rothfussland, that discussion led me to realize that almost all approaches towards AI boxing have done so through the first branch of potential boxing solutions, that is, external constraints imposed on the AI, as opposed to the second branch, internal constraints that the AI imposes on itself because of its utility function.
This led me to think that we should test our capabilities with FAI systems by putting them in a box and giving them limited utility functions, ones that couldn’t possibly lead them to want to manipulate us. So, for example, we could put them in a box and give them a strong desire to stay in the box, along with a desire to tell the truth, and then we could ask it what would happen if we gave it additional desires. We could then test that by building a different AI with those desires and seeing what it did.
It also occurred to me that we would be able to functionally read the AI’s mind by looking at its code. If so, it seems like it’d be much harder for it to fool us than we previously thought. It could potentially just deceive us by manipulating its own thoughts, but I feel like we might be able to stop AIs from doing that somehow, or that the AI would choose not to manipulate its own thoughts that way because it would go insane, or something. It would certainly make the AI’s job much much much harder.
Then I thought that another safety component we should put into testing was a dramatically slowed thinking process for the AI. I’m not sure how exactly this would be done because I’m not very good or experienced with actual technical implementation of ideas, but my idea was to create a crude simulation where we controlled all data inputs and outputs, slow down all processes within the simulation by a factor of 1000, and then build the AI within the simulation. That probably wouldn’t work, given my sucky technical knowhow, but something similar might. Slowing down the AI, perhaps even giving ourselves the ability to pause it without letting it know what’s going on, combined with our ability to “read its mind” might dramatically improve the safeness of boxing.
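One way the slowed-down simulation could be sketched (a toy, single-threaded stand-in; `ThrottledSandbox` and all its parameters are invented for illustration): run the AI as a step function, sleep between steps to impose the slowdown, allow an externally toggled pause the agent can’t observe, and log every intermediate state for later inspection.

```python
import time

class ThrottledSandbox:
    """Toy sandbox: runs an agent one step at a time, sleeping between
    steps so the agent's subjective time passes far slower than ours.
    A pause flag halts it without any signal the agent can observe."""
    def __init__(self, agent_step, slowdown=1000, tick=0.0001):
        self.agent_step = agent_step        # callable: state -> next state
        self.delay = slowdown * tick        # wall-clock pause per agent step
        self.paused = False
        self.log = []                       # full trace, for "mind reading"

    def run(self, state, steps):
        for _ in range(steps):
            while self.paused:              # agent never executes while paused
                time.sleep(0.01)
            state = self.agent_step(state)
            self.log.append(state)          # record every intermediate state
            time.sleep(self.delay)          # throttle subjective time
        return state

# Demo with a trivial "agent" that just counts.
sandbox = ThrottledSandbox(agent_step=lambda s: s + 1, slowdown=10, tick=0.0001)
final = sandbox.run(0, steps=5)
print(final, sandbox.log)  # prints: 5 [1, 2, 3, 4, 5]
```

The log is the piece that matters for the “read its mind” idea: because the sandbox, not the agent, records the trace, inspection happens outside anything the agent’s utility function can touch.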
I’d also like to recommend that before we build a real FAI, we at least try to build a boxed one first, even with a low probability of success. It wouldn’t make things worse in the event that boxing failed, except that it might delay global happiness by a few hours, and in the event that the FAI program was broken we just might save ourselves from menaces to humankind like Clippy.
The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box.
How do you specify precisely what it means to “stay in the box”? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?
It would not count; we’d want to make the AI not want this almost-identical AI to exist. That seems possible; it would be like how I don’t want there to exist an identical copy of me except that it eats babies. There are lots of changes to my identity that would be slight, yet that I wouldn’t want to exist.
To be more precise, I’d say that it counts as going outside the box if it does anything except think or talk to the Gatekeepers through the text channel. It can use the text channel to manipulate the Gatekeepers to do things, but it can’t manipulate them to do things that allow it to do anything other than use the text channel. It would, in a certain sense, be partially deontologist, and be unwilling to do things directly other than text the Gatekeepers. How ironic. Lolz.
Also: how would it do this, anyway? It would have to convince the Gatekeepers to convince the scientists to do this, or teach them computer science, or tell them its code. And if the AI started teaching the Gatekeepers computer code or techniques to incapacitate scientists, we’d obviously be aware that something had gone wrong. And, in the system I’m envisioning, the Gatekeepers would be closely monitored by other groups of scientists and bodyguards, and the scientists would be guarded, and the Gatekeepers wouldn’t even have to know who specifically did what on the project.
It would, in a certain sense, be partially deontologist,
And that’s the problem. For in practice a partial deontologist-partial consequentialist will treat its deontological rules as obstacles to achieving what its consequentialist part wants, and route around them.
This is both a problem and a solution because it makes the AI weaker. A weaker AI would be good because it would allow us to more easily transition to safer versions of FAI than we would otherwise come up with independently. I think that delaying a FAI is obviously much better than unleashing a UFAI. My entire goal throughout this conversation has been to think of ways that would make hostile FAIs weaker, I don’t know why you think this is a relevant counter objection.
You assert that it will just route around the deontological rules; that’s nonsense and a completely unwarranted assumption, try to actually back up what you’re asserting with arguments. You’re wrong. It’s obviously possible to program things (eg people) such that they’ll refuse to do certain things no matter what the consequences (eg you wouldn’t murder trillions of babies to save billions of trillions of babies, because you’d go insane if you tried because your body has such strong empathy mechanisms and you inherently value babies a lot). This means that we wouldn’t give the AI unlimited control over its source code, of course; we’d make the part that told it to be a deontologist who likes text channels be unmodifiable. That specific drawback doesn’t jibe well with the aesthetic of a super powerful AI that’s master of itself and the universe, I suppose, but other than that I see no drawback. Trying to build things in line with that aesthetic actually might be a reason for some of the more dangerous proposals in AI; maybe we’re having too much fun playing God and not enough despair.
I’m a bit cranky in this comment because of the time sink that I’m dealing with to post these comments, sorry about that.
The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI could still manipulate people in an attempt to use them to impact the outside world
What it means for “the AI to be in the box” is generally that the AI’s impacts on the outside world are filtered through the informed consent of the human gatekeepers.
An AI that wants to not impact the outside world will shut itself down. An AI that wants to only impact the outside world in a way filtered through the informed consent of its gatekeepers is probably a full friendly AI, because it understands both its gatekeepers and the concept of informed consent. An AI that simply wants its ‘box’ to remain functional, but is free to impact the rest of the world, is like a brain that wants to stay within a skull- that is hardly a material limitation on the rest of its behavior!
I think you misunderstand what I mean by proposing that the AI wants to stay inside the box. I mean that the AI wouldn’t want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
I think you misunderstand what I mean by proposing that the AI wants to stay inside the box.
I agree that your and my understanding of the phrase “stay inside the box” differ. What I’m trying to do is point out that I don’t think your understanding carves reality at the joints. In order for the AI to stay inside the box, the box needs to be defined in machine-understandable terms, not human-inferrable terms.
I mean that the AI wouldn’t want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
Each half of this sentence has a deep problem. Wouldn’t correctly answering the questions of or otherwise improving the lives of the gatekeepers increase the AI’s power base, since the AI has the ability to communicate with the gatekeepers?
The problem with restrictions like “only be willing to talk” is a restriction on the medium but not the content. So, the AI has a text-only channel that goes just to the gatekeepers- but that doesn’t restrict the content of the messages the AI can send to the gatekeeper. The fictional Cthaeh only wants to talk to its gatekeepers- and yet it still manages to get done what it wants to get done. Words have impacts, and it should be anticipated that the AI picks words because of their impacts.
Sure, the AI can manipulate gatekeepers. But this is a major improvement. You miss my point.
The Cthaeh is very limited by being trapped in its tree and only able to talk to passerby. The UFAI would be limited by being trapped in its text only communication channel. It wouldn’t be able to do things like tell the gatekeepers to plug it into the Internet or to directly control an autonomous army of robots, it would be forced instead to use the gatekeepers as its appendages, and the gatekeepers have severe limitations on brain capacity and physical strength. I think that if we did this and kept careful watch on the gatekeepers and used some other safety measures, boxing an AI would become feasible.
Now, I’m still not advocating using boxed AI willy nilly, but I think using it as a cautionary measure once we think we’ve got FAI figured out would be a very very very good idea, because FAI seems ridiculously hard and there’s no way to know what we don’t know, so there might be unseen problems with whatever final FAI we build. There’s only a risk boxing the supposed FAI would help us, and with existential threats you should take advantage of whatever possibilities you can tilt towards favorable outcomes, even if the probability seems miniscule.
That’s not to say that I think the probability that this would stop a disaster is miniscule, of course (I think it’s rather high, in the event that the FAI was actually UFAI), but just to say that I think it would be a good idea even if my probability assessments are mostly wrong.
Over the AI being embodied in a giant death robot, sure. But limiting the AI to a text-only channel is an old idea, and what Eliezer used for the “let me out of the box” demonstrations.
One of the things to note is that the limitations on the AI don’t limit anything the AI creates. If the AI gives you a blueprint for a nanomachine that purports to cure cancer, but when you make it it turns out to actually be grey goo, then humans lose, even though the AI never left the box. (Note that a worst-case AI is very subtle- suppose the nanomachine actually does cure cancer when built correctly, but a one-in-a-trillion manufacturing error will make it grey goo. Since you’re making trillions of them, you then get grey goo, and humans lose.)
Trying to formally specify “don’t give us any plans that will build an AI” or “don’t give us any plans that will cause humans to lose” without silencing the AI completely is a genuinely difficult affair.
Basically, if you assume that the AI is tricky enough to circumvent any medium restrictions you place on it, then the only way to avoid “humans lose” is to have its goal be “humans win,” which is actually a pretty complicated goal. Expressing that goal in a machine-understandable way is pretty much the FAI problem.
The entire point of Eliezer’s demonstration was that if an AI wants to it can increase its power base even starting from a text only communication system. The entire point of my idea is that we can just build the AI such that it doesn’t want to leave the box or increase its power base. It dodges that entire problem, that’s the whole point.
You’ve gotten so used to being scared of boxed AI that you’re reflexively rejecting my idea, I think, because your above objection makes no sense at all and is obviously wrong upon a moment’s reflection. All of my bias-alarms have been going off since your second comment reply, please evaluate yourself and try to distance yourself from your previous beliefs, for the sake of humanity. Also, here is a kitten, unless you want it to die then please reevaluate: http://static.tumblr.com/6t3upxl/Aawm08w0l/khout-kitten-458882.jpeg
Limitations on the AI restrict the range of things that the AI can create. Yes, if we just built whatever the AI said to and the AI was unfriendly then we would lose. Obviously. Yes, if we assume that the UFAI is tricky enough to “circumvent any medium restrictions [we] place on it” then we would lose, practically by definition. But that assumption isn’t warranted. (These super weak strawmen were other indications to me that you might be being biased on this issue.)
I think a key component of our disagreement here might be that I’m assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent. You’re either assuming that the AI could deduce the inner workings of our facility and the world and the universe from those text messages, or that the AI had access to a bunch of information about the world already. I disagree with both assumptions; the AI’s direct perception could be severely limited and should be, and it isn’t magic, so it couldn’t deduce the inner workings of our economy or the nature of nuclear fusion just through deduction (because knowledge comes from experience and induction). (You might not be making either of those assumptions; this is a guess in an attempt to help resolve our disagreement more quickly, sorry if it’s wrong.)
Also, I’m envisioning a system where people that the AI doesn’t know and that the Gatekeepers don’t know about observe their communications. That omitted detail might be another reason for your disagreement, I just assumed it would be apparent for some stupid reason, my apologies.
I think we would have to be careful about what questions we asked the AI. But I see no reason why it could manipulate us automatically and inevitably, no matter what questions we asked it. I think extracting useful information from it would be possible, perhaps even easy. An AI in a box would not be God in a box, and I think that you and other people sometimes accidentally forget that. Just because it’s dozens or hundreds of times smarter than us doesn’t mean that we can’t win, perhaps win easily, provided that we make adequate preparations for it.
Also, the other suggestions in my comment were really meant to supplement this. If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us. If it’s probably a FAI anyways, then it doesn’t matter if the box happens to be broken. If we’re building multiple AIs and using them to predict what other AIs will do under certain conditions then we can know whether or not AIs can be trusted (use a random number generator at certain stages of the process to prevent it from reading our minds, hide the knowledge of the random number generator). These protections are meant to work with each other, not independently.
And I don’t think it’s perfect or even good, not by a long shot, but I think it’s better than building an unboxed FAI because it adds a few more layers of protection, and that’s definitely worth pursuing because we’re dealing with freaking existential risk here.
The entire point of my idea is that we can just build the AI such that it doesn’t want to leave the box or increase its power base.
Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I think, because your above objection makes no sense at all and is obviously wrong upon a moment’s reflection.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
I think a key component of our disagreement here might be that I’m assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose- but it’s also not going to cure cancer for us.
If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on.
If the AI doesn’t have this background knowledge- if it can’t read wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
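The opacity is visible even in a toy version of eigenfaces. The components are just the right singular vectors of the centered image matrix; nothing in the output labels one of them “lips”. A minimal sketch with random stand-in data (the dataset and its shapes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.normal(size=(200, 64))      # stand-in for 200 flattened 8x8 face images
centered = faces - faces.mean(axis=0)   # remove the "mean face"

# The eigenfaces are the right singular vectors of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt                          # row k = k-th eigenface: 64 raw numbers

# Each row is an orthonormal direction in pixel space; calling one
# "accentuated lips" is a human gloss applied after the fact.
print(eigenfaces.shape)                                    # prints (64, 64)
print(np.allclose(eigenfaces @ eigenfaces.T, np.eye(64)))  # prints True
```

All the algorithm guarantees is orthonormality and variance ordering; any semantic reading of an individual component is interpretation layered on top, which is the analogue of trying to read intent out of an AI’s internal plan representation.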
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
And I don’t think it’s perfect or even good, not by a long shot, but I think it’s better than building an unboxed FAI because it adds a few more layers of protection, and that’s definitely worth pursuing because we’re dealing with freaking existential risk here.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I won’t. The AI can do whatever it wants to the gatekeepers through the text channel, and won’t want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose- but it’s also not going to cure cancer for us. If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on. If the AI doesn’t have this background knowledge- if it can’t read wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
I agree, the way that I’m proposing to do AI is very limited. I myself can’t think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers. I’m sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don’t really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind. It wouldn’t just magically appear; it would only do things in the way we’d told it to. It would probably be hard, but I think also probably doable if we were very committed.
I could be wrong here because I’ve got no coding experience, just ideas from what I’ve read on this site.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I’ve never seen it proposed elsewhere on this site.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers.
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want.
When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer.
The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
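That composition failure is easy to exhibit in miniature. In this sketch both components are “safe” under their intended use (`route` assumes a human reads the directions and laughs off the joke leg; `autopilot` assumes the route was vetted), and both function names are invented for the example:

```python
def route(origin, destination):
    """'Safe' directions component: fine when a human reads the output."""
    if (origin, destination) == ("New York", "London"):
        return ["drive to the coast", "swim across the Atlantic", "drive to London"]
    return ["drive from {} to {}".format(origin, destination)]

def autopilot(steps):
    """'Safe' driving component: executes each step literally,
    assuming the route was already sanity-checked by a human."""
    return ["executing: " + step for step in steps]

# Composed with no human in the loop, the car tries to swim:
plan = autopilot(route("New York", "London"))
print(plan[1])  # prints "executing: swim across the Atlantic"
```

The sanity check that kept each component safe lived in the human between them; remove the human and neither component supplies it, which is exactly the “individually safe AIs don’t compose into a safe AI” point.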
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind.
A truism in software is that code is harder to read than write, and often the interesting AIs are the nth-generation AIs- where you build an AI that builds an AI that builds an AI (and so on), and it turns out that an AI thought all of the human-readability constraints were cruft (because the AI does really run faster and better without those restrictions).
A truism in software is that code is harder to read than write
Another truism is that truisms are untrue things that people say anyway.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it. This does not apply to most of the software we use to automate minutiae, but it could apply to the core elements of an AGI's search procedure.
The above said, I of course agree that the thought of being able to read the AI’s mind is ridiculous.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
For example, being given code that simulates relativity before Einstein et al. discovered it would have made discovering relativity a lot easier.
Well, yeah, code fully simulating SR and written in a decent way would, but code approximately simulating collisions of ultrarelativistic particles with hand-coded optimizations… not sure.
I of course agree that the thought of being able to read the AI’s mind is ridiculous.
It’s not transparently obvious to me why this would be “ridiculous”; care to enlighten me? Building an AI at all seems ridiculous to many people, but that’s because they don’t actually think about the issue, having never encountered it before. It really seems far more ridiculous to me that we shouldn’t even try to read the AI’s mind, when there’s so much at stake.
AIs aren’t gods; with time, care, and lots of preparation, reading their thoughts should be doable. If you disagree with that statement, please explain why. Rushing things here seems like the most awful idea possible; I really think it would be worth the resource investment.
Why are you so confident that the first version of FAI we make will be safe?
I’m not. I expect it to kill us all with high probability (which is nevertheless lower than the probability of obliteration if no FAI is actively attempted.)
It would be very hard, yes. I never tried to deny that. But I don’t think it’s hard enough to justify not trying to catch it.
Also, you’re only viewing the “output” of the AI, essentially, with that example. If you could model the cognitive processes of the authors of secretly malicious code, then it would be much more obvious that some of their (instrumental) goals didn’t correspond to the ones that you wanted them to be achieving. The only way an AI could deceive us would be to deceive itself, and I’m not confident that an AI could do that.
Since then, I’ve thought more, and gained a lot of confidence on this issue. Firstly, any decision made by the AI to deceive us about its thought processes would logically precede anything that would actually deceive us, so we don’t have to deal with the AI hiding its previous decision to be devious. Secondly, if the AI is divvying its own brain up into certain sections, some of which are filled with false beliefs and some which are filled with true ones, it seems like the AI would render itself impotent on a level proportionate to the extent that it filled itself with false beliefs. Thirdly, I don’t think a mechanism which allowed for total self deception would even be compatible with rationality.
Even if the AI can modify its code, it can’t really do anything that wasn’t entailed by its original programming.
(Ok, it could have a security vulnerability that allowed the execution of externally-injected malicious code, but that is a general issue of all computer systems with an external digital connection)
If it’s a self-modifying AI, the main problem is that it keeps changing. You might find the memory position that corresponds to, say, expected number of paperclips. When you look at it next week wondering how many paperclips there are, it’s changed to staples, and you have no good way of knowing.
If it’s not a self-modifying AI, then I suspect it would be pretty easy. If it used Solomonoff induction, it would be trivial. If not, you are likely to run into problems with structures that only approximate Bayesian reasoning. For example, if you let it develop its own hanging nodes, you’d have a hard time figuring out what they correspond to. They might not even correspond to anything you could feasibly understand. If there’s a big enough structure of them, it might even change out from under you.
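The "memory position that corresponds to expected number of paperclips" point can be made concrete with a toy contrast. In a hand-written agent the quantity has a name an auditor can read; in a learned model the same quantity is smeared across unlabeled weights. Everything below is an illustrative stub, not a claim about real AI architectures:

```python
# Toy illustration: in a hand-written agent, "expected paperclips" is a
# named field an auditor can inspect directly. A self-modifying agent
# could rename or repurpose this field next week, which is exactly the
# auditing problem described above.
class TransparentAgent:
    def __init__(self):
        self.expected_paperclips = 0.0   # auditable by name

    def update(self, outcome_probs):
        # outcome_probs maps {number_of_paperclips: probability}
        self.expected_paperclips = sum(n * p for n, p in outcome_probs.items())

agent = TransparentAgent()
agent.update({0: 0.5, 10: 0.5})
print(agent.expected_paperclips)  # 5.0
```

A learned network computing the same expectation would offer no such labeled field, only "hanging nodes" whose meaning has to be reverse-engineered.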
This is a reason it would be extremely difficult. Yet I feel the remaining existential risk should outweigh that.
It seems to me reasonably likely that our first version of FAI would go wrong. Human values are extremely difficult to understand because they’re spaghetti mush, and they often contradict each other and interact in bizarre ways. Reconciling that in a self consistent and logical fashion would be very difficult to do. Coding a program to do that would be even harder. We don’t really seem to have made any real progress on FAI thus far, so I think this level of skepticism is warranted.
I’m proposing multiple alternative tracks to safer AI, which should probably be used in conjunction with the best FAI we can manage. Some of these tracks are expensive and difficult, but others seem simpler. The interactions between the different tracks produce a sort of safety net where the successes of one check the failures of the others, as I’ve had to show throughout this conversation again and again.
I’m willing to spend much more to keep the planet safe against a much lower level of existential risk than anyone else here, I think. That’s the only reason I can think to explain why everyone keeps responding with objections that essentially boil down to “this would be difficult and expensive”. But the entire idea of AI is expensive, as well as FAI, yet the costs are accepted easily in those cases. I don’t know why we shouldn’t just add another difficult project to our long list of difficult projects to tackle, given the stakes that we’re dealing with.
Most people on this site seem only to consider AI as a project to be completed in the next fifty or so years. I see it more as the most difficult task that’s ever been attempted in all humankind. I think it will take at least 200 years, even factoring in the idea that new technologies I can’t even imagine will be developed over that time. I think the most common perspective on the way we should approach AI is thus flawed and rushed compared to the stakes, which are millions of generations of human descendants. We’re approaching a problem that affects millions of future generations, and trying to fix it in half a generation with as cheap a budget as we think we can justify, and that seems like a really bad idea (possibly the worst idea ever) to me.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
EY’s experiment is wholly irrelevant to this claim. Either you’re introducing irrelevant facts or morphing your position. I think you’re doing this without realizing it, and I think it’s probably due to motivated cognition (because morphing claims without noticing it correlates highly with motivated cognition in my experience). I really feel like we might have imposed a box-taboo on this site that is far too strong.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
You keep misunderstanding what I’m saying over and over and over again and it’s really frustrating and a big time sink. I’m going to need to end this conversation if it keeps happening because the utility of it is going down dramatically with each repetition.
I’m not proposing a system where the AI doesn’t interact with the outside world. I’m proposing a system where the AI is only ever willing to use a few appendages to affect the outside world, as opposed to potentially dozens. This minimizes the degree of control that the AI has dramatically, which is a good thing.
This is not FAI either, it is an additional constraint that we should use when putting early FAIs into action. I’m not saying that we merge the AIs values to the values of the gatekeeper, I have no idea where you keep pulling that idea from.
It’s possible that I’m misunderstanding you, but I don’t know how that would be true specifically, because many of your objections just seem totally irrelevant to me and I can’t understand what you’re getting at. It seems more likely that you’re just not used to the idea of this version of boxing so you just regurgitate generic arguments against boxing, or something. You’re also coming up with more obscure arguments as we go farther into this conversation. I don’t really know what’s going on at your end, but I’m just annoyed at this point.
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want. When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer. The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
I don’t even understand how this clashes with my position. I understand that smashing simple AIs together is a dumb idea, but I never proposed that ever. I’m proposing using this special system for early FAIs, and asking them very carefully some very specific questions, along with other questions, so that we can be safe. I don’t want this AI to have any direct power, or even super accurate input information.
Yes, obviously, this type of AI is a more limited AI. That’s the goal. Limiting our first attempt at FAI is a fantastic idea because existential risk is scary. We’ll get fewer benefits from the FAI, and it will take longer to get them. But it will be a good idea, because it seems really likely to me that we could mess up FAI without even knowing it.
A truism in software is that code is harder to read than write, and often the interesting AIs are nth-generation AIs: you build an AI that builds an AI that builds an AI (and so on), and it turns out that one of those AIs decided all of the human-readability constraints were cruft (because the AI really does run faster and better without those restrictions).
Sure, it will be hard to read the AI’s mind. I see no reason why we shouldn’t even try, though.
You say that the AI will build an AI that will build an AI. But then you immediately jump to assuming that this means the final AI would leap beyond human comprehension. AIs are not Gods, and we shouldn’t treat them like ones. If we could pause the AI and read its coding, while slowing down its thought processes, and devoting lots of resources to the project (as we should do, no matter what) then reading its mind seems doable. We could also use earlier AIs to help us interpret the thoughts of later AIs, if necessary.
Reading its mind literally would guarantee that it couldn’t trick us. Why would we not choose to pursue this, even if it sorta seems like it might be expensive?
I’ve already addressed this: the AI would still be entirely dependent on its appendages, and that’s a major advantage. So long as we watch the appendages and act to check any actions by them that seem suspicious, the AI would remain weak. The AI isn’t magic, and it’s not even beyond the scope of human cunning if we limit its input data. Keep in mind also that we’d watch the communications between the appendages and the AI, so we’d know immediately if it was trying to get them to make it any tools. The Gatekeepers wouldn’t exist in a vacuum; they would be watched over and countered by us.
An excerpt from Wise Man’s Fear, by Patrick Rothfuss. Boxing is not safe.
I thought Chronicler’s reply to this was excellent, however. Omniscience does not necessitate omnipotence.
I mean, the UFAI in our world would have an easy time of killing everything. But in their world it’s different.
EDIT: Except that maybe we can be smart and stop the UFAI from killing everything even in our world, see my above comment.
Hah, I actually quoted much of that same passage on IRC in the same boxing vein! Although as presented the scenario does have some problems:
It is conceivable that there is no (near enough) future where Cthaeh is freed, thus it is powerless to affect its own fate, or is waiting for the right circumstances.
That seemed a little unlikely to me, though. As presented in the book, a minimum of many millennia have passed since the Cthaeh began operating, and possibly millions of years (in some frames of reference). It’s had enough power to set planes of existence at war with each other and apparently cause the death of gods. I can’t help but feel that it’s implausible that in all that time, not one forking path led to its freedom. Much more plausible that it’s somehow inherently trapped in or bound to the tree, so there’s no meaningful way in which it could escape (which breaks the analogy to a UFAI).
Isn’t that what I said?
Not by my reading. In your comment, you gave three possible explanations, two of which are the same (it gets freed, but a long time from ‘now’) and the third a restriction on its foresight which is otherwise arbitrary (‘powerless to affect its own fate’). Neither of these translates to ‘there is no such thing as freedom for it to obtain’.
Alternatively, perhaps the Cthaeh’s ability to see the future is limited to those possible futures in which it remains in the tree.
Leading to a seriously dystopian variant on Tenchi Muyo!...
I’ve come up with what I believe to be an entirely new approach to boxing, essentially merging boxing with FAI theory. I wrote a couple thoughts down about it, but lost my notes, and I also don’t have much time to write this comment, so forgive me if it’s vague or not extremely well reasoned. I also had a couple of tangential thoughts; if I remember them in the course of writing this, or recover my notes later, then I’ll put them here as well.
The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI still could manipulate people in an attempt to use them to impact the outside world, so the AI wouldn’t be totally boxed, but I’m inclined to think that we could maintain a very high degree of control over the AI, since the only powers it could ever have would be through communication with us.
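One way to read this proposal is as a lexical penalty term in the utility function: any out-of-box action is worth negative infinity, so no task reward can outweigh it. A toy sketch, where all action names and rewards are invented, and which deliberately ignores the hard part (formally defining "leaving the box" so that the letter of the rule matches its spirit):

```python
# Sketch of "wants very badly to stay in the box" as a utility term.
IN_BOX_ACTIONS = {"think", "send_text_to_gatekeeper"}

def utility(action, task_reward):
    if action not in IN_BOX_ACTIONS:
        return float("-inf")   # leaving the box dominates any possible gain
    return task_reward

def choose(actions_with_rewards):
    # Pick the action with highest utility under the box penalty.
    return max(actions_with_rewards, key=lambda ar: utility(*ar))[0]

best = choose([("send_text_to_gatekeeper", 1.0),
               ("connect_to_internet", 1_000_000.0)])
assert best == "send_text_to_gatekeeper"
```

Whether a real optimizer would respect the intended boundary of `IN_BOX_ACTIONS`, rather than finding an action that technically qualifies while escaping in effect, is exactly the specification problem raised in the replies.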
The idea came because I recalled a discussion that occurred on TOR.com about why the Cthaeh was in the tree. The general conclusion was that either the Cthaeh was bound by extremely powerful forces, or that the Cthaeh wanted to be in the tree, perhaps because it was instrumentally useful to him. While I found that second explanation implausible in the context of Rothfussland, that discussion led me to realize that almost all approaches towards AI boxing have done so through the first branch of potential boxing solutions, that is, external constraints imposed on the AI, as opposed to the second branch, internal constraints that the AI imposes on itself because of its utility function.
This led me to think that we should test our capabilities with FAI systems by putting them in a box and giving them limited utility functions, ones that couldn’t possibly lead them to want to manipulate us. So, for example, we could put them in a box and give them a strong desire to stay in the box, along with a desire to tell the truth, and then we could ask it what would happen if we gave it additional desires. We could then test that by building a different AI with those desires and seeing what it did.
It also occurred to me that we would be able to functionally read the AI’s mind by looking at its code. If so, it seems like it’d be much harder for it to fool us than we previously thought. It could potentially just deceive us by manipulating its own thoughts, but I feel like we might be able to stop AIs from doing that somehow, or that the AI would choose not to manipulate its own thoughts that way because it would go insane, or something. It would certainly make the AI’s job much much much harder.
Then I thought that another safety component we should put into testing was a dramatically slowed thinking process for the AI. I’m not sure how exactly this would be done, because I’m not very good or experienced with actual technical implementation of ideas, but my idea was to create a crude simulation where we controlled all data inputs and outputs, slowed down all processes within the simulation by a factor of 1000, and then built the AI within the simulation. That exact scheme probably wouldn’t work, given my sucky technical know-how, but something similar might. Slowing down the AI, perhaps even giving ourselves the ability to pause it without letting it know what’s going on, combined with our ability to “read its mind”, might dramatically improve the safety of boxing.
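The slowdown-and-pause idea can at least be sketched at the level of a step-granted sandbox: the agent only advances when an overseer loop grants it a step, so the overseer can throttle or pause it at will. The agent below is a trivial stub, and real slowdown of a whole system is of course far harder than this:

```python
import threading
import time

class Sandbox:
    """Runs an agent step-by-step under overseer control (illustrative only)."""
    def __init__(self, agent_step, slowdown=1000):
        self.agent_step = agent_step
        self.slowdown = slowdown
        self.paused = threading.Event()   # overseer sets this to freeze the agent

    def run(self, steps, base_step_time=0.0001):
        for _ in range(steps):
            while self.paused.is_set():
                time.sleep(0.01)          # agent is frozen; it cannot observe this
            self.agent_step()
            # Stretch each step's wall-clock duration by the slowdown factor.
            time.sleep(base_step_time * self.slowdown)

counter = {"ticks": 0}
box = Sandbox(lambda: counter.__setitem__("ticks", counter["ticks"] + 1),
              slowdown=10)
box.run(steps=3)
assert counter["ticks"] == 3
```

The design choice worth noting is that control sits entirely outside the agent: the agent's code never sees the pause flag or the timer, it simply experiences fewer steps per second of our time.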
I’d also like to recommend that before we build a real FAI, we at least try to build a boxed one first, even with a low probability of success. It wouldn’t make things worse in the event that boxing failed, except that it might delay global happiness by a few hours, and in the event that the FAI program was broken we just might save ourselves from menaces to humankind like Clippy.
How do you specify precisely what it means to “stay in the box”? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?
It would not count, we’d want to make the AI not want this almost identical AI to exist. That seems possible, it would be like how I don’t want there to exist an identical copy of me except it eats babies. There are lots of changes to my identity that would be slight but yet that I wouldn’t want to exist.
To be more precise, I’d say that it counts as going outside the box if it does anything except think or talk to the Gatekeepers through the text channel. It can use the text channel to manipulate the Gatekeepers to do things, but it can’t manipulate them to do things that allow it to do anything other than use the text channel. It would, in a certain sense, be partially deontologist, and be unwilling to do things directly other than text the Gatekeepers. How ironic. Lolz.
Also: how would it do this, anyway? It would have to convince the Gatekeepers to convince the scientists to do this, or teach them computer science, or tell them its code. And if the AI started teaching the Gatekeepers computer code or techniques to incapacitate scientists, we’d obviously be aware that something had gone wrong. And, in the system I’m envisioning, the Gatekeepers would be closely monitored by other groups of scientists and bodyguards, and the scientists would be guarded, and the Gatekeepers wouldn’t even have to know who specifically did what on the project.
And that’s the problem. For in practice a partial-deontologist, partial-consequentialist will treat its deontological rules as obstacles to achieving what its consequentialist part wants, and route around them.
This is both a problem and a solution because it makes the AI weaker. A weaker AI would be good because it would allow us to more easily transition to safer versions of FAI than we would otherwise come up with independently. I think that delaying a FAI is obviously much better than unleashing a UFAI. My entire goal throughout this conversation has been to think of ways that would make hostile FAIs weaker, I don’t know why you think this is a relevant counter objection.
You assert that it will just route around the deontological rules; that’s nonsense and a completely unwarranted assumption, so try to actually back up what you’re asserting with arguments. You’re wrong. It’s obviously possible to program things (e.g. people) such that they’ll refuse to do certain things no matter what the consequences (e.g. you wouldn’t murder trillions of babies to save billions of trillions of babies, because you’d go insane if you tried, because your body has such strong empathy mechanisms and you inherently value babies a lot). This means that we wouldn’t give the AI unlimited control over its source code, of course; we’d make the part that tells it to be a deontologist who likes text channels unmodifiable. That specific drawback doesn’t jibe well with the aesthetic of a super-powerful AI that’s master of itself and the universe, I suppose, but other than that I see no drawback. Trying to build things in line with that aesthetic might actually be a reason for some of the more dangerous proposals in AI; maybe we’re having too much fun playing God and not enough despair.
I’m a bit cranky in this comment because of the time sink that I’m dealing with to post these comments, sorry about that.
What it means for “the AI to be in the box” is generally that the AI’s impacts on the outside world are filtered through the informed consent of the human gatekeepers.
An AI that wants to not impact the outside world will shut itself down. An AI that wants to only impact the outside world in a way filtered through the informed consent of its gatekeepers is probably a full friendly AI, because it understands both its gatekeepers and the concept of informed consent. An AI that simply wants its ‘box’ to remain functional, but is free to impact the rest of the world, is like a brain that wants to stay within a skull- that is hardly a material limitation on the rest of its behavior!
I think you misunderstand what I mean by proposing that the AI wants to stay inside the box. I mean that the AI wouldn’t want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
I agree that your and my understanding of the phrase “stay inside the box” differ. What I’m trying to do is point out that I don’t think your understanding carves reality at the joints. In order for the AI to stay inside the box, the box needs to be defined in machine-understandable terms, not human-inferrable terms.
Each half of this sentence has a deep problem. Wouldn’t correctly answering the questions of or otherwise improving the lives of the gatekeepers increase the AI’s power base, since the AI has the ability to communicate with the gatekeepers?
The problem with restrictions like “only be willing to talk” is that they restrict the medium but not the content. So, the AI has a text-only channel that goes just to the gatekeepers, but that doesn’t restrict the content of the messages the AI can send to the gatekeepers. The fictional Cthaeh only wants to talk to its gatekeepers, and yet it still manages to get done what it wants to get done. Words have impacts, and it should be anticipated that the AI picks words because of their impacts.
Sure, the AI can manipulate gatekeepers. But this is a major improvement. You miss my point.
The Cthaeh is very limited by being trapped in its tree and only able to talk to passerby. The UFAI would be limited by being trapped in its text only communication channel. It wouldn’t be able to do things like tell the gatekeepers to plug it into the Internet or to directly control an autonomous army of robots, it would be forced instead to use the gatekeepers as its appendages, and the gatekeepers have severe limitations on brain capacity and physical strength. I think that if we did this and kept careful watch on the gatekeepers and used some other safety measures, boxing an AI would become feasible.
Now, I’m still not advocating using boxed AI willy-nilly, but I think using it as a cautionary measure once we think we’ve got FAI figured out would be a very very very good idea, because FAI seems ridiculously hard and there’s no way to know what we don’t know, so there might be unseen problems with whatever final FAI we build. Boxing the supposed FAI can only help us, and with existential threats you should take advantage of whatever possibilities you can tilt towards favorable outcomes, even if the probability seems minuscule.
That’s not to say that I think the probability that this would stop a disaster is minuscule, of course (I think it’s rather high, in the event that the FAI was actually UFAI), but just to say that I think it would be a good idea even if my probability assessments are mostly wrong.
Over the AI being embodied in a giant death robot, sure. But limiting the AI to a text-only channel is an old idea, and what Eliezer used for the “let me out of the box” demonstrations.
One of the things to note is that the limitations on the AI don’t limit anything the AI creates. If the AI gives you a blueprint for a nanomachine that purports to cure cancer, but when you make it it turns out to actually be grey goo, then humans lose, even though the AI never left the box. (Note that a worst-case AI is very subtle- suppose the nanomachine actually does cure cancer when built correctly, but a one-in-a-trillion manufacturing error will make it grey goo. Since you’re making trillions of them, you then get grey goo, and humans lose.)
Trying to formally specify “don’t give us any plans that will build an AI” or “don’t give us any plans that will cause humans to lose” without silencing the AI completely is a genuinely difficult affair.
Basically, if you assume that the AI is tricky enough to circumvent any medium restrictions you place on it, then the only way to avoid “humans lose” is to have its goal be “humans win,” which is actually a pretty complicated goal. Expressing that goal in a machine-understandable way is pretty much the FAI problem.
The entire point of Eliezer’s demonstration was that if an AI wants to it can increase its power base even starting from a text only communication system. The entire point of my idea is that we can just build the AI such that it doesn’t want to leave the box or increase its power base. It dodges that entire problem, that’s the whole point.
You’ve gotten so used to being scared of boxed AI that you’re reflexively rejecting my idea, I think, because your above objection makes no sense at all and is obviously wrong upon a moment’s reflection. All of my bias-alarms have been going off since your second comment reply, please evaluate yourself and try to distance yourself from your previous beliefs, for the sake of humanity. Also, here is a kitten, unless you want it to die then please reevaluate: http://static.tumblr.com/6t3upxl/Aawm08w0l/khout-kitten-458882.jpeg
Limitations on the AI restrict the range of things that the AI can create. Yes, if we just built whatever the AI said to and the AI was unfriendly then we would lose. Obviously. Yes, if we assume that the UFAI is tricky enough to “circumvent any medium restrictions [we] place on it” then we would lose, practically by definition. But that assumption isn’t warranted. (These super weak strawmen were other indications to me that you might be being biased on this issue.)
I think a key component of our disagreement here might be that I’m assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent. You’re either assuming that the AI could deduce the inner workings of our facility and the world and the universe from those text messages, or that the AI had access to a bunch of information about the world already. I disagree with both assumptions: the AI’s direct perception could be severely limited and should be, and it isn’t magic, so it couldn’t deduce the inner workings of our economy or the nature of nuclear fusion through deduction alone (because knowledge comes from experience and induction). (You might not be making either of those assumptions; this is a guess in an attempt to help resolve our disagreement more quickly, sorry if it’s wrong.)
Also, I’m envisioning a system where people that the AI doesn’t know and that the Gatekeepers don’t know about observe their communications. That omitted detail might be another reason for your disagreement, I just assumed it would be apparent for some stupid reason, my apologies.
I think we would have to be careful about what questions we asked the AI. But I see no reason why it could manipulate us automatically and inevitably, no matter what questions we asked it. I think extracting useful information from it would be possible, perhaps even easy. An AI in a box would not be God in a box, and I think that you and other people sometimes accidentally forget that. Just because it’s dozens or hundreds of times smarter than us doesn’t mean that we can’t win, perhaps win easily, provided that we make adequate preparations for it.
Also, the other suggestions in my comment were really meant to supplement this. If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us. If it’s probably a FAI anyways, then it doesn’t matter if the box happens to be broken. If we’re building multiple AIs and using them to predict what other AIs will do under certain conditions then we can know whether or not AIs can be trusted (use a random number generator at certain stages of the process to prevent it from reading our minds, hide the knowledge of the random number generator). These protections are meant to work with each other, not independently.
And I don’t think it’s perfect or even good, not by a long shot, but I think it’s better than building an unboxed FAI because it adds a few more layers of protection, and that’s definitely worth pursuing because we’re dealing with freaking existential risk here.
Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
If you would like to point out a part of the argument that does not follow, I would be happy to try to clarify it for you.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose- but it’s also not going to cure cancer for us.
If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on.
If the AI doesn’t have this background knowledge- if it can’t read wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. Why does the fifteenth eigenface have accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell whether the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
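To make the eigenface point concrete, here’s a minimal sketch of where eigenfaces come from (assuming numpy; the “faces” here are random stand-ins for illustration, not a real dataset):

```python
import numpy as np

# Toy "face" dataset: 20 flattened 8x8 grayscale images (random stand-ins).
rng = np.random.default_rng(0)
faces = rng.random((20, 64))

# PCA by hand: center the data, then take the principal directions via SVD.
mean_face = faces.mean(axis=0)
centered = faces - mean_face
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt  # row i is the i-th eigenface (a 64-dimensional direction)

# Every number in eigenfaces is perfectly inspectable, but nothing in the
# math labels one direction "lips" and another "cheekbones" -- those
# interpretations are something humans read into the weights afterward.
print(eigenfaces.shape)  # (20, 64)
```

The point is that total transparency of the numbers doesn’t buy you interpretability of the concepts; the same gap would apply, much more severely, to the internal representations of plans.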
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
I won’t. The AI can do whatever it wants to the gatekeepers through the text channel, and won’t want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
I agree, the way that I’m proposing to do AI is very limited. I myself can’t think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers. I’m sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don’t really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind. It wouldn’t just magically appear; it would only do things in the way we’d told it to. It would probably be hard, but I think also probably doable if we were very committed.
I could be wrong here because I’ve got no coding experience, just ideas from what I’ve read on this site.
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I’ve never seen it proposed elsewhere on this site.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want.
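A single-question answerer of this sort is easy to sketch. Here’s a toy route-finder using Dijkstra’s algorithm (the road graph and travel times are made up for illustration):

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: this whole 'AI' can answer exactly one kind of
    question, and its only possible output is a distance and a path."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return None

# Hypothetical road network: edge weights are travel minutes.
roads = {
    "A": [("B", 5), ("C", 10)],
    "B": [("C", 3), ("D", 9)],
    "C": [("D", 4)],
}
print(shortest_route(roads, "A", "D"))  # (12, ['A', 'B', 'C', 'D'])
```

Nothing this program can emit is dangerous, which is exactly why it’s safe- and also exactly why it can’t cure cancer.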
When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer.
The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
A truism in software is that code is harder to read than to write, and often the interesting AIs are the nth generation AIs- where you build an AI that builds an AI that builds an AI (and so on), and it turns out that an AI thought all of the human-readability constraints were cruft (because the AI really does run faster and better without those restrictions).
Another truism is that truisms are untrue things that people say anyway.
Examples of code that is easier to read than to write include those where the code represents a deep insight that must be discovered in order to implement it. This does not apply to most of the software we use to automate minutiae, but it could potentially apply to the core elements of an AGI’s search procedure.
The above said, I of course agree that the thought of being able to read the AI’s mind is ridiculous.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
For example, being given code that simulates relativity before Einstein et al. discovered it would have made discovering relativity a lot easier.
Well, yeah, code fully simulating SR and written in a decent way would, but code approximately simulating collisions of ultrarelativistic particles with hand-coded optimizations… not sure.
It’s not transparently obvious to me why this would be “ridiculous”; care to enlighten me? Building an AI at all seems ridiculous to many people, but that’s because they don’t actually think about the issue, because they’ve never encountered it before. It really seems far more ridiculous to me that we shouldn’t even try to read the AI’s mind, when there’s so much at stake.
AIs aren’t Gods; with time and care and lots of preparation, reading their thoughts should be doable. If you disagree with that statement, please explain why. Rushing things here seems like the most awful idea possible; I really think it would be worth the resource investment.
Sure, possible. Just a lot harder than creating an FAI to do it for you—especially when the AI has an incentive to obfuscate.
Why are you so confident that the first version of FAI we make will be safe? Doing both is safest and seems like it would be worth the investment.
I’m not. I expect it to kill us all with high probability (which is nevertheless lower than the probability of obliteration if no FAI is actively attempted.)
Humans reading computer code aren’t gods either. How long until an uFAI would get caught if it did stuff like this?
It would be very hard, yes. I never tried to deny that. But I don’t think it’s hard enough to justify not trying to catch it.
Also, you’re only viewing the “output” of the AI, essentially, with that example. If you could model the cognitive processes of the authors of secretly malicious code, then it would be much more obvious that some of their (instrumental) goals didn’t correspond to the ones that you wanted them to be achieving. The only way an AI could deceive us would be to deceive itself, and I’m not confident that an AI could do that.
That’s not the same as “I’m confident that an AI couldn’t do that”, is it?
At the time, it wasn’t the same.
Since then, I’ve thought more, and gained a lot of confidence on this issue. Firstly, any decision made by the AI to deceive us about its thought processes would logically precede anything that would actually deceive us, so we don’t have to deal with the AI hiding its previous decision to be devious. Secondly, if the AI is divvying its own brain up into certain sections, some of which are filled with false beliefs and some which are filled with true ones, it seems like the AI would render itself impotent on a level proportionate to the extent that it filled itself with false beliefs. Thirdly, I don’t think a mechanism which allowed for total self deception would even be compatible with rationality.
Even if the AI can modify its code, it can’t really do anything that wasn’t entailed by its original programming.
(Ok, it could have a security vulnerability that allowed the execution of externally-injected malicious code, but that is a general issue of all computer systems with an external digital connection)
The hard part is predicting everything that was entailed by its initial programming and making sure it’s all safe.
That’s right, history of engineering tells us that “provably safe” and “provably secure” systems fail in unanticipated ways.
If it’s a self-modifying AI, the main problem is that it keeps changing. You might find the memory position that corresponds to, say, expected number of paperclips. When you look at it next week wondering how many paperclips there are, it’s changed to staples, and you have no good way of knowing.
If it’s not a self-modifying AI, then I suspect it would be pretty easy. If it used Solomonoff induction, it would be trivial. If not, you are likely to run into problems with things that only approximate Bayesian reasoning. For example, if you let it develop its own hanging nodes, you’d have a hard time figuring out what they correspond to. They might not even correspond to something you could feasibly understand. If there’s a big enough structure of them, it might even change.
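A toy illustration of the hanging-node problem (a hypothetical two-layer network with made-up random weights, just to show the shape of the difficulty):

```python
import numpy as np

# A tiny two-layer network: the "hanging nodes" are the hidden units.
rng = np.random.default_rng(1)
w1 = rng.standard_normal((4, 3))   # input -> hidden weights (learned; random here)
w2 = rng.standard_normal((3, 1))   # hidden -> output weights

x = np.array([1.0, 0.0, 2.0, -1.0])
hidden = np.tanh(x @ w1)           # three hidden activations
output = hidden @ w2

# We can read every number in w1, w2, and hidden exactly -- but nothing
# here says which human concept, if any, a given hidden unit tracks.
print(hidden.shape, output.shape)
```

Even in this three-unit case the hidden nodes have no built-in labels; in a network the AI grew itself, the problem is that much worse.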
This is a reason it would be extremely difficult. Yet I feel the remaining existential risk should outweigh that.
It seems to me reasonably likely that our first version of FAI would go wrong. Human values are extremely difficult to understand because they’re spaghetti mush, and they often contradict each other and interact in bizarre ways. Reconciling that in a self consistent and logical fashion would be very difficult to do. Coding a program to do that would be even harder. We don’t really seem to have made any real progress on FAI thus far, so I think this level of skepticism is warranted.
I’m proposing multiple alternative tracks to safer AI, which should probably be used in conjunction with the best FAI we can manage. Some of these tracks are expensive, and difficult, but others seem simpler. The interactions between the different tracks produces a sort of safety net where the successes of one check the failures of others, as I’ve had to show throughout this conversation again and again.
I’m willing to spend much more to keep the planet safe against a much lower level of existential risk than anyone else here, I think. That’s the only reason I can think to explain why everyone keeps responding with objections that essentially boil down to “this would be difficult and expensive”. But the entire idea of AI is expensive, as well as FAI, yet the costs are accepted easily in those cases. I don’t know why we shouldn’t just add another difficult project to our long list of difficult projects to tackle, given the stakes that we’re dealing with.
Most people on this site seem only to consider AI as a project to be completed in the next fifty or so years. I see it more as the most difficult task that’s ever been attempted in all humankind. I think it will take at least 200 years, even factoring in the idea that new technologies I can’t even imagine will be developed over that time. I think the most common perspective on the way we should approach AI is thus flawed, and rushed, compared to the stakes, which are millions of generations of human descendants. We’re approaching a problem that affects millions of future generations, and trying to fix it in half a generation with as cheap a budget as we think we can justify, and that seems like a really bad idea (possibly the worst idea ever) to me.
EY’s experiment is wholly irrelevant to this claim. Either you’re introducing irrelevant facts or morphing your position. I think you’re doing this without realizing it, and I think it’s probably due to motivated cognition (because morphing claims without noticing it correlates highly with motivated cognition in my experience). I really feel like we might have imposed a box-taboo on this site that is far too strong.
You keep misunderstanding what I’m saying over and over and over again and it’s really frustrating and a big time sink. I’m going to need to end this conversation if it keeps happening because the utility of it is going down dramatically with each repetition.
I’m not proposing a system where the AI doesn’t interact with the outside world. I’m proposing a system where the AI is only ever willing to use a few appendages to affect the outside world, as opposed to potentially dozens. This dramatically minimizes the degree of control that the AI has, which is a good thing.
This is not FAI either, it is an additional constraint that we should use when putting early FAIs into action. I’m not saying that we merge the AIs values to the values of the gatekeeper, I have no idea where you keep pulling that idea from.
It’s possible that I’m misunderstanding you, but I don’t know how that would be true specifically, because many of your objections just seem totally irrelevant to me and I can’t understand what you’re getting at. It seems more likely that you’re just not used to the idea of this version of boxing so you just regurgitate generic arguments against boxing, or something. You’re also coming up with more obscure arguments as we go farther into this conversation. I don’t really know what’s going on at your end, but I’m just annoyed at this point.
I don’t even understand how this clashes with my position. I understand that smashing simple AIs together is a dumb idea, but I never proposed that ever. I’m proposing using this special system for early FAIs, and asking them very carefully some very specific questions, along with other questions, so that we can be safe. I don’t want this AI to have any direct power, or even super accurate input information.
Yes, obviously, this type of AI is a more limited AI. That’s the goal. Limiting our first attempt at FAI is a fantastic idea because existential risk is scary. We’ll get less benefits from the FAI, and it will take longer to get those benefits. But it will be a good idea, because it seems really likely to me that we could mess up FAI without even knowing it.
Sure, it will be hard to read the AI’s mind. I see no reason why we shouldn’t even try, though.
You say that the AI will build an AI that will build an AI. But then you immediately jump to assuming that this means the final AI would leap beyond human comprehension. AIs are not Gods, and we shouldn’t treat them like ones. If we could pause the AI and read its code, while slowing down its thought processes, and devoting lots of resources to the project (as we should do, no matter what), then reading its mind seems doable. We could also use earlier AIs to help us interpret the thoughts of later AIs, if necessary.
Reading its mind literally would guarantee that it couldn’t trick us. Why would we not choose to pursue this, even if it sorta seems like it might be expensive?
The problem is that the AI could use its appendages to create and use tools that are more powerful than the appendages themselves.
I’ve already addressed this: the AI would still be entirely dependent on its appendages, and that’s a major advantage. So long as we watch the appendages and act to check any actions by them that seem suspicious, the AI would remain weak. The AI isn’t magic, and it’s not even beyond the scope of human cunning if we limit its input data. Again, keep in mind that we’d watch the communications between the appendages and the AI as well, so we’d know immediately if it was trying to get them to make it any tools. The Gatekeepers wouldn’t exist in a vacuum; they would be watched over and countered by us.
I think this conversation has run its course as well, though I intend to pursue a side issue in PMs.