You’re assuming that people who don’t understand Friendly AI have enough competence to actually build a functioning agent AI in the first place.
You know, ragequitting is not a fantastic way to endear yourself to people.
I think cleaning up ‘leftovers’ is definitely a lifelong-friend kind of duty. Like bailing you out of jail at four in the morning...
Hmm. I think what he was proposing was that the AI’s actions be defined as the intersection of the set of things the AI thinks will accomplish the goal and the set of things the models won’t veto. Assuming there’s a way to prevent the AI from intentionally compromising the models, the question is how often the AI will come up with a plan for accomplishing goal X which will, completely coincidentally, break the models in undesirable ways.
I think I disagree. The arbitrary and unfeeling processes of the universe can probably be outperformed by anything with a shred of empathy and intellect. You’d just want to be really, really careful, and try to create something better than yourself to hand off control to.
Respectfully, I think you may have overestimated inferential distance: I have read the sequences.
My point wasn’t that the AI wouldn’t produce bad output. I’d say the majority of its proposed solutions would be… let’s call it inadvertently hostile. However, the models would presumably veto those scenarios on sight.
Say we’ve got a two-step process: step one, generate a whole bunch of valid ways of solving the problem (this would include throwing you out the window, carrying you gently down the stairs, blowing up the building, and everything in between); step two, let the models veto the scenarios they find objectionable.
In that case, most of the solutions, including the window one, would be vetoed. So, in principle, the remaining set would be full of solutions that both satisfy the models and accomplish the desired goal. My question is: of the solutions generated in step one, how many will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)?
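If it helps, here’s the kind of toy structure I have in mind. The plan representations and the tests are all made up purely for illustration, not a claim about how a real system would encode any of this:

```python
# Toy sketch of the two-step "generate, then veto" process described above.
# The plans, the goal test, and the veto rule are invented stand-ins.

def accomplishes_goal(plan):
    return plan["gets_you_out"]

def model_vetoes(plan):
    # Stand-in for a human model's judgment: veto anything involving harm.
    return plan["harms_humans"]

candidate_plans = [
    {"name": "carry you gently down the stairs", "gets_you_out": True,  "harms_humans": False},
    {"name": "throw you out the window",         "gets_you_out": True,  "harms_humans": True},
    {"name": "blow up the building",             "gets_you_out": True,  "harms_humans": True},
    {"name": "do nothing",                       "gets_you_out": False, "harms_humans": False},
]

# Step one: keep only the plans the AI expects to accomplish the goal.
valid = [p for p in candidate_plans if accomplishes_goal(p)]

# Step two: the models veto the scenarios they find objectionable.
surviving = [p for p in valid if not model_vetoes(p)]

print([p["name"] for p in surviving])  # -> ['carry you gently down the stairs']
```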
I understand your point, and it’s a good one. Let me think about it and get back to you.
You might want to add a check box for ‘are you in college?’ Most colleges are wildly different, cosmology-wise, from their surrounding environs.
If you were promoting the theory before that point, the police may still have some pointed questions to ask you.
I understood your point. I was simply making a joke.
Leaving aside the question of whether Tool AI as you describe it is possible until I’ve thought more about it:
The idea of a “self-improving algorithm” intuitively sounds very powerful, but does not seem to have led to many “explosions” in software so far (and it seems to be a concept that could apply to narrow AI as well as to AGI).
Looking to the past for examples is a very weak heuristic here, since we have never dealt with software that could write code at a better than human level before. It’s like saying, before the invention of the internal combustion engine, “faster horses have never let you cross oceans before.” Same goes for the assumption that strong AI will resemble extremely narrow AI software tools that already exist in specific regards. It’s evidence, but it’s very weak evidence, and I for one wouldn’t bet on it.
I myself employ a very strong heuristic, from years of trolling the internet: when a user joins a forum and complains about an out-of-character and strongly personal persecution by the moderation staff in the past, there is virtually always more to the story when you look into it.
Okay, after some thought (and final exams), here’s what I think:
If our plan generator returns a set of plans ranked by overall expected utility, the question is basically whether the hit to expected utility provided by Friendliness moves you further down the list than the average joint improbability of whatever sequence of extraneous events would have to happen to break the models in undesirable ways. A second question is how those figures will scale with the complexity of the problem being undertaken. The answer is a firm ‘I don’t know.’ We’d need to run some extensive experiments to find out.
That said, I can think of several ways to address this problem, and, if I were the designer, I would probably take all of them. You could, for example, start the system on easier problems, and have it learn heuristics for things that the models will veto, since you can’t mindhack simple heuristics, and have it apply those rules during the plan generation stage, thus reducing the average distance to Friendliness. You could also have the models audit themselves after each scenario, and determine if they’ve been compromised. Still doesn’t make mindhacking impossible, but it means it has to be a lot more complicated, which non-linearly increases the average distance to mindhacking. There are other strategies to pursue as well. Basically, given that we know the hardest problem we’re planning on throwing this AI at, I would bet there’s a way to make it safe enough to tackle that problem without going apocalyptic.
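To make the heuristic-learning idea concrete, here’s a toy sketch. The feature representation and the frequency-counting rule are stand-ins I’m inventing; a real system would need something far less naive, and I’ve left the self-audit step out entirely:

```python
# Sketch: learn cheap "the models will veto this" heuristics on easy problems,
# then apply them during plan generation on harder ones.

from collections import Counter

veto_counts, seen_counts = Counter(), Counter()

def plan_features(plan):
    # Stand-in for whatever cheap, hard-to-mindhack features the heuristics key on.
    return plan["features"]

def record_model_verdict(plan, vetoed):
    # Called during the easy-problem phase, where we trust the models' verdicts.
    for f in plan_features(plan):
        seen_counts[f] += 1
        if vetoed:
            veto_counts[f] += 1

def heuristic_reject(plan, threshold=0.9):
    # Discard a candidate early if it contains a feature that almost always drew a veto.
    return any(seen_counts[f] > 0 and veto_counts[f] / seen_counts[f] >= threshold
               for f in plan_features(plan))

# Easy-problem phase: the models label a few plans.
record_model_verdict({"features": {"human extinction"}}, vetoed=True)
record_model_verdict({"features": {"carry down stairs"}}, vetoed=False)

# Hard-problem phase: the cheap pre-filter runs before the models ever see a plan.
print(heuristic_reject({"features": {"human extinction", "saves grandmother"}}))  # True
```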
If you tried to spell that out, the odds you’d make a mistake wouldn’t be incredibly low.
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased. That does not seem likely to me, but I wouldn’t make that claim too confidently without some experimentation. Also, for clarity, I was thinking that the heuristics for weeding out bad plans would be developed automatically. Adding them by hand would not be a good position to be in.
I see the fundamental advantage of the bootstrap idea as being able to worry about a much more manageable subset of the problem: can we keep it on task and not-killing-us long enough to build something safer?
My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it’s foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (God, I miss LaTeX). The goal is to get Friendliness close enough to the top of the list that P (which ought to be roughly constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models and returning a false ‘no-veto’.
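Spelling that out, since I miss LaTeX anyway (and assuming, optimistically, that each solution considered breaks the models independently with the same probability P):

```latex
% Risk of the models being broken before the first Friendly plan is reached,
% with D solutions ranked above it, each breaking the models with probability P:
\[
  \mathrm{Risk} \;=\; 1 - (1 - P)^{D} \;\approx\; P \cdot D \qquad (P \ll 1),
\]
% so the design goal is to keep P \cdot D below whatever threshold we accept.
```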
Let’s go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we’ve got a problem. Let’s say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes, this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like ‘models veto scenarios involving human extinction’, ‘models tend to veto scenarios involving screaming and blood’, and ‘models veto scenarios that involve wires in their brains’.
When parsing potential answers, scenarios that violate these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. The filter doesn’t have to catch all of them, and it isn’t meant to: simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
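To put toy numbers on how much that buys you under the risk model above (every figure here is invented purely for illustration):

```python
# Toy arithmetic for "shrinking the net distance to Friendliness".

D = 1000        # unFriendly plans ranked above the first Friendly one
P = 1e-6        # per-plan chance of breaking the models
recall = 0.9    # fraction of unFriendly plans the cheap heuristics catch early

risk_without_filter = 1 - (1 - P) ** D
risk_with_filter    = 1 - (1 - P) ** (D * (1 - recall))

print(risk_without_filter)  # ~1.0e-3
print(risk_with_filter)     # ~1.0e-4, i.e. the pre-filter buys roughly a factor of ten
```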
If we’re relying on a future superintelligence to reconstruct our brains, why not make its job a little harder?
There’s no reason you couldn’t buy a wearable camera that recorded your inputs and outputs in HD and backed everything up to hard disks. Much cheaper to store than a frozen brain. After a few decades of video, there would have to be more than enough data to do the reconstruction. Then, when you die, you just stick the big stack-o’-hard-drives into a vault and wait for the future AI overlord to find them, scan them, and put them back together into a person again. Boom. Immortality on the cheap.
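Back-of-envelope on the storage, with a recording bitrate I’m just guessing at:

```python
# Rough storage estimate for a few decades of waking-hours HD video.
# The 5 Mbit/s figure is a guess at a reasonable compressed 1080p bitrate.

bitrate_bps  = 5e6     # bits per second of video
waking_hours = 16      # hours recorded per day
years        = 30

seconds = years * 365 * waking_hours * 3600
total_bytes = bitrate_bps / 8 * seconds

print(total_bytes / 1e12)  # ~390 terabytes: a big stack of drives, but cheap
                           # next to keeping a brain in liquid nitrogen
```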
But Eliezer Yudkowsky could tell me anything about AI and I would have no way to tell if he was right or not even wrong.
You know, one upside of logic is that, if someone tells you proposition x is true, gives you the data, and shows their steps of reasoning, you can tell whether they’re lying or not. I’m not a hundred percent onboard with Yudkowsky’s AI risk views, but I can at least tell that his line of reasoning is correct as far as it goes. He may be making some unjustified assumptions about AI architecture, but he’s not wrong about there being a threat. If he’s making a mistake of logic, it’s not one I can find. A big, big chunk of mindspace is hostile-by-default.
Well, if I’m doing morningstar rhetoric, I’d best get my game face on.
First, in the paper below, their estimate of the information density of the brain is somewhat off. What you actually need is the number of neurons in the brain (10^11), squared, times two bytes for half-precision floating-point storage of the strength of each synaptic connection, plus another two bytes to specify the class of neuron, times two as a fudge factor for white matter doing whatever it is that white matter does, which all works out to about 6.4 × 10^23 bits.
Now that we’ve actually set up the problem, let’s see if we can find a way it might still be possible. First, let’s do the obvious thing and forget about the brain stem. Provided you’ve got enough other human brain templates on ice, that can probably be specified in a negligible number of terabytes, to close enough accuracy that the cerebral cortex won’t miss it. What we really care about are the 2 × 10^10 neurons in the cerebral cortex, which brings our overall data usage down to about 10^22 bits. Not a huge improvement, I grant you, but we’re getting somewhere. Second, remember that our initial estimate was for the amount of RAM needed, not the entropy. We’re storing slots in a two-dimensional array for each neuron to connect to every other neuron, which will never actually happen. Assuming 5000 synapses per neuron, all but 5000 out of every 2 × 10^10 entries will be zeroes. Let’s apply run-length encoding for the zeroes only, and we should see a reduction by a factor of a hundred thousand, conservatively. That brings it down to 10^17 bits, or about 11 petabytes.
Now let’s consider that the vast majority of connectomes will never occur inside a human brain. If you generate a random connectome from radio noise and simulate it as an upload, even within the constraints already specified, the result will not be able to walk, talk, reason, or breathe. This doesn’t happen to neurologically healthy adults, so we can deduce that human upbringings and neural topology tend to guide us into a relatively narrow section of connectome space. In fact, I suspect that there’s a good chance that uploading this way would be a process of starting with a generic human template and tweaking it. Either way, if you took a thousand of those randomly generated minds, I would be very surprised if any of them was anything resembling a human being, so we can probably shave another two orders of magnitude off the number of bits required. That’s 10^15 bits of data, or about 113 terabytes. Not too shabby.
Based on this, and assuming that nights are a wash and we get no data, in order to specify all those bits in ten years we would need to capture something like 6.4 megabits of entropy per second, or a little less than a megabyte per second. This seems a little high. However, there are other tricks you could use to boost the gain. For example, if you have a large enough database of human brain images, you could meaningfully fill in gaps using statistics: if, of the ten thousand people with a given set of eight specific synaptic connections, 99% also have a particular ninth one, it’d be foolish not to include it.
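For anyone who wants to check the arithmetic, here’s the whole chain of estimates in one place, using the same rounded figures as above and nothing new:

```python
# Back-of-envelope from the last few paragraphs, order-of-magnitude only.

dense_bits = 1e22            # ~ (2e10 cortical neurons)^2 potential connections * 32 bits, rounded

sparse_bits = dense_bits / 1e5   # run-length encode the zeroes (conservative factor)
human_bits  = sparse_bits / 1e2  # most connectomes never occur in a human brain

seconds_recorded = 10 * 365.25 * 12 * 3600   # ten years of waking hours, nights a wash

print(f"{human_bits:.1e} bits")                              # ~1e15 bits, on the order of 100 TB
print(f"{human_bits / seconds_recorded / 1e6:.1f} Mbit/s")   # ~6.4 Mbit/s of entropy to capture
```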
In short, it seems somewhat infeasible, but not strictly impossible. You could augment the dataset by monitoring spinal activity, by implanting electrodes under your scalp to directly record brain-region activation, and through the future use of statistical analytics.
Now, actually deducing the states of the brain based on its output, as you said, might be difficult or impossible enough to put an end to the whole game before it starts. Still, it might actually work.
So, aside from whoever this was being a raging asshole, I see a few problems with this idea:
First, I’m reminded of the communist regimes that were intended to start as dictatorships and then give way to stateless communism once the transition period was over. Those researchers would need to be REALLY trustworthy.
Second, what conditions are the conscience simulations in? How do they cast their votes? Does this produce any corner cases that may render the emulations unable to cast a veto vote when they want to? Will it try to bribe the models? Are the models aware that they’re emulations? If not, what kind of environment do they think they’re in?
Third, I’m a little terrified by the suggestion of killing trillions of people as part of the normal operation of our AI...