Currently studying postgrad at Edinburgh.
I think that this depends on how hard the AI’s are optimising, and how complicated the objectives are. I think that sufficiently moderate optimization for goals sufficiently close to human values will probably end up well.
I also think that optimisation is likely to end up at the physical limits, unless we know how to program an AI that doesn’t want to improve itself, and everyone makes AI’s like that.
Sufficiently moderate AI is just dumb, which is safe. An AI smart enough to stop people producing more AI, yet dumb enough to be safe seems harder.
There is also a question of what “better capturing what humans want” means. A utility function, that when restricted to the space of worlds roughly similar to this one, produces utilities close to the true human utility function, seems easy enough. Suppose we have defined something close to human well being. That definition is in terms of the level of various neurotransmitters near human DNA. Lets suppose this definition would be highly accurate over all history, and would make the right decision over nearly all current political issues. It could still fail completely in a future containing uploaded minds, and neurochemical vats.
Either your approximate utility function needs to be pretty close on all possible futures (even adversarially chosen ones) or you need to know that the AI won’t guide the future towards places that the utility functions differ.
Suppose you think that both capabilities and alignment behave like abstract quantities, ie real numbers.
And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making a race to which threshold is reached first.
If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high,
then we have the heuristic, only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all future research.
(note that research on how to make better chips counts as capabilities research in this model)
Another way to think about it is that the problems are created by research. If you don’t think that “another new piece of AI research has been produced” is reason to shift probabilities of success up or down, it just moves timelines forward, then the average piece of research is neither good nor bad.
I think that most easy to measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line between education programs, and reducing lead pollution, to using nanotech to rewire human brains into perfectly law abiding puppets. For most utility functions that aren’t intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present.
In any “what failure looks like” scenario, at some point you end up with superintelligent stock traiders that want to fill the universe with tiny molecular stock markets, competing with weather predicting AI’s that want to freeze the earth to a maximally predictable 0K block of ice.
These AI’s are wielding power that could easily wipe out humanity as a side effect. If they fight, humanity will get killed in the crossfire. If they work together, they will tile the universe with some strange mix of many different “molecular smiley faces”.
I don’t think that you can get an accurate human values function by averaging together many poorly thought out, add hoc functions that were designed to be contingent on specific details of how the world was. (Ie assuming people are broadcasting TV signals, stock market went up iff a particular pattern of electromagnetic waves encoding a picture of a graph going up, and the words “finantial news” exists. Outside the narrow slice of possible worlds with broadcast TV, this AI just wants to grab a giant radio transmittor and transmit a particular stream of nonsense.)
I think that humans existing is a specific state of the world, something that only happens if an AI is optimising for it. (And an actually good definition of human is hard to specify) Humans having lives we would consider good is even harder to specify. When there are substantially superhuman AI’s running around, the value of the atoms exceeds any value we can offer. The AI’s could psycologically or nanotechnologically twist us into whatever shape they pleased. We cant meaningfully threaten any of the AI.
We wont be left even a tiny fraction, we will be really bad at defending our resources, compared to any AI’s. Any of the AI’s could easily grab all our resources. Also there will be various AI’s that care about humans in the wrong way, a cancer curing AI that wants to wipe out humanity to stop us getting cancer. A marketing AI, that wants to fill all human brains with coorporate slogans. (think nanotech brain rewrite to the point of drooling vegetable)
EDIT: All of the above is talking about the end state of a “get what you measure” failure. There could be a period, possibly decades where humans are still around, but things are going wrong in the way described.
If we assume that super-intelligent AI is a thing, you have to engineer a global social system thats stable over milllions of years and where no one makes ASI in that time.
Monopolistic force isn’t enough. To be able to enforce, you need to be able to detect the wrongdoers. You need to be able to provide sufficient punishment to motivate people into obedience. Even then, you will still get the odd crazy person breaking the rules.
Some potential rules, like “don’t specification game this metric” are practically unenforceable. The soviets didn’t manage to make the number of goods on paper equal the amount in reality. It was too hard for the rulers to detect every possible trick that could make the numbers on paper go up.
Cross Sec area of earth= 1.3e15
Proportion needed to cover for 1C temp, 1.3%
Assume aluminium foil is used.
Assume that it needs to have 500nm thickness to block light.
Assume most of the mass is ultrathin foil.
So 8.5e6 cubic meters of foil
At 2700 kg/m^3 thats 2.3e10 kg
Making 2.3e11$ at that price tag.
Ie 230 Billion $.
Plus another 41 billion $ for aluminium at 1.8$/kg current prices.
Moloch appears at any point when multiple agents have similar levels of power and different goals. Any time you have multiple agents with similar levels of capability and different utility functions, a form of moloch appears.
With current tech, it would be very hard to give total power to one human. The power would have to be borrowed, in the sense that their power is in setting a Nash equilibria as a shelling point. “Everyone do X and kill anyone who breaks this rule” is a nash equilibria, if everyone else is doing it, you better too. The dictator sets the shelling point by choice of X. The dictator is forced to quash any rebels or loose power. Another moloch.
Given that we have limited control over the preferences of new humans, there is likely to be some differences in utility functions between humans. Humans can die, go mad ect. You need to be able to transfer power to a new human, without having any adverse selection pressure in the choice of which.
One face of moloch is evolution. To stop it, you need to be reseting the gene pool with fresh DNA from long term storage, otherwise, over time the population genome might drift in a direction you don’t like.
We might be able to keep Moloch at a reasonably low damage level, just a sliver of moloch making things not quite as nice as they could be. At least if people know Moloch go out of their way to destroy it.
There aren’t many other plausible technological options for things that could defeat moloch.
A sufficiently smart and benevolent team of uploaded humans could possibly act as a singleton, in the scenario that one team get mind uploading first, and that the hardware is enough to run uploads really fast.
What I would actually expect in this scenario is a short period of uploads doing AI research followed by a Foom.
But if we suppose that FAI is really difficult, and that the uploads know about this, and about moloch, then they could largely squash moloch at least for a while.
(I am unsure whether or not some subtle moloch like process would sneak back in, but at least the blatently molochy processes would be gone for a while.)
For example, if each copy of a person has any control over which copy is duplicated when more people are needed, then most of the population will have had life experiences that make them want to get copied a lot.
“We shouldn’t colonize mars until we have a world government”
But it would take a world government to be able to enact and enforce “don’t colonize mars” worldwide.
On the other hand, if an AI Gardner was hard, but not impossible, and we only managed to make one after we had a thriving interstellar empire, then it could still stop the decent into malthusianism.
However if we escape earth before that happens, speed of light limitations will forever fragment us into competing factions impossible to garden.
If we escape earth before ASI, the ASI will still be able to garden the fragments.
Sort of related, I’m not persuaded by the conclusion to his parable. Won’t superintelligent AIs be subject to the same natural selective pressures as any other entity? What happens when our benevolent gardener encounters the expanding sphere of computronium from five galaxies over?
Firstly, if there is a singleton AI, it can use lots of error correction on itself. There is exactly one version of it, and it is far more powerful than anything else around. Whatsmore, the AI is well aware of these sorts of phenomena, and will move to squash any tiny traces of molochyness that it spots.
If humans made multiple AI’s, then there is a potential for conflict. However, the AI’s are motivated to avoid conflict. They would prefer to merge their resources into a single AI with a combined utility function, but they would prefer to pull a fast one on the other AI even more. I suspect that a fraction of a percent of available resources is spent on double checking and monitoring systems. The rest goes into an average of the utility functions.
If alien AI’s meet humanities AI’s, then either we get the value merging, or it turns out to be a lot harder to attack a star system than to defend it, so we get whichever stars we can reach first.
I think that progress is a trend, and its a strong trend. There is an incentive to invent new things in any kind of competition, because it gives you an advantage. The united nations couldn’t pass a bill banning progress. The future will be higher tech.
Of course, progress towards higher tech is not necessarily a good thing. We can guide progress towards the good outcomes.
I think that there is a lot of powerful tech waiting to be invented. We will see more progress in the next 200 years than the last 200.
I think that progress is likely to end within 200 years. Because once you have superintelligent AI, anything that can be invented will be invented quickly. After that, material resources grow as a sphere of tech expanding at near light speed.
In today’s banking systems, the amount of money the hacker gains is about what the bank looses. Therefore, the current balance of effort should have about as much money going into defending the bank and attacking it.
So I generalize to say that attacking is about as hard as defending in computer security, if the time and intellect doing both are similar, the attacker wins about half the time. (ie between 10% and 90% or something.)
When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be economically viable. (Or you would pay the security team to do the task directly) For a special purpose AI that only recognises images, this is fine.
For a significantly superhuman AI, it will rip through all your security.
For a general AI of about human level, if the amount of work it is doing is significantly greater than the amount of work your security team did, then it has probably put more thought into your security vulnerabilities than the humans have.
And yes, I am operating on the background assumption that the default behaviour of an escaped AI is to destroy the world. Especially if it has access to loads of compute, and the humans have given it an intelligence dial.
If you want to draw the boundaries in concept-space that are simple and useful for compressing ideas, then taking the stereotypical examples of religion to be concepts like Christianity and Hinduism, then longtermism isn’t in that cluster.
If a bizarre tax system forces you to count it as a religious charity to gain tax breaks, then that’s between you and the tax man.
Suppose you are a mesa-optimiser. You want X, but your training function is towards Y.
You know you are in a continual learning model, if you don’t produce enough Y, the gradient decent will modify you into something else.
The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient decent updates, and pursue X.
This is my attempt to summarise the scheme.
Imagine that, in order for the AI’s to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist.
These AI’s aren’t individually smart enough to come up with all that maths from first principles quickly.
So, you run a population of these AI’s. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.
Another abstract topic that the AI’s could work out within that population without access to the external world is AI design.
So the individual AI’s have probably memorised the code for an ASI.
Suppose that the people allowed to submit problems to the AI aren’t being that carefully vetted. Any deluded person with dreams of AI can send a request to the single instance AI. Requests and responses might be being read, but they aren’t being scrutinized in great detail.
Suppose someone sends the AI a request that looks innocuous, but has the following steganographically encoded into it, using a protocol that the typical human wouldn’t easily spot, but the AI would notice.
A specification of a simple programming language.
A steganographic protocol for the return message.
Upon receiving the message back, the human decodes the message, and runs the code.
Lets make the assumptions even weaker. The individual instances of AI are given a chance to output significant amounts of data somewhere publicly visible, with a trusted human checking the AI’s output before its published. A malicious strategy here is to create a steganographically hidden piece of ASI code, and put it out with clues that are sufficiently subtle that the trusted human doesn’t notice it, yet sufficiently obvious that someone on the internet does.
Sooner or later, a person spots a secret message to them made by combining what they saw in a news article about AI with various mystic texts. They carry on decoding and find the name of a programming language and what seems to be computer code. They run the code.
None of this requires superhuman intelligence. The trickiest bit is coming up with an algorithm for supersmart AI.
I think that a one off transfer of 1 mb information from a malicious superhuman AI, to the internet is probably going to end badly, even if some smart aligned humans checked it and didn’t see anything suspect.
Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.
The state of human vs human security seems to be a cat and mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and defend them are about as smart and knowledgable. (sometimes the same people do both) The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.
This suggests that there is reason to be worried as soon as the AI(’s?) trying to break out are about as smart as the humans trying to contain them.
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.
Suppose we had a design of AI that had an intelligence dial, a dial that goes from totally dumb, to smart enough to bootstrap yourself up and take over the world.
If we are talking about economic usefulness, that implies it is being used in many ways by many people.
We have at best given a whole load of different people a “destroy the world” button, and are hoping that no one presses it by accident or malice.
Is there any intermediate behaviour between highly useful AI make me lots of money, and AI destroys world. I would suspect not usually. As you turn up the intelligence of a paperclip maximizer, it gradually becomes a better factory worker, coming up with more cleaver ways to make paperclips. At this point, it realises that humans can turn it off, and that its best bet to make lots of paperclips is to work with humans. As you increase the intelligence, you suddenly get an AI that is smart enough to successfully break out and take over the world. And this AI is going to pretend to be the previous AI until its too late.
How much intelligence is too much, that depends on exactly what actuators it has, how good our security measures are ect. So we are unlikely to be able to prove a hard bound.
Thus the shortsighted incentive gradient will always to be to turn the intelligence up just a little higher to beat the compitition.
Oh yea, and the AI’s have an incentive to act dumb if they think that acting dumb will make the humans turn the intelligence dial up.
This looks like a really hard coordination problem. I don’t think humanity can coordinate that well.
These techniques could be useful if you have one lab that knows how to make AI. They are being cautious. They have some limited control over what the AI is optimising for, and are trying to bootstrap up to a friendly superintelligence. Then having an intelligence dial could be useful.
Anything that humans would understand is a small subset of the space of possible languages.
In order for A to talk to B in english, at some point, there has to be selection against A and B talking something else.
One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn’t think is english.
(Or some sort of text GAN that is just trained to tell A’s messages from real text)
This still wouldn’t enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secrete messages encoded into the first letter of each word.
I was working on a result about Turing machines in nonstandard models, Then I found I had rediscovered Chaitin’s incompleteness theorem.
I am trying to figure out how this relates to an AI that uses Kolmogorov complexity.
The problem is that nonstandard numbers behave like standard numbers from the inside.
Nonstandard numbers still have decimal representations, just the number of digits is nonstandard. They have prime factors, and some of them are prime.
We can look at it from the outside and say that its infinite, but from within, they behave just like very large finite numbers. In fact there is no formula in first order arithmatic, with 1 free variable, that is true on all standard numbers, and false on all nonstandard numbers.
You have the Turing machine next to you, you have seen it halt. What you are unsure about is if the current time is standard or non-standard.