Three years ago I created a map of different ideas about possible AI failures (LW post, pdf). Recently I converted it into the article “Classification of global catastrophic risks connected with artificial intelligence”. I think there are around 100 failure modes we can imagine now, and obviously some unimaginable ones.
However, my classification looks different from the one above: it is a classification of external behaviours, not of internal failure modes. It starts from the risks of AI below human level, like “narrow-AI viruses” or narrow AI used to create advanced weapons, such as biological weapons.
Then I look at different risks during and after AI takeoff. The interesting ones are:
AI kills humans to make the world simpler.
Two AIs go to war.
AI blackmails humanity with a doomsday weapon to get what it needs.
Next is the difference between non-friendly AIs and failures of friendliness. For example, if an AI wireheads everybody, that is a failure of friendliness, as are dangerous value learners.
Another source of failures is technical: bugs, accumulation of errors, conflicting subgoals, and general problems related to complexity. An AI's self-wireheading also belongs here. All this could result in the unpredictable halting of an AI singleton, with catastrophic consequences for all of humanity, which by then is in its care.
The last source of possible AI halting is unresolvable philosophical problems, which effectively halt it. We can imagine several of them, but not all. Such problems include the unsolvable “meaning of life” (or the “is-ought” problem) and the problem that the result of a computation doesn't depend on the AI's existence, so it can't prove to itself that it actually exists.
An AI could also encounter a more advanced alien AI (or its signals) and fall victim to it.
I also have one more category of this type: “alternative work days”.
There are two types of work I do. Either I sit at home alone all day, working on texts about immortality, global risks, or something like that.
Or I work on my art collection, which requires visits to artists and museums, a lot of photography, and conversations with people.
These two kinds of work mostly use different parts of my brain: while one is working, the other one is resting.
The bitter lesson is that methods like search and learning that can scale to more computation eventually win out, as more computation becomes available.
Like evolution and biological neurons’ based brains
But such preferences are somehow recursive: I need to know the real nature of human preferences in order to be sure that other people actually want what I want.
In other words, such preferences about preferences have an embedded idea of what I count as a “preference”: if M. behaves as if she loves me, is that enough? Or should it be her claims of love? Or her emotions? Or the coherence of all three?
Side note: what do you think about preferences about other people's preferences?
For example: “I want M. to love me” or “I prefer that everybody be utilitarian”.
Has this been covered somewhere?
It looks like examples don't work here, as any example is itself an explanation, so it doesn't count :)
But in some sense it could be similar to Gödel's theorem: there are true propositions which can't be proved by the AI (and an explanation could be counted as a type of proof).
OK, another example: there are bad pieces of art; I know they are bad, but I can't explain why in formal language.
Yes, I probably understood “indescribable” as a synonym of “very intense”, not literally as “can't be described”.
But now I have one more idea about a truly “indescribable hellworld”: imagine that there is a quale of suffering infinitely worse than anything any living being has ever felt on Earth, and it appears in some hellworld, but only in animals or in humans who can't speak (young children, patients just before death), or its intensity paralyses the ability to speak and it also can't be remembered. (I have read historical cases of pain so intense that the person was unable to provide very important information.)
So this hellworld will look almost like our normal world: animals live and die, and people live normal and (on time-average) happy lives and also die. But a counterfactual observer able to feel the qualia of any living being would find it infinitely more hellish than our world.
We could also be living in such a hellworld now without knowing it.
The main reason it can't be described is that most people don't believe in qualia, and the observable characteristics of this world will not be hellish. Beings in such a world could also be called reverse p-zombies, as they have a much stronger capacity for “experiencing” than ordinary humans.
There is probably a power law of simulation distribution: the more computationally complex a simulation, the rarer it is among all simulations. Example: there are more computer games running now than whole-atmosphere meteorological simulations.
If we find ourselves in a simpler simulation, it is more likely concentrated around humans, maybe not even all humans but just a few people, while the others are NPCs.
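As a toy illustration of this counting argument (my own sketch with made-up numbers, assuming a fixed total compute budget split evenly across cost tiers; nothing here comes from the original comment):

```python
# Toy sketch: with a fixed compute budget, the number of simulations you can
# afford at a given cost falls off as 1/cost, so cheap, human-centred
# simulations vastly outnumber full-physics ones (a power law with exponent -1).
total_budget = 1e12                      # arbitrary compute units (assumed)
cost_tiers = [1e3, 1e5, 1e7, 1e9, 1e11]  # from "video game" to "whole-planet physics" (assumed)

budget_per_tier = total_budget / len(cost_tiers)  # assume an even split across tiers
for cost in cost_tiers:
    count = budget_per_tier / cost
    print(f"cost {cost:.0e} -> roughly {count:,.0f} simulations")
```

Under these assumed numbers the cheapest tier gets about 200 million runs and the most expensive about 2, which is the intuition behind “more computer games than atmospheric models”.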
I don't understand why you are ruling them out completely: at least at the personal level, long, intense suffering does exist and has happened on a mass scale in the past (cancer patients, concentration camps, witch hunts).
I suggested two different arguments against s-risks:
1) Anthropic: s-risks are not the dominant type of experience in the universe; otherwise we would already be inside one.
2) Larger AIs could “save” minds from smaller but evil AIs by creating many copies of those minds and thus creating indexical uncertainty (detailed explanation here), as well as punishing copies of such evil AIs, thus discouraging any AI from implementing s-risks.
Just before reading this, I had a shower thought that most AI-related catastrophes described previously were of the “hyper-rational” type, e.g. the paperclipper, which from first principles decides that it must produce infinitely many paperclips.
However, this is not how ML-based systems fail. They either fail randomly, when they encounter something like an adversarial example, or fail slowly, by Goodharting some performance measure. Such systems could also be used to create dangerous weapons, e.g. fake news or viruses, or could interact unpredictably with each other.
A future GPT-3 will be protected from hyper-rational failures by the noisy nature of its answers, so it can't stick forever to some wrong policy.
A safety measure (MCAS) actually made the plane less safe. I see it as an example of a possible type of alignment failure.
As the explosion of the cobalt bomb will be very powerful, probably around 10 gigatons, it will put almost all the cobalt into the upper levels of the atmosphere, maybe even into space, from where it will fall all over the Earth. So there will be no localised fallout; instead it will slowly fall from the sky all over the world for months. For comparison, large volcanic eruptions have been able to cover the whole sky with ash for months. (There will be obvious local effects too, for the first thousand kilometres.)
Also check this link: https://www.scribd.com/doc/234366347/Doomsday-Men-Cobalt-Bomb-from-the-book-of-P-D-Smith
Note that 510 tons is only the amount of cobalt-60 produced in the explosion; since the neutron-capture cross-section is not very high, one actually needs to put many thousands of tons of raw cobalt into the bomb. The links above say that the actual weight of the bomb would be about the same as the battleship Missouri, that is, 20,000 tons. (ITER weighs 23,000 tons; a typical nuclear power reactor is around 1,000 tons.)
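A rough back-of-the-envelope check (the activation fraction f below is my own assumed ballpark, not a figure from the linked book): if only a fraction f of the raw cobalt is actually converted to cobalt-60, the raw cobalt required is

$$m_{\text{raw}} = \frac{m_{\text{Co-60}}}{f} = \frac{510\ \text{t}}{f}, \qquad f \approx 0.03 \text{ to } 0.1 \;\Rightarrow\; m_{\text{raw}} \approx 5{,}000 \text{ to } 17{,}000\ \text{t},$$

so thousands to tens of thousands of tons of cobalt, which fits a total device mass on the order of the 20,000-ton battleship figure once the rest of the bomb is included.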
A cobalt bomb may include not only cobalt-60 but also other isotopes, like gold, strontium, or polonium, for higher killing “efficiency”, as they have different half-lives and different biological effects.
Also, the biggest effect of the cobalt bomb will not be acute radiation (as in the novel “On the Beach”), but the poisoning of food chains, as some isotopes, like strontium, tend to accumulate in them.
One can't convert all the cobalt in the bomb into cobalt-60, as the cross-section is not very high. I have read estimates that the total weight should be around 20,000 tons; see more in the relevant chapter of the book “Doomsday Men” by Smith.
I think the actual price will be closer to that of other large nuclear projects like ITER or the LHC, around 10 billion dollars, but that is still affordable for major nuclear powers.
Maybe the right question here is: is it possible to create stronger and stronger qualia of pain, or is the level of pain limited?
If the maximum level of pain is limited, by, say, 10 out of 10, then an evil AI has to create complex worlds, like in the story “I Have No Mouth, and I Must Scream”, trying to hit many of our values in the most unpleasant combination, that is, playing anti-music by pressing on different values.
If there is no limit to the possible intensity of pain, the evil AI will invest more in upgrading the human brain so that it will be able to feel more and more pain. In that case there will be no complexity, just growing intensity. One can see this type of hell at the end of the latest von Trier movie, “The House That Jack Built”. This type of hell is more disturbing to me.
In the Middle Ages the art of torture existed, and this distinction existed too: some tortures were sophisticated, while others were simple but extremely intense, like the testicle torture.
While reading this, I had a thought which may be tangential to everything said above, but it is still a kind of comment.
The thought is that there could be different types of answers to the question “what are human values?”:
1) One is formal: what are the correct ways of representing human values: words, utility functions, equations, choices?
2) Another is factual: what are the actual preferences of this person, or of humans in general?
3) The third is procedural: what should I do to learn this person's preferences?
4) The fourth is philosophical: what are “human values” as a type of object in the ontological sense: moral facts, observations, approximations, predictive models, opinions, self-models, qualia, etc.?
5) The last is neurological: how are these values stored in the brain? What is the neural correlate of a value?
Also, what is the axiological value of human values: why are they good at all? Are they an end point or a starting point, a sin or a blessing?
There is also something like an anti-cascade: if everybody believes that something is a) false and b) in bad taste to discuss, then this creates a social dynamic in which some ideas get discussed less, or the evidence collected in such fields is less well known or is disregarded because it was collected by “those stupid people”.
Examples: quantum immortality, the doomsday argument, parapsychology, UFOs, and, until recently, AGI in the wider IT community.
Note that not all presumably false things are also associated with “bad taste”.
As a result, cascades eventually create something like group-bonding forces, and belief in X makes people choose different “reality bubbles”.
How would you classify other global catastrophic risks according to these types?