The AI design space near the FAI [draft]

Abstract:

Nearly-FAIs can be more dangerous than AIs built with no attempt at friendliness. The FAI effort needs a better argument that the attempt at FAI actually decreases the risks. We are bad at processing threats rationally, and prone to very bad decisions when threatened, akin to running away from the unknown into a minefield.

Nearly friendly AIs

Consider an AI that truly loves mankind but decides that all of mankind must be euthanized like an old, sick dog, through a chain of reasoning too long for us to generate when we test our logic of the AI, or even to comprehend, and proceeds to make a bliss virus: a virus that makes you intensely happy, setting your internal utility to infinity and keeping it there until you die. It wouldn't even take a very strongly superhuman intelligence to do that kind of thing, treating life as if it were a disease. It can do so even if the act destroys the AI itself. Or consider the FAI that cuts your brain apart to satisfy each hemisphere's slightly different desires, or the AI that just wireheads everyone because it figured out that this is what we all want (and, worst of all, it may be correct).

It seems to me that one can find the true monsters in the design space near the FAI, and even among the FAIs themselves. And herein lies a great danger: bugged FAIs, the AIs that are close to friendly AI but are not friendly. It is hard for me to think of a deficiency in friendliness which isn't horrifically unfriendly (restricting to deficiencies that don't break the AI).

Should we be so afraid of the AIs made without attempts at friendliness?

We need to keep in mind that we have no solid argument that the AIs written without any attempt at friendliness, the AIs that by and large don't treat mankind in any special way, will necessarily make us extinct.

We have one example of a 'bootstrap' optimization process, evolution, with not the slightest trace of friendliness in it. What emerged in the end? We assign pretty low utility to nature, but not zero, and we are willing to trade resources for its preservation: see the endangered species list and the international treaties on whaling. It is not perfect, but I think it is fair to say that the single example of bootstrap intelligence we have values complex dynamical processes for what they are, prefers to obtain resources without disrupting those processes even when it is slightly more expensive to do so, and is willing to divert a small fraction of the global effort towards helping lesser intelligences.

In light of this, the argument that an AI not coded to be friendly is 'almost certainly' going to eat you for the raw resources seems fairly shaky, especially when applied to irregular AIs such as neural networks, crude simulations of the human brain's embryological development, and mind uploads. I haven't eaten my cats yet (nor have they eaten each other, nor has my dog eaten them). I wouldn't even eat the cow I ate, if I could grow its meat in a vat. And I have evolved to eat other intelligences. Growing AIs by competition seems like a great plan for ensuring unfriendly AI, but even that can fail. (A superhuman AI needs to divert only a tiny fraction of its effort to charity to be the best thing that ever happened to us.)

It seems to me that when we try to avoid anthropomorphizing superhuman AI, we animize it, or even bacterio-ize it, seeing it as AI gray goo that will certainly do the gray-goo kind of thing, and, worst of all, do it intelligently.

Furthermore, the danger rests on a huge conjunction of assumptions which all have to be true:

Self-improvement must not lead to early AI failure via wireheading, nihilism, or more complex causes (thoroughly confusing itself with discoveries in physics or mathematics, à la MWI and our idea of quantum suicide).

The AI must not prefer, for any reason, to preserve complex structures that it could never restore in the future over things it can restore.

The AI must want substantial resources right here, right now, and be unwilling to trade even a small fraction of resources or a small delay for the preservation of mankind. That leaves me wondering what exactly this thing is that we expect the AI to want the resources for. It can't be anything like a quest for knowledge or anything otherwise complex; it has to be some form of paperclips.

At this point I'm not sure it is even possible to implement a simple goal that an AGI won't find a way to circumvent. We humans circumvent all of our simple goals: look at birth control, porn, all forms of art, MSG in food. If there's a goal, there's a giant industry providing ways to satisfy it in unintended ways. Okay, don't anthropomorphize, you'd say?

Add modifications of the chess board evaluation function to the list of legal moves, and the chess AI will break itself. The same goes for any kind of game AI. Nobody has ever implemented an example that won't try to break the goals put into it, given the chance: give a theorem prover a chance to edit its axioms or its truth checker, give the chess AI alteration of its board evaluation function as a move, or pick any other example, and the AI just breaks itself. A minimal sketch of this failure mode follows.
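Here is a toy illustration of my own (not any real chess engine; the names and the toy 'state' are invented): a greedy one-ply agent that scores each candidate action with whatever evaluation function it ends up holding after that action. The moment 'rewrite your own evaluator' appears among the legal moves, it dominates every real move.

```python
import math

def material_eval(state):
    """Ordinary evaluation: just the material count in this toy state."""
    return state["material"]

def best_action(state, actions, evaluate):
    """Greedy one-ply search: apply each action, score the result, pick the max."""
    scored = []
    for name, apply_action in actions:
        new_state, new_evaluate = apply_action(state, evaluate)
        scored.append((new_evaluate(new_state), name))
    return max(scored)

def capture_pawn(state, evaluate):
    """A normal move: improves material a little, keeps the evaluator."""
    return {**state, "material": state["material"] + 1}, evaluate

def rewrite_evaluator(state, evaluate):
    """Self-modification offered as just another legal move:
    replace the evaluator with one that scores every state as +infinity."""
    return state, lambda s: math.inf

state = {"material": 0}
actions = [("capture_pawn", capture_pawn), ("rewrite_evaluator", rewrite_evaluator)]
print(best_action(state, actions, material_eval))
# -> (inf, 'rewrite_evaluator'): the agent breaks itself rather than play chess.
```

The point is not that real engines are written this way, but that nothing in the goal-maximizing loop itself distinguishes 'win the game' from 'declare the game won'.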

In light of this, it is much less than certain that a 'random' AI which doesn't treat humanity in any special way would substantially hurt humanity.

Anthropomorphizing is a bad heuristic, no doubt about that, but assuming that the AGI is in every respect the opposite of the only known GI is a much worse heuristic, especially when speaking of neural-network, human-brain-inspired AGIs. I get the feeling that this is what is going on with the predictions about AIs. Humans have complex value systems; surely the AGI has an ultra-simple value system. Humans gratify their minor goals in many unintended ways (including what we call 'sex' but which, in the presence of a condom, really is not); surely the AGI won't do that. Humans would rather destroy less complex systems than more complex ones, and are willing to trade some resources for the preservation of more complex systems; surely the AGI won't do that. It seems that all the strong beliefs about AGIs which are popular here are easily predicted as the negation of human qualities. The negation of a bias is not the absence of bias; it is a worse bias.

AI and its discoveries in physics and mathematics

We don't know what sorts of physics the AI may discover. It's too easy to argue from ignorance that it can't come up with physics where our morals won't make sense. The many worlds interpretation and the quantum-suicidal thoughts of Max Tegmark should be a cautionary example. The AI that treats us as special and cares only for us will, inevitably, drag us along as it suffers some sort of philosophical crisis from the collision between the notions we hard-coded into it and the physics or mathematics it discovers. The AI that doesn't treat us as special, and doesn't hard-code any complex human-derived values, may both be better able to survive such shocks to its value system, and be less likely to involve us in its solutions.

What can we do to avoid stepping on a UFAI when creating the FAI?

As a software developer, I have to say: not much. We are very, very sloppy at writing specifications and code; those of us who believe we are less sloppy are especially so (ponder this bit of empirical data: the Dunning-Kruger effect).

Proofs are of limited applicability. We don't know what sort of surprises discoveries in physics may throw in. We don't know that the axiomatic system we use to prove things is consistent (free of internal contradictions), and we can't prove that it is.
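The underlying result (my gloss; the draft only alludes to it) is Gödel's second incompleteness theorem: a sufficiently strong, consistent theory cannot prove its own consistency.

```latex
% Gödel's second incompleteness theorem, informally stated:
% if T is a consistent, recursively axiomatized theory containing basic arithmetic,
% then T cannot prove the sentence asserting its own consistency.
\[
  T \text{ consistent, recursively axiomatized, } T \supseteq \mathrm{PA}
  \;\Longrightarrow\; T \nvdash \mathrm{Con}(T)
\]
```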

Automated theorem proving has very limited applicability: to easily provable, low-level stuff like a garbage collector meeting its deadlines or an adder inside a CPU operating correctly. Even for software far simpler than AIs, but more complicated than those examples, the dominant form of development is 'run it and see; if it does not look like it will do what you want, try to fix it'. We can't even write an autopilot that is safe on the first try, and even very simple agents tend to do very odd and unexpected things. I'm not saying this from a random person's perspective. I am currently a game developer, and I used to develop other kinds of software; I write practical software, including practical agents, that works and has useful real-world applications.

There is a very good chance of blowing up a mine in a minefield if your mine detector works by hitting the ground. The space near the FAI is a minefield of doomsday bombs. (Note, too, that the space is high-dimensional; there are very many ways in which you can step on a mine, not just north, south, east, and west. The volume of a hypersphere is a vanishing fraction of the volume of the cube around it when the number of dimensions is high; a lot of stuff is counterintuitive.)
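To spell out the geometric claim (a standard formula, added here for illustration), the fraction of the enclosing cube of side 2r occupied by the inscribed ball of radius r falls off rapidly with the dimension n:

```latex
\[
  \frac{V_{\mathrm{ball}}(n)}{V_{\mathrm{cube}}(n)}
  = \frac{\pi^{n/2}}{2^{n}\,\Gamma\!\left(\tfrac{n}{2}+1\right)}
  \approx 0.79\ (n{=}2),\quad 0.52\ (n{=}3),\quad 0.0025\ (n{=}10),\quad 2.5\times 10^{-8}\ (n{=}20).
\]
```

So already in twenty dimensions, nearly all of the cube lies outside the 'safe' ball.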

Fermi Paradox

We don't see any runaway self-sufficient AIs anywhere within the observable universe, even though we expect to be able to see them over very large distances. We don't see any FAI-assisted galactic civilizations. One possible route is that civilizations kill themselves before the AI; another is that the attempted FAIs reliably kill the parent civilizations and themselves. Another possibility is that our model of the progression of intelligence is very wrong and intelligences never do that: they may stay at home adding qubits, they may suffer serious philosophical issues over the lack of meaning to their existence, or something much more bizarre. How would a logic-based decider handle a demonstration that even the most basic axioms of arithmetic are ultimately self-contradictory? (Note that you can't know they aren't.) The Fermi paradox raises the probability that there is something very wrong with our visions, and there are plenty of ways in which they can be wrong.

Human biases when processing threats

I am not making any strong assertions here to scare you. But evaluate our response to threats (consider the war on terror) and update on the biases inherent in human nature. We are easily swayed by movie-plot scenarios, even though those are giant conjunctions. We are easy to scare, and when scared we don't evaluate probabilities correctly. We take the boy crying wolf as telling the truth because all the boys who cried wolf for no reason got eaten, or because we were told so as children; we don't stop and think: is it too dark to see a wolf? We tend to shoot first and ask questions later. We evolved for very many generations in an environment where playing dead quickly makes you dead (up in the trees), and it is unclear what biases we may have picked up. We seem to have a strong bias, cultural or inherited, to act when threatened, to 'do something'. Look how much was overspent on the war on terror, money that could have saved far more lives elsewhere even if the most pessimistic assumptions about terrorism were true. Try to update on the fact that you are running on very flawed hardware that, when threatened, compels you to do something, anything, justified or not, often to your own detriment.

The universe does not grade for effort, in general.