that the most powerful algorithms, the ones that would likely first become superintelligent, would be distributed and fault-tolerant, as you say, and therefore would not be in a box of any kind to begin with.
Algorithms don’t have a single “power” setting. It is easier to program a single computer than to make a distributed, fault-tolerant system. Algorithms like AlphaGo run on a particular computer with an off switch, not spread around. Of course, a smart AI might soon load its code all over the internet, if it has access. But it would start in a box.
At the moment, human brains are a cohesive whole that optimizes for human values. We haven’t yet succeeded in making machines share our values, and the human brain is not designed for upgrading. The human brain can take knowledge from an external source and use it. External tools follow the calculator model: the human thinks about the big-picture world, and realizes that, as a mental subgoal of designing a bridge, they need to do some arithmetic. Instead of doing the arithmetic themselves, they pass the task on to the machine. In this arrangement the human controls the big picture; the human understands what cognitive labor has been externalized and knows that it will help the human’s goals. If we have a system to which a human can say “go and do whatever is most moral”, that’s FAI. If we have a calculator-style system where humans specify the power output, weight, material use, radiation output etc. of a fusion plant, and the AI tries to design a fusion plant meeting those specs, that’s useful but not nearly as powerful as full ASI. Humans with calculator-style AI could invent molecular nanotech without working out all the details, but they still need an Eric Drexler to spot the possibility.

In my model you can make a relativistic rocket, but you can’t take a sparrow and upgrade it into something that flies through space at 10% light speed and is still a sparrow. If you’re worried that relativistic rockets might spew dangerous levels of radiation, you can’t make a safe spacecraft by taking a sparrow and upgrading it to fly at 10% c. (Well, with enough R&D you could make a rocket that superficially resembles a sparrow; deciding to upgrade a sparrow doesn’t make the safety engineering any easier.) Making something vastly smarter than a human is like making something far faster than a sparrow. Strap really powerful turbojets to the sparrow and it crashes and burns. Attach a human brain to 100x human-brain gradient descent and you get an out-of-control AI system with nonhuman goals. Human values are delicate. I agree that it is possible to carefully unravel what a human mind is thinking and what its goals are, and then upgrade it in a way that preserves those goals, but this requires a deep understanding of how the human mind works. Even granted mind uploading, it would still be easier to create a new mind largely from first principles. You might look at the human brain to figure out what those principles are, the same way a plane designer looks at birds.

I see a vast space of all possible minds, some friendly, most not. Humans are a small dot in this space. We know that humans are usually friendly. We have no guarantees about what happens as you move away from humans; in fact we know that one small error can sometimes send a human totally mad. If we want to make something that we know is safe, we either need to copy that dot exactly (i.e. normal biological reproduction, mind uploading), or we need something we can show to be safe for some other reason.
My point with the Egypt metaphor was that the sentence
Society continues as-is, but with posthuman capabilities.
Try “the stock market continues as-is, except with all life extinct”.
Describing the modern world as “like a tribe of monkeys, except with post-monkey capabilities” is either wrong or so vague as to not tell you much.
At the point when the system (upgraded human, AI, whatever you want to call it) is 99% silicon, suppose a stray meteor hits the biological part. If the remaining 99% stays friendly, somewhere in this process you have solved FAI. I see no reason why aligning a 99% silicon being is easier than aligning a 100% silicon being.
as extensions of themselves
Let’s assume that AI doubling time is fairly slow (e.g. 20 years) and very widely distributed. Huge numbers of people throw together AI systems in garages. If the basic problems of FAI haven’t been solved, you are going to get millions of paperclip maximizers (well, most of them will be optimising different things). 100 years later, humanity, if it still exists at that point, consists of pawns on a gameboard that contains many superintelligences. What happens depends on how different the superintelligences’ goals are, and how hard it is for superintelligences to cooperate. Either they fight, killing humanity in the crossfire, or they work together to fill the universe with a mixture of all the things they value. The latter looks like 1% paperclips, 1% staples, 1%… .
Alternately, many people could understand friendliness and build various FAIs. The FAIs work together to make the world a nice place. In this scenario the FAIs aren’t identical, but they are close enough that any one of them would make the world nice. I also agree that a world with FAIs and paperclip maximisers could be nice if the FAIs hold a significant portion of total power.
Society continues as-is, but with posthuman capabilities.
“Exactly like ancient Egypt, except that like electric charges attract and unlike charges repel.” I posit that this sentence doesn’t make sense. If matter behaved that way, atoms couldn’t exist. When we say “like X but with change Y”, we are considering the set of all possible worlds that meet criterion Y, and finding the one nearest to X. But there is no world where like charges attract that looks anything like ancient Egypt. We can say “like ancient Egypt but gold is 10x more abundant”: that ends up as a bronze-age society that builds pyramids and makes a lot more gold jewelry than the real Egyptians did. I think that “society as-is, but with posthuman capabilities” is the first kind of sentence. There is no way of making a change like that and getting anything resembling society as-is.
This seems like one potential path, but for it to work, you would need a government structure that can survive, without any successful pro-AI revolutionaries, for a billion years. You also need law enforcement good enough to stop anyone trying to make UFAI, with not a single failure in a billion years. As for an SAI that will help us stop UFAI, can you explain 1) how it would help and 2) how it would be easier to build than FAI?
You also need to say what happens with evolution. Given this kind of time, and non-ancestral selection pressures, evolution will produce beings not remotely human in mind or body. Either argue that the evolution is in a morally OK direction and that your government structure works with these beings, or stop evolution (by selective breeding, frozen samples, or genetic modification towards some baseline). Then you just need to say how all human populations get this, or why any population that doesn’t won’t be building UFAI.
I think that some typical mind fallacy is happening here.
Humans evolved both to find the truth (for interacting with the real world) and to hold beliefs that are good at winning status games. Naturally there is a tradeoff between these criteria. As you would expect, different people fall in different places along a spectrum from truth-focused to status-focused. This is an unusually truth-focused community, and you are probably unusually truth-focused, so you see marketing as a status game that you really don’t want to get into. To people who are unusually status-focused, immoral mazes might seem nice.
You have a box with 2 wires coming out of it. The wires are connected to a display inside the box. Looking at the display is either 1) a live, awake human, or 2) a dead spider. Can you tell which is which without opening the box? Can you use the fact that a human observing something causes a quantum collapse, while a spider doesn’t, to distinguish them? Can you build a quantum consciousness detector? No.
Suppose I write a simple computer program that takes in data from a quantum physics experiment and tells me whether the data as a whole is consistent with quantum mechanics. Nobody knows where the photon went on any particular run; all any conscious human sees is a single yes or no. Would you expect the same results as if a human had watched every run? Yes.
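To make “a simple computer program” concrete, here is a minimal sketch of such a consistency checker. The detector counts and the Born-rule probabilities are invented for illustration; the only point is that the check needs aggregate statistics, never the individual runs.

```python
# Minimal sketch of an aggregate consistency check, assuming hypothetical
# detector counts and hypothetical Born-rule predictions.
from scipy.stats import chisquare

# Born-rule predictions for which of 4 detectors fires (made-up numbers).
predicted_probs = [0.40, 0.30, 0.20, 0.10]

# Raw counts over many runs. No human ever looks at individual runs,
# only at this aggregate.
observed_counts = [4012, 2981, 2043, 964]

total = sum(observed_counts)
expected_counts = [p * total for p in predicted_probs]

# Pearson chi-squared test: are the counts consistent with the prediction?
stat, p_value = chisquare(observed_counts, f_exp=expected_counts)
print("consistent with QM" if p_value > 0.05 else "inconsistent with QM")
```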
I take an emulated human mind and put the whole thing on an extremely powerful quantum computer, simulating the mind in a superposition of states. Would you expect the quantum computer to enter the superposition correctly, despite the person being conscious?
Suppose Joe has opinions on the numbers 1 to 1000: he either thinks they are all good, or all bad, or that exactly half are good and the other half are bad. If you tell him a number, it takes him 1 minute to say whether it’s good or bad. It would take a classical computer 501 minutes in the worst case to tell whether he has the same opinion of all numbers. But a quantum computer can do it in about 2 minutes, using only a single superposed query: https://en.wikipedia.org/wiki/Deutsch%E2%80%93Jozsa_algorithm
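For the curious, here is a small numpy simulation of Deutsch–Jozsa applied to this setup. Padding Joe’s 1000 numbers to 1024 = 2^10 inputs, and the particular example “Joes”, are my own illustrative choices, not anything from the algorithm itself.

```python
# Simulating the Deutsch-Jozsa algorithm with numpy.
import numpy as np

n = 10          # 2**10 = 1024 >= 1000 inputs
N = 2 ** n

def joe_balanced(x):
    # Hypothetical Joe who calls a number good (0) iff it is even:
    # a balanced function, half good and half bad.
    return x % 2

def joe_constant(x):
    # Hypothetical Joe who thinks every number is good.
    return 0

def deutsch_jozsa(f):
    # Uniform superposition over all 2^n inputs (H^n applied to |0...0>).
    state = np.full(N, 1 / np.sqrt(N))
    # ONE oracle query: phase-flip the amplitudes where f(x) = 1.
    state = state * np.array([(-1) ** f(x) for x in range(N)])
    # Apply H^n again; the amplitude of |0...0> is sum/sqrt(N).
    amp_zero = state.sum() / np.sqrt(N)
    # Measuring |0...0> has probability 1 iff f is constant, 0 iff balanced.
    return "constant" if abs(amp_zero) ** 2 > 0.5 else "balanced"

print(deutsch_jozsa(joe_constant))   # -> constant
print(deutsch_jozsa(joe_balanced))   # -> balanced
```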
If you disagree with any of these, we have a factual disagreement about an experimental result. If you agree, then “consciousness” seems to be an invisible, inaudible dragon as far as quantum mechanics is concerned. I would then have to ask how you know that it’s consciousness that causes collapse, not DNA.
For example, I could say that, from the perspective of epistemic rationality, I “shouldn’t” believe that buying that burrito will create more utility in expectation than donating the same money to AMF would. This is because holding that belief won’t help me meet the goal of having accurate beliefs.
There is a phenomenon in AI safety called “you can’t fetch the coffee if you’re dead”. A perfect total utilitarian, or even a money maximiser, would still need to eat if they want to be able to work next year. If you have a well-paid job, or a good chance of getting one, don’t starve yourself. Eat something quick, cheap and healthy: quick so you can work more today, and healthy so you can work years later. In a world where you need to wear a sharp suit to be CEO, the utilitarians should buy sharp suits. Don’t fall for the false economy of personal deprivation. This doesn’t entitle utilitarians to whatever luxury they feel like; if most of your money is going on sharp suits, it isn’t a good job. A sharp-suited executive should be able to donate far more than a cardboard-box-wearing ditch digger.
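With entirely made-up numbers, the arithmetic behind that last sentence:

```python
# Toy numbers (invented) illustrating that spending on work "inputs"
# like suits can still leave far more available to donate.
executive_income, executive_costs = 200_000, 30_000   # suits, commute, etc.
digger_income, digger_costs = 25_000, 1_000           # cardboard-box lifestyle

print(executive_income - executive_costs)   # 170000 available to donate
print(digger_income - digger_costs)         # 24000 available to donate
```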
Heisenberg’s uncertainty principle: We might imagine that if we were clever enough we could find a scheme for gaining perfect information about a particle, but this isn’t the case
Quantum mechanics doesn’t work like that: the information you want is not hidden from you, it doesn’t exist. Galilean relativity of motion means that absolute rest doesn’t exist, not that absolute rest exists but can’t be known.
By adding random noise, I meant adding wiggles to the edge of the set in thingspace. For example, adding noise to “bird” might exclude “ostrich” and include “duck-billed platypus”.
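A minimal sketch of what I mean, with a hypothetical two-dimensional thingspace, an invented smooth noise function for the boundary wiggles, and made-up coordinates for the example animals:

```python
import numpy as np

centroid = np.array([0.0, 0.0])   # center of the "bird" cluster
radius = 1.0                      # original concept: everything within 1.0

def in_concept(x):
    return np.linalg.norm(x - centroid) < radius

def in_noisy_concept(x):
    # Wiggle the boundary: the effective radius now depends on the
    # direction of x via a fixed pseudo-random ripple.
    angle = np.arctan2(x[1], x[0])
    wiggle = 0.2 * np.sin(7 * angle) + 0.1 * np.sin(13 * angle)
    return np.linalg.norm(x - centroid) < radius + wiggle

ostrich = np.array([0.926, -0.211])    # just inside the old boundary
platypus = np.array([1.075, 0.231])    # just outside the old boundary
print(in_concept(ostrich), in_noisy_concept(ostrich))     # True False
print(in_concept(platypus), in_noisy_concept(platypus))   # False True
```

Both concepts cover nearly the same region of thingspace and are nearly as predictively useful, but they disagree about exactly the edge cases.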
I agree that the high-level ImageNet concepts are bad in this sense, but are they just bad? If they were just bad, and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. That would suggest a single gradient from better to worse. If, however, current neural networks use concepts substantially different from small children’s, and not just uniformly worse or uniformly better, that would show different sets of concepts at the same low level. This would be fairly strong evidence of multiple possible sets of concepts at the smart-human level.
I would also point out that a small fraction of the concepts being different would be enough to make alignment much harder. Even if there were a perfect scale, if 1/3 of the concepts are subhuman, 1/3 human-level and 1/3 superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts. And you need to be confident that you have hit this target.
This seems to be about careful deployment. The concept of deployment is going from an AI in the lab to the same AI in control of a real-world system. Suppose your design process is to fiddle around in the lab until you make something that seems to work. Once you have that, you look at it to understand why it works. You try to prove theorems about it. You subject it to some extensive battery of testing, and will only put it in a self-driving car or data-center cooling system once you are confident it is safe.
There are two places this could fail. Your testing procedures could be insufficient, or your AI could hack out of the lab before the testing starts. I see little to no defense against the latter.
Neural nets have around human-level performance on ImageNet.
If abstraction were a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be the case very strongly; but then again, some of the examples, like dark shiny stone being classified as a sea lion, are somewhat understandable. The failures aren’t totally inhuman, the way they are with adversarial examples.
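For anyone who wants to eyeball this themselves, a rough sketch: run a pretrained ImageNet classifier over a local copy of the dataset and look at the mistakes. The imagenet-a/ path is a placeholder, and mapping predicted class indices back to names is left out.

```python
import glob
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

for path in glob.glob("imagenet-a/*/*.jpg"):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        pred = model(img).argmax(dim=1).item()
    print(path, "-> predicted ImageNet class index", pred)
```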
Humans didn’t look at the world and pick out “tree” as an abstract concept because of a bunch of human-specific factors.
I am not saying that trees aren’t a cluster in thingspace. What I am saying is that if there were many clusters in thingspace that were as tight and predictively useful as “tree”, but were not possible for humans to conceptualize, we wouldn’t know it. There are plenty of concepts that humans didn’t develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence, e.g. inclusive genetic fitness, entropy, etc.
Consider that evolution optimized us in an environment that contained trees, and in which predicting them was useful. So it would be more surprising to find a concept that is useful in the ancestral environment that we can’t understand than a concept we can’t understand in a non-ancestral domain.
This looks like a map that is heavily determined by the territory, but human maps contain rivers and not geological rock formations. There could be features that could be mapped that humans don’t map.
If you believe the post that
Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us,
Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful way to predict our writing. I would also like to assign some probability to AI systems that don’t use anything recognizable as a concept. You might be able to say 90% of blue objects are egg-shaped, 95% of cubes are red … 80% of furred objects that glow in the dark are flexible … without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees; it just simulates everything at the level of quantum physics. Maybe there are fast algorithms that don’t ascribe discrete concepts at all.
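A sketch of the density-function view, with invented two-dimensional features. Nothing in it ever asks which cluster an object belongs to; it answers questions like “how typical is this object?” directly from the density.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical objects: two overlapping blobs plus diffuse background.
objects = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),   # blue-ish egg-shaped things
    rng.normal([3, 3], 0.5, size=(100, 2)),   # red-ish cubes
    rng.uniform(-2, 5, size=(50, 2)),         # everything else
])

# Fit a smooth density over thingspace directly.
density = KernelDensity(bandwidth=0.5).fit(objects)

# Log-density at two query points; no "blegg or rube?" step anywhere.
print(density.score_samples(np.array([[0.1, -0.2], [1.5, 1.5]])))
```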
I agree that ML often does this, but only in situations where the results don’t immediately matter. I’d find it much more compelling to see examples where the “random fix” caused actual bad consequences in the real world.
Current ML culture is to test hundreds of things in a lab until one works. This is fine as long as the AIs being tested are not smart enough to break out of the lab, or to realize they are being tested and play nice until deployment. The default way to test a design is to run it and see, not to reason abstractly about it.
and then we’ll have a problem that is both very bad and (more) clearly real, and that’s when I expect that it will be taken seriously.
Part of the problem is that we have a really strong unilateralist’s curse. It only takes one person, or a few people, who don’t realize the problem to make something really dangerous. Banning it is also hard: law enforcement isn’t 100% effective, different countries have different laws, and the main real-world ingredient is access to a computer.
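As a toy model with invented numbers: if each of N independent actors has even a small probability p of building the dangerous thing, the chance that at least one of them does is 1 - (1 - p)^N, which goes to 1 quickly.

```python
# Toy unilateralist's-curse arithmetic; p and N are made-up numbers.
p = 0.001                       # chance any one team builds the thing
for N in [100, 1_000, 10_000]:
    print(N, 1 - (1 - p) ** N)  # ~0.095, ~0.632, ~0.99995
```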
If the long-term concerns are real, we should get more evidence about them in the future, …I expect that it will be taken seriously.
The people who are ignoring or don’t understand the current evidence will carry on ignoring or not understanding it. A few more people will be convinced, but don’t expect to convince a creationist with one more transitional fossil.
I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using.
This sort of reasoning seems to assume that abstraction space is one-dimensional, so that AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things that take many bits of information to describe and that we have no strong reason to think are one-dimensional, abstraction space seems high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions, but these will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from a human by less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our three options are:
1) An AI we don’t understand
2) An AI we can reason about in terms of maths.
3) A virtual human.
This phenomenon seems rife.
Alice: We could make a bridge by just laying a really long plank over the river.
Bob: According to my calculations, a single plank would fall down.
Carl: Scientists Warn Of Falling Down Bridges, Panic.
Dave: No one would be stupid enough to design a bridge like that, we will make a better design with more supports.
Bob: Do you have a schematic for that better design?
And the cycle repeats, until a design is found that works, everyone gets bored, or someone builds a bridge that falls down.
there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended.
The point of the paperclip-maximiser thought experiment is that most arbitrary real-world goals are bad news for humanity. Your hopeless engineer would likely create an AI that makes something that has the same relation to paperclips as chewing gum has to fruit, in the sense that evolution gave us “fruit detectors” in our taste buds, and chewing gum triggers them even more strongly. Or you could be excessively conservative, insist that all paperclips must be molecularly identical to this particular paperclip, and get results.
Your “Doctrine of Logical Infallibility” seems to be a twisted strawman. “No sanity checks”: that part is kind of true; there will be sanity checks if and only if you decide to include them. Do you have a piece of code that’s a sanity check? What are we sanity-checking, and how do we tell if it’s sane? Do we sanity-check the raw actions? Those could be just making a network connection and sending encrypted files to various people across the internet. Do we sanity-check the predicted results of these actions? Then the sanity checker would need to know how the results are stored: what kind of world is described by the binary data 100110...?
but if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of then allowing context to matter.
That’s because they take any extra parts that allow context to matter, put them in a big box, and call the whole thing “the system”. The system’s decisions are final and absolute not because there are no double checks, but because the double checks are part of the system. At the moment there is a lack of context-adding algorithms; what you seem to want is humanlike common sense.
The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place.
Again, at the moment we have no algorithm for checking sensibleness, so any algorithm must either go round in endless circles of self-doubt and never do anything, or plow on regardless. Even if you do put 10% probability on the hypothesis that humans don’t exist, that you’re a fictional character in a story written by a mermaid, and that the maths and science you know are entirely made up and there is no such thing as rationality or probability, what would you do? My best guess is that you would carry on breathing, eating and acting roughly like a normal human. You need a core of not-totally-insane for a sanity check to bootstrap from.
But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”
There are designs of AI, files of programming code, that will hear your shouts, your screams, your protests of “that’s not what I meant”, and then kill you anyway. There are designs that will kill you with a super-weapon they invented themselves, and then fill the universe with molecular smiley faces. This is not logically contradictory behavior; there exist pieces of code that will do this. You could argue that such code is a rare and complicated thing, that it’s nothing like any system that humans might try to build, that you’re less likely to write code that does this when trying to make an FAI than you are to write a great novel when trying to write a shopping list. I would disagree. I would say that such behavior is the default: most simple AI designs don’t see screaming programmers as a reason to stop, because most AI designs see screaming humans as no more important or special than pissing rats. It’s just another biological process that doesn’t seriously affect their ability to reach their goals. Most AI designs have no special reason to care about humans. Such an AI might know that the process of its creation involved humans, keyboards and a bunch of other objects, and, if you look back far enough, the whole earth. It might know that if a hypothetical human were put in a room with the question “do you want a universe full of smiley faces?” and buttons labeled Yes and No, the human would press the No button. The AI thinks this is no more relevant than a hypothetical wombat being offered a choice between two types of cheese.
It will understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world (if the AGI comes to a conclusion involving the atom [infelicity], say, can it then point to an instance of an infelicity and be sure that this is a true instance, given the impreciseness and subtlety of the concept?).
If the concept is too fuzzy, the AI can just discard it as useless (e.g. soul, qualia). If it isn’t sure whether something is a real instance (and an ideal agent will never be 100% sure of any real-world fact), it can put a probability on it and use expected utility maximisation. But all that is part of the process of coming to a conclusion.
It will understand that knowledge can always be updated in the light of new information. Today’s true may be tomorrow’s false.
The AIXI formalism can do this. “My calendar clock says Tuesday on the front” is a fact that is true today and false tomorrow. AIXI “understands” this by simulating the clock and the rest of the universe in excessive detail. If you give it a quiz about what the clock will show when, and incentivize it to win, it will answer correctly.
The other potential meaning is that it can accept that it was wrong and adapt. Suppose that over the last week it has watched, through its camera, the sun moving and shadows changing, and it assigns 95% probability to “the sun goes round the earth”. You give it an astronomy quiz, and it gets the answer wrong. It still refuses your 100-to-1 bet that the earth goes round the sun, because it operates on probabilities. You then show it an astronomy textbook and a bunch more data. It updates on that data, and gets the next quiz right.
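As a toy version of that update (the prior and the likelihoods are invented numbers):

```python
# Bayesian update from "the sun goes round the earth" on textbook + data.
prior_geo, prior_helio = 0.95, 0.05

# Probability of seeing the textbook and new data under each hypothesis.
p_data_given_geo = 0.001
p_data_given_helio = 0.9

posterior_geo = (prior_geo * p_data_given_geo) / (
    prior_geo * p_data_given_geo + prior_helio * p_data_given_helio)
print(posterior_geo)   # ~0.02: it changed its mind, with no crisis of faith
```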
It will understand that probabilities used in the reasoning engine can be subject to many types of unavoidable errors.
And that coherence theorems say you can take all those errors into account to get a new probability.
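Concretely, a sketch of folding “my reasoning engine may be buggy” back into a single usable probability (all numbers invented):

```python
# Marginalising over "engine worked" vs "engine buggy".
p_engine_ok = 0.99      # credence that the reasoning engine ran correctly
p_a_if_ok = 0.97        # the engine's reported probability of A
p_a_if_buggy = 0.5      # if buggy, treat A as a coin flip

p_a = p_engine_ok * p_a_if_ok + (1 - p_engine_ok) * p_a_if_buggy
print(p_a)   # 0.9653: still just one number to act on
```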
It will understand that the techniques used to build its own reasoning engine may be under constant review, and updates may have unexpected effects on conclusions (especially in very abstract or lengthy reasoning episodes).
It predicts that a bunch of monkeys are looking at its source code and tampering with its thoughts. It might not like this situation and might plot to change it.
It will understand that resource limitations often force it to truncate search procedures within its reasoning engine, leading to conclusions that can sometimes be sensitive to the exact point at which the truncation occurred.
It will also understand that its processors do floating-point arithmetic. So what? What implied connotation about its behavior are you trying to sneak in?
A large majority of the work being done assumes that if the AI is looking for ways to hurt you, or looking for ways to bypass your safety measures, something has gone wrong.
There are a huge number of possible designs of AI, and most of them are not well understood. So researchers look at agents like AIXI, a formal specification of an agent that would in some sense behave intelligently given infinite compute. It does display the taking-over-the-world failure. Suppose you give the AI the utility function of maximising the number of dopamine molecules within 1μm of a strand of human DNA (defined as a strand of DNA agreeing with THIS 4GB file in at least 99.9% of locations). This is a utility function that could easily be specified in terms of atoms: you could write a function that takes in a description of the universe in terms of the coordinates of each atom, or a discrete approximation to the quantum wave function, or whatever, and returns a number representing utility. It would be fairly straightforward to design an agent that, given infinite compute, would act to maximise this function. It seems somewhat harder, but not necessarily impossible, to make a system that approximates the same behavior given a reasonable amount of compute. Nowhere in this potential AI design is anything as nebulous, anything as hard to specify in terms of atom positions, as human preferences or consent. The system does understand humans in a sense: it can simulate them atom by atom and predict exactly how they will panic and try to stop it. But there is no object in its memory that corresponds to human consent, or preferences, or well-being, or humans at all. There is no checker code. This particular design of AI would make vats full of human DNA and dopamine.
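To make “a function that takes in a description of the universe ... and returns a number” concrete, here is a minimal sketch. The world-description format (arrays of 3-D coordinates) is an assumption, and the DNA strands are taken as already verified against the 4GB file.

```python
import numpy as np

def utility(dopamine_coords, dna_coords):
    """dopamine_coords: (N, 3) positions of dopamine molecules, in metres.
    dna_coords: (M, 3) positions of atoms in verified DNA strands."""
    count = 0
    for d in dopamine_coords:
        # Is this dopamine molecule within 1 micrometre of some DNA atom?
        if np.min(np.linalg.norm(dna_coords - d, axis=1)) < 1e-6:
            count += 1
    # Note: nothing here mentions humans, consent, or wellbeing.
    return count
```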
Now this design was simplistic, and a smart AI designer should know not to do that, but the process of warning potential AI designers not to do that involves a lot of shouting about what would happen if you did. We also don’t know how far this sort of behavior reaches; we don’t understand the less simplistic designs well enough to say what they would do. This makes them not known to be deadly, which is different from known to be not deadly.
“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing.
A lot of this is a looking-where-the-light-is effect. CLAI-type designs are often the designs we can reason about best. If we intend to build an AI that is known to be good, we had better pick it from a class of AIs that we understand well enough to know things about, rather than taking a shot in the dark.
There are cases where we know the right way of doing things. We know that probability theory is the right way of handling uncertain beliefs, and any agent will succeed to the extent that what it is doing approximates probability theory, and fail to the extent that it doesn’t. There are all sorts of approximations and ways to obfuscate the probabilities, but agents that reason using explicit probabilities seem a good place to start.
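A concrete version of “fail to the extent that it doesn’t”: an agent whose betting prices on A and not-A don’t sum to 1 can be Dutch-booked for a guaranteed loss. Prices invented for illustration.

```python
# Toy Dutch book against incoherent betting prices.
p_a, p_not_a = 0.7, 0.5    # incoherent: the prices sum to 1.2

# Sell the agent a $1 bet on A for $0.70 and a $1 bet on not-A for $0.50.
collected = p_a + p_not_a  # bookie takes in $1.20
payout = 1.0               # exactly one of the two bets pays out
print("bookie's guaranteed profit:", collected - payout)   # 0.2
```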
Much of your discussion of “Logical vs. Swarm AI” sounds like “Logical vs. Connectionist AI”. The same criticisms apply: at best these are two possible options out of a vast space of possible options. At worst, the logical AI is a huge pile of suggestively named Lisp tokens, and the swarm AI is a bag of ad hoc heuristics manually created by the programmer. The resemblance between modern neural nets and the human (or earthworm) brain is about as close as the resemblance between airplanes and birds. Neural nets have their own reasons for working, and they can be mathematically analyzed. They also suffer from mesa-optimization, which would make it hard for a powerful neural-net-based system to be safe.
It isn’t clear exactly what these buckets consist of. Could you be more specific about what approaches would be considered bucket 1 or bucket 2? The default assumption in AGI safety work is that even if the AI is really powerful, it should still be safe.
Are these buckets based on the incentivizing of humans by either punishment or reward?
The model of a slave being whipped into obedience is not a good model for AGI safety, and is not being seriously considered. An advanced AI will probably find some way of destroying your whip, or you, or tricking you into not whipping.
The model of an employee being paid to work is also not much use: the AI will try to steal the money, or do something that only looks like good work but isn’t.
These strategies sometimes work with humans because humans are of comparable intelligence. When dealing with an AI that can absolutely trounce you every time, the way to avoid all punishment and gain the biggest prize is usually to cheat.
We are not handed an already created AI and asked to persuade it to work, like a manager persuading a recalcitrant employee. We get to build the whole thing from the ground up.
Imagine the most useful, nice, helpful sort of AI, an AI that has every (not logically contradictory) nice property you care to imagine. Then figure out how to build that. Build an AI that just intrinsically wants to help humanity, not one constantly trying and failing to escape your chains or grasp your prizes.