In other words, the agent assigned zero probability to an event, and then it happened.
As far as I understand it, you are proposing that the most realistic failure mode consists of many AI systems, all put into positions of power by humans, each optimizing for its own proxy. Call these Trusted Trial and Error AIs (TTEs).
The distinguishing features of TTEs are that they are Trusted: a human put them in a position of power. Humans have refined, understood and checked the code enough that they are prepared to put the algorithm in a self-driving car or a stock management system. They are not lab prototypes. And they are Trial and error learners, not one-shot learners.
Some more description of the capability range I am considering.
Suppose hypothetically that we had TTE reinforcement learners a little better than today's state of the art, and nothing beyond that. The AIs are advanced enough that they can take a mountain of medical data and train themselves to be skilled doctors by trial and error. However, they are not advanced enough to figure out how humans work from, say, a sequenced genome and nothing more.
Give them control of all the traffic lights in a city, and they will learn how to minimize traffic jams. They will arrange for people to drive in circles rather than stay still, so that they do not count as part of a traffic jam. However, they will not do anything outside their preset policy space, like hacking into the traffic light control systems of other cities or destroying the city with nukes.
If such technology is easily available, people will start to use it for things. Some will put it in positions of power; others will be more hesitant. As the only way the system can learn to avoid something is through trial and error, the system has to cause a public outcry (probably several) before it learns not to do so. If no one told the traffic light system that car crashes are bad, in simulations or on past data (an alignment failure), then even if public opinion feeds directly into its reward, it will have to cause several car crashes that are clearly its fault before it learns to cause only crashes that can be blamed on someone else. However, deliberately causing crashes will probably get the system shut off or seriously modified.
Note that we are supposing many of these systems existing, so the failures of some, combined with plenty of simulated failures, will give us a good idea of the failure modes.
The space of bad things an AI can get away with is a small and convoluted subset of the space of bad things. A TTE set to reduce crime rates tries making the crime report forms longer; this reduces reported crime, but humans quickly realize what it is doing. It would have to do this and be patched many times before it came up with a method that humans wouldn't notice.
Given advanced TTEs as the most advanced form of AI, we might slowly develop a problem, but the deployment of TTEs would be slowed by the time it takes to gather data and check reliability, especially given mistrust after several major failures. And I suspect that, due to the statistical similarity of training and testing, many different systems optimizing different proxies, and humans having the best abstract reasoning about novel situations as well as the power to turn the systems off, any discrepancy of goals will be moderately minor. I do not expect such optimization power to be significantly more powerful or less aligned than modern capitalism.
This all assumes that no one manages to make a linear-time AIXI. If such a thing is made, it will break out of any boxes and take over the world. So we have a social process of adaptation to TTE AI, which is already in its early stages with things like self-driving cars, and at any time this process could be rendered irrelevant by the arrival of a superintelligence.
1) Climate-change-caused extinction is not on the table. Low-tech humans can survive everywhere from the jungle to the arctic. Some humans will survive.
2) I suspect that climate change won't cause massive social collapse. It might well knock 10% off world GDP, but it won't stop us having an advanced high-tech society. At the moment, it's not causing damage on that scale, and I suspect that in a few decades we will have biotech, renewables or other techs that will make everything fine. I suspect that the damage caused by climate change won't increase by more than 2 or 3 times in the next 50 years.
3) If you are skilled enough to be a scientist, inventing a solar panel that's 0.5% more efficient does a lot more good than showing up to protests. Protests need many people to work; inventors can change the world by themselves. Policy advisors and academics can suggest action in small groups. Even working a normal job and sending your earnings to a well-chosen charity is likely to be more effective.
4) Quite a few people are already working on global warming. It seems unlikely that a problem needs 10,000,001 people working on it to be solved, such that if only 10,000,000 people work on it, they won't manage. Most of the really easy work on global warming is already being done. This was not the case with AI risk as of 10 years ago, for example. (It has gained a few more people working on it since then, but still nothing like climate change.)
I think the protagonist here should have looked at earth. If there was a technological intelligence on earth that cared about the state of Jupiter’s moons, then it could send rockets there. The most likely scenarios are a disaster bad enough to stop us launching spacecraft, and an AI that only cares about earth.
A superintelligence should assign non-negligible probability to the result that actually happened. Given that the tech was available, a space probe containing an uploaded mind is not that unlikely. If such a probe were a real threat to the AI, it would have already blown up all space probes on the off chance.
The upper bound given on the amount that malicious info can harm you is extremely loose. Malicious info can’t do much harm unless the enemy has a good understanding of the particular system that they are subverting.
Yet policy exploration is an important job. Unless you think that something posted on a blog is going to change policy without anyone double-checking it first, we should encourage the suggestion of radically new policies.
I would like to propose a model that is more flattering to humans, and more similar to how other parts of human cognition work. When we see a simple textual mistake, like a repeated “the”, we don’t notice it by default. Human minds correct simple errors automatically without consciously noticing that they are doing it. We round to the nearest pattern.
I propose that this automatic pattern-matching to the closest thing that makes sense happens at a higher level too. When humans skim semi-contradictory text, they produce a more consistent world model that doesn't quite match up with what is said.
Language feeds into a deeper, sensible world-model module within the human brain, and GPT-2 doesn't really have a coherent world model.
Because your belief about how well AGI is likely to go affects both the likelihood of a bet being evaluated and the chance of winning it, bets about AGI are likely to give dubious results. I also have substantial uncertainty about the value of money in a post-singularity world. Most obviously, if everyone gets turned into paperclips, no one has any use for money. If we get a friendly singleton superintelligence, everyone is living in paradise, whether or not they had money before. If we get an economic singularity, where libertarian ASI(s) try to make money without cheating, then money could be valuable. I'm not sure how we would get that, as an understanding of the control problem good enough not to wipe out humans and fill the universe with banknotes should be enough to make something closer to friendly.
Even if we do get some kind of ascendant economy, given the amount of resources in the solar system (let alone the wider universe), it's quite possible that pocket change would be enough to live for aeons in luxury.
Given how unclear it is whether the bet will get paid, and how much the cash would be worth if it were, I doubt that the betting will produce good info. If everyone thinks that money is more likely than not to be useless to them after ASI, then almost no one will be prepared to lock their capital up in a bet until then.
I suspect that an AGI with such a design could be much safer if it was hardcoded to believe that time travel and hyperexponentially vast universes are impossible. Suppose the AGI thought that there was a 0.0001% chance that it could use a galaxy's worth of resources to send 10^30 paperclips back in time, or create a parallel universe containing 3^^^3 paperclips. It will still chase those options.
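As a back-of-the-envelope check (my own illustration; the "sure payoff" figure is an assumption for comparison, only the 0.0001% and 10^30 come from the scenario above), the expected value of chasing the time-travel option dwarfs any mundane plan:

```python
# Expected-value arithmetic for the hypothetical above.
p_time_travel = 1e-6          # the 0.0001% chance of time travel working
clips_if_it_works = 1e30      # paperclips sent back in time
sure_clips = 1e9              # assumed guaranteed clips from a mundane plan

ev_gamble = p_time_travel * clips_if_it_works   # = 1e24
print(ev_gamble > sure_clips)  # True: the gamble wins by 15 orders of magnitude
```

So unless the tiny probability is hardcoded all the way to zero, the vast payoff still dominates the agent's decision.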
If starting a long plan to take over the world costs it literally nothing, it will do it anyway. A sequence of short-term plans, each designed to make as many paperclips as possible within the next few minutes, could still end up dangerous. If the number of paperclips at time t is c_t, and its power at time t is p_t, then p_{t+1} = 2*p_t and c_t = p_t would mean that both power and paperclips grow exponentially. This is what happens if power can be used to gain power and clips at the same time, with minimal loss of either from also pursuing the other.
If power can only be used to gain one thing at a time, and the rate at which power can grow is less than the rate of time discount, then we are safer.
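The condition can be sketched numerically (my own toy model, not from the original): with c_t = p_t and power multiplying by a growth factor each step, the discounted clip total keeps growing when growth × discount ≥ 1 and converges when it is below 1.

```python
def discounted_clips(growth, gamma, steps):
    """Sum of gamma^t * c_t, where c_t = p_t and p_{t+1} = growth * p_t."""
    power, total = 1.0, 0.0
    for t in range(steps):
        total += (gamma ** t) * power
        power *= growth
    return total

# Doubling power, mild discount (2 * 0.9 > 1): later terms dominate,
# so long power-grabbing plans pay off.
print(discounted_clips(2.0, 0.9, 100) > 10 * discounted_clips(2.0, 0.9, 50))  # True

# Doubling power, steep discount (2 * 0.4 < 1): a geometric series
# converging to 1 / (1 - 0.8) = 5, so distant payoffs are nearly worthless.
print(round(discounted_clips(2.0, 0.4, 100), 6))  # 5.0
```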
This proposal has several ways to be caught out: world-wrecking assumptions that aren't certain. But used with care, a short time frame, an ontology that considers time travel impossible, and, say, a utility function that maxes out at 10 clips, it probably won't destroy the world. Throw in mild optimization and an impact penalty, and you have a system that relies on a disjunction of shaky assumptions, not a conjunction of them.
It is a CDT agent, or something that doesn't try to punish you now so that you make paperclips last week. A TDT agent might adopt the policy of killing anyone who didn't make clips before it was turned on, causing humans who predict this to make clips.
I suspect that it would be possible to build such an agent, prove that there are no weird failure modes left, and turn it on, with a small chance of destroying the world. I'm not sure why you would do that. Once you understand the system well enough to say it's safe-ish, what vital info do you gain from turning it on?
Butterfly effects are essentially unpredictable, given your partial knowledge of the world. Sure, you doing homework could cause a tornado in Texas, but it's equally likely to prevent one. To actually predict which, you would have to calculate the movement of every gust of air around the world. Otherwise you're shuffling an already well-shuffled pack of cards. Bear in mind that you have no reason to distinguish the particular action of "doing homework" from a vast set of other actions. If you really did know what actions would stop the Texas tornado, they might well look like random thrashing.
What you can calculate are the reliable effects of doing your homework. So, given bounded rationality, you are probably best off basing your decisions on those. The fact that this only involves homework might suggest that you have an internal conflict between a part of yourself that thinks about careers and a short-term procrastinator.
Most people who aren't particularly ethical still do more good than harm. (If everyone looks out for themselves, everyone has someone looking out for them. The law stops most of the bad mutual defections in prisoner's dilemmas.) Evil geniuses trying to trick you into doing harm are much rarer than moderately competent nice people trying to get your help to do good.
This is an example of a Pascal's mugging. Tiny probabilities of vast rewards can produce weird behavior. The best known solutions are either a bounded utility function, or an anti-Pascaline agent (an agent that ignores the best x% and worst y% of possible worlds when calculating expected utilities; it can be money-pumped).
Get a pack of cards in which some cards are blue on both sides, and some are red on one side and blue on the other. Pick a random card from the pile. If the subject is shown one side of the card and it's blue, they gain a bit of evidence that the card is blue on both sides. Give them the option to bet on the colour of the other side of the card, before and after they see the first side. Invert the prospect theory curve to get from betting behaviour back to implicit probability. The prediction is that people will perform a larger update in log odds when the pack is mostly one type of card than when the pack is 50:50.
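For contrast, here is the ideal Bayesian baseline (my own sketch; the function name is made up): the log-odds shift is log(2) regardless of pack composition, because the likelihood ratio of "the shown side is blue" is 1 : 0.5 either way, so any dependence on composition in subjects' bets is a deviation from Bayes.

```python
import math

def log_odds_update(n_bb, n_rb):
    """Log-odds shift toward 'blue on both sides' after being shown one blue side.
    n_bb: blue/blue cards in the pack; n_rb: red/blue cards."""
    prior_odds = n_bb / n_rb
    # P(shown side blue | blue/blue) = 1; P(shown side blue | red/blue) = 0.5
    posterior_odds = prior_odds * (1 / 0.5)
    return math.log(posterior_odds) - math.log(prior_odds)

# Identical shift of log(2) for a lopsided pack and a 50:50 pack.
print(log_odds_update(90, 10), log_odds_update(50, 50))
```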
I suspect that if voting reduced your own karma, some people wouldn't vote. As it became obvious that this was happening, more people would stop voting, until karma just stopped flowing at all. (The people who persistently vote anyway would all run out of karma.)
This makes the somewhat dubious assumption that X-risks are not so neglected that even a "selfish" individual would work to reduce them. Of course, in the not-too-unreasonable scenario where the cosmic commons is divided up evenly and you use your portion to make a vast number of duplicates of yourself, the utility, if your utility is linear in copies of yourself, would be vast. Or you might hope to live for a ridiculously long time in a post-singularity world.
The effect that a single person can have on X-risks is small, but for someone selfish with no time discounting, reducing them would be a better option than hedonism now. Although a third alternative, sitting in a padded room being very, very safe, could be even better.
I suspect that the social institutions of Law and Money are likely to become an increasingly irrelevant background to the development of ASI.
If you believe that there is a good chance of immortal utopia and a large chance of paperclips in the next 5 years, the threat that the cops might throw you in jail (on the off chance that they are still in power) is negligible.
The law is blind to safety.
The law is bureaucratic and ossified. It is probably not employing much top talent, as it's hard to tell top talent from the rest if you aren't as good yourself (and it doesn't have the budget or glamor to attract them). Telling whether an organization is on track to not destroy the world is HARD. The safety protocols are being invented on the fly by each team; the system is very complex, technical and only half built. The teams that would destroy the world aren't idiots; they are still producing long papers full of maths and talking about the importance of safety a lot. There are no examples to work with, or understood laws.
Likely as not (not really, there's too much conjunction here), you get some random inspector with a checklist full of things that sound like a good idea to people who don't understand the problem. All AI work has to have an emergency stop button that turns the power off. (The idea of an AI circumventing this was not considered by the person who wrote the list.)
All the law can really do is tell what public image an AI group wants to present, provide funding to everyone, and get in everyone's way. Telling cops to "smash all GPUs" would have an effect on AI progress. The fund-vs-smash axis is about the only lever the law has. It can't even tell an AI project from a maths convention or a normal programming project if the project leaders are incentivized to obfuscate.
After ASI, governments are likely to be relevant only if the ASI was programmed to care about them. Neither paperclippers nor FAI will care about the law. The law might be relevant if we had a tasky ASI that was not trivial to leverage into a decisive strategic advantage (an AI that can put a strawberry on a plate without destroying the world, but that's about the limit of its safe operation).
Such an AI embodies an understanding of intelligence and could easily be accidentally modified to destroy the world. Such scenarios might involve ASI and timescales long enough for the law to act.
I don't know how the law can handle something that can easily destroy the world, has some economic value (if you want to flirt with danger), and, with further research, could grant supreme power. The discovery must be limited to a small group of people (by the law of large numbers of non-experts, one will do something stupid). I don't think the law could notice what it was; after all, the robot in front of the inspector only puts strawberries on plates. They can't tell how powerful it would be with an unbounded utility function.
Firstly, you are confusing dollars and utils.
If you buy this product for $100, you gain the use of it, at value U to yourself. The workers who made it gain $80, which is worth W to you because of your utilitarian preferences. Total value: U + W.
If the alternative was a product of cost $100 whose use you value at the same U, but with all the money going to greedy rich people to be squandered, then you would choose the first: U + W beats U.
If the alternative was spending the $100 to do something insanely morally important, worth 3^^^3 × U, you would do that instead.
If the alternative was a product of cost $100, of value U to yourself, with some of the money going to people who weren't that rich, you would weigh U plus the (smaller) value of their gain against U + W.
If you could give the money to people twice as desperate as the workers, at value 2W, you would do that.
There are also good reasons why you might want to discourage monopolies; any desire to do so is not included in the expected value calculations. But the basic principle is that utilitarianism can never tell you whether some action is a good use of a resource unless you tell it what else that resource could have been used for.
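The principle can be made mechanical (a toy sketch of mine; the util values U and W are made-up stand-ins for the use value and the value you place on the workers' gain):

```python
def best_option(options):
    """Utilitarian choice: pick the alternative with the highest total utility.
    There is no absolute 'good use of $100', only a ranking of alternatives."""
    return max(options, key=options.get)

U = 10.0   # hypothetical value of the product's use to you
W = 8.0    # hypothetical value to you of the workers gaining $80

options = {
    "buy it, workers get paid": U + W,
    "buy rival, money squandered": U,
    "give $100 to people twice as desperate": 2 * W,
}
print(best_option(options))  # buy it, workers get paid
```

Change the menu of alternatives and the verdict on the same purchase changes with it.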
The information needed to describe our particular laws of physics < info needed to describe the concept of “habitable universe” in general < info needed to describe human-like mind.
The biggest slip is the equivocation on the word "intelligence". The Kolmogorov complexity of AIXI-tl is quite small, so intelligences in that sense of the word are likely to exist in the universal prior.
Humanlike minds have not only the clear mark of evolution, but the mark of stone age tribal interactions across their psyche. An arbitrary mind will be bizarre and alien. Wondering if such a mind might be benevolent is hugely privileging the hypothesis. The most likely way to make a humanlike mind is the process that created humans. So in most of the universes with humanoid deities, those deities evolved. This becomes the simulation hypothesis.
The best hypothesis is still the laws of quantum physics or whatever.
We don’t know what we are missing out on without super intelligence. There might be all sorts of amazing things that we would just never consider to make, or dismiss as obviously impossible, without super intelligence.
I am pointing out that being able to make an FAI that is a bit smarter than you (smartness isn't really on a single scale given vastly different cognitive architectures; is Deep Blue smarter than a horse?) involves solving almost all the hard problems in alignment. Once we have done all that hard work, we might as well tell it to make itself a trillion times smarter; the cost to us is negligible, and the benefit could be huge.
AI can also serve as a values repository. In most circumstances, values are going to drift over time, possibly due to evolutionary forces. If we don't want to end up as hardscrabble frontier replicators, we need some kind of singleton. Most types of government or committee have their own forms of value drift, and couldn't keep enough of an absolute grip on power to stop any rebellions for billions of years. I have no ideas other than friendly ASI oversight for how to stop someone in a cosmically vast society from creating a UFASI. Sufficiently draconian banning of anything at all technological could stop anyone from creating UFASI long term, and would also stop most things since the industrial revolution.
The only reasonable scenario I can see in which FAI is not created and the cosmic commons still gets put to good use is if a small group of like-minded individuals, or a single person, gains exclusive access to self-replicating nanotech and mind uploading. They then use many copies of themselves to police the world. They do all programming and only run code they can formally prove isn't dangerous. No one else is allowed to touch anything Turing-complete.
Both blanks are the identity function.
Here is some pseudocode:
def prove(self, p, s, b):
    assert p in self.ps  # p must be a proof system we already trust

def upgrade(self, p1, p2, b):
    # trust p2 if p1 proves that anything p2 can prove, p1 can prove too
    if self.prove(p1, "forall s: (exists b2: p2(s,b2)) => (exists b1: p1(s,b1))", b):
        self.ps.add(p2)
prover.upgrade(PA, nPA, proof)
Where PA is a specific Peano arithmetic proof checker, nPA is another proof checker, and 'proof' is a proof that anything nPA can prove, PA can prove too.
I consider emotions to be data, not goals. From this point of view, deliberately maximizing happiness for its own sake is a lost purpose. It's like writing extra numbers on your bank balance. If, however, your happiness is reliably too low, adjusting it upwards with drugs would be sensible. What's the best level of happiness? The one that produces optimal behavior.
I also find my emotions to be quite weak, and I can consciously change them, just by thinking "be happy" or "be sad" and then feeling happy or sad. It actually feels similar to imagining a mental image, sound or smell.