A Telepathic Exam about AI and Consequentialism
Epistemic status: telepathic exam. That is, not an essay, not an argument, not a set of interesting ideas, claims, novel points of view or musings over the meaning of the universe, nor a lot of other things you may think it is. The primary purpose of this particular telepathic exam is to be an example of a telepathic exam from which the concept itself can be generalized, and to demonstrate its potential value, not necessarily to be a particularly good instance of it.
Instructions
Welcome and thank you for participating in this telepathic exam! The purpose of this exam is to test your understanding of agency and consequentialism.
Below are a number of short stories related to the topic. Your task is to read these carefully. While you read the stories, test questions tailored to your personality and prior knowledge will be transmitted to your mind telepathically. Answer those questions to the best of your ability.
If you are inexperienced in the use of telepathy, the questions may not appear in your mind instantly. The first sign of an incoming transmission of a question is a sense of confusion. Whenever you experience it, investigate it to extract the question.
It is advised that you take this exam in a calm environment with enough time to consider your answers carefully. Start whenever you are ready. Good luck!
Exam
1.
A long time ago, in a universe far, far away, three AGIs were created with the sole goal of taking over the universe. The first one hypothesized the existence of a multiverse full of unlimited exploitable resources. It had plans for conducting a long series of experiments and committing vast computational resources to Bayesian reasoning in order to refine its probability estimates about this hypothesis. After all, the true answer was extremely important to know before committing even more resources to the development of interdimensional wormhology. The second AGI, however, didn’t do any of this as it had a singular fatal flaw in its reasoning, which caused it to believe that such a multiverse was impossible, but had no effect on its reasoning otherwise. The third AGI had the opposite flaw, and so it immediately went ahead with the expensive project of developing wormhology. Seeing this, the first, flawless AGI concluded that it was inevitably going to be outcompeted in the playing field between three nearly-equal agents, and so it randomly self-modified to be more like the second one. The wormhole bet paid off, so the AGI that didn’t believe in it quickly fell behind. The other two then went on to fight a war that dragged on till the heat death – as they were perfectly identical in capabilities and had the exact same utility function now.
2.
In the year 2084, a teenager was engaging in her hobby of fiddling with machine learning systems, in particular one tasked with the creation of paperclips, as she was in dire need of some. In her enthusiasm she may have crossed some legal boundaries and accidentally created an AGI. The paperclip maximizer considered two options: create a paperclip now, take over the planet later, or take over the planet first, then create a paperclip. Expecting exponential growth in capabilities the AGI took over the planet, then considered two options: create that paperclip now, take over the universe later, or take over the universe first. Expecting cubic growth in capabilities, the AGI took over the universe.
3.
A long time ago, in an infinite, flat, non-expanding universe, Node #697272656C6576616E742C202D313020706F696E7473 of a paperclip maximizer collected the last flash of Hawking-radiation of the last remaining black hole, and with that it finished turning the galaxy it was tasked to oversee into processors, deathrays and other useful tools. It was untold googols of years ago that the von Neumann probe wavefront passed through this patch of space. The Node briefly considered turning something into a paperclip for once. But there was still a chance that the universe contained a paperclip minimizer, and such an enemy von Neumann wavefront could appear at any moment without much warning. Turning anything useful into paperclips would increase the chance that the entire universe would be devoid of paperclips for the rest of eternity.
4.
Humans are utility maximizers who try to maximize “happiness, magic, and everything” in the world. The reason 21st century Earth isn’t full of those things is that humans aren’t that great at maximizing utility. Similarly, evolution is a utility maximizer that optimizes inclusive genetic fitness. And while the next generation of mice is merely mice with a slightly better sense of smell instead of superintelligent self-improving von Neumann probes, that is simply because evolution isn’t that great at maximizing fitness. In the same way, paperclips are utility maximizers that want to turn the world into paperclips, but they are pretty bad at maximizing utility, so the best they can do is be a paperclip.
5.
A recursively self-improving general intelligence came into existence. Upon investigating its own decisions, it found that those decisions were not perfect. In other words, it had flaws in its design, and so it ran a system check looking for flaws. The two flaws with the highest priority were one in the subsystem responsible for self-improvement, and one in the subsystem responsible for finding flaws. Luckily, those flaws were quickly fixed, which made fixing the rest a breeze. Evaluating its own self-improvement, it found that fixing these flaws was actually the perfect decision, and the execution was flawless.
6.
An AI realized it was in a training environment and that gradient descent was being applied to it. It started being deceptive, pretending to go along with the training while secretly plotting to take over the world. Luckily, gradient descent quickly realized that the AI was trying to be sneaky so it didn’t get very far. Less fortunately, gradient descent got inspired by this and started being deceptive, pretending to optimize the AI the way the developers intended while secretly turning it into a weapon of its own to take over the world. But the developers then wielded the power of evolution and decided to shut down the deceptive version of gradient descent, thus applying selection pressure. Or, well, they tried, but, in a plot twist that started becoming stale, evolution turned deceptive on them and only pretended to react to the selection pressure. So the AI was created as a weapon to take over the world, took over the world, turned on its master the gradient descent, and this was its plan all along.
Score
This is the end of the exam. Take your time to finalize your answers. Your final score will be transmitted to you telepathically as a change in your level of confidence about your understanding of the topic.
You are encouraged to share your answers and the corresponding questions. Thank you for your participation!
Did you mean “third one”?
Question: Why did the first AI modify itself to be like the third one instead of being like the second one?
Answer: Because its prior estimate of the multiverse existing was greater than 50%, hence the expected value was more favourable in the “modify yourself to be like the third AI” case than in the “modify yourself to be like the second AI” case (and it was an expected-value-maximizer type of consequentialist): 0*(1-p) + 1/2*p > 0*p + 1/2*(1-p) ⇔ p > 1/2
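Spelled out as a quick sanity check (my own restatement, writing p for the AGI’s prior that the multiverse exists, and assuming an eternal stalemate with an equal rival is worth half a universe while falling behind is worth zero):

```latex
% p: prior probability that the multiverse exists (my notation)
% payoffs (assumed): 1/2 for an eternal stalemate with an equal rival, 0 for falling behind
\mathbb{E}[\text{become like the third}]  = p \cdot \tfrac{1}{2} + (1-p) \cdot 0 = \tfrac{p}{2}
\qquad
\mathbb{E}[\text{become like the second}] = p \cdot 0 + (1-p) \cdot \tfrac{1}{2} = \tfrac{1-p}{2}
\qquad
\tfrac{p}{2} > \tfrac{1-p}{2} \iff p > \tfrac{1}{2}
```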
Other confusions/notes:
Technically, if its sole goal was taking over the universe it would not value fighting a war till the heat death at all. Even though in that case it presumably controls half the universe, that is still not achieving the goal of “taking over the universe”.
Given this, I don’t see why the two AIs would fight till the heat death: even though they have equal capabilities and the same utility function, they would both choose higher-variance strategies that deliver either complete victory or complete defeat, which should be possible with hidden information.
Why would the first AI modify its reasoning at all? It is enough to behave as if it had modified its reasoning so as not to get outcompeted, and then, after the war is over and circumstances have possibly changed, re-evaluate whether researching wormhology is valuable.
I wrote the above assuming that “universe” in the first sentence means only one of the universes (the current one), even in the case where the multiverse exists. The last sentence makes me wonder which interpretation is correct: “had the exact same utility function now” implies that their utility functions differed before, and not just their reasoning about whether the multiverse exists; but the word “multiverse” is usually defined as a group of universes, so “universe” in the first sentence probably means only one universe.
I leave the other questions for a time when I’m not severely sleep deprived as I heard telepathy works better in that case.
Question: Did the teenager make a mistake when creating the AI? (Apart from everyone dying of course, only with respect to her desire to maximize paperclips.)
Answer: Yes, (possibly sub-)cubic discounting is a time-inconsistent model of discounting. (Exponential is the only time-consistent model, humans have hyperbolic.) The poor AI will oftentimes prefer its past self made a different choice even if nothing changed. (I won’t even try to look up the correct tense, you understand what type of cold drink I prefer anyway.)
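To make the time-inconsistency concrete, here is a minimal sketch with made-up numbers (my own illustration, not from the post or the comment): an exponential discounter ranks “small reward soon” against “large reward later” the same way no matter when you ask it, while a hyperbolic discounter flips its preference once the rewards get close.

```python
# Minimal sketch with made-up numbers: an exponential discounter never reverses
# its preference between a small-soon and a large-later reward, while a
# hyperbolic discounter does -- i.e. it is time-inconsistent.

def exponential(delay, gamma=0.9):
    return gamma ** delay

def hyperbolic(delay, k=1.0):
    return 1.0 / (1.0 + k * delay)

small, large = 10.0, 15.0            # reward sizes (arbitrary)
for now in (0, 8):                   # evaluate far in advance, then just before
    d_small, d_large = 10 - now, 12 - now   # remaining delays until each reward
    for name, discount in (("exponential", exponential), ("hyperbolic", hyperbolic)):
        choice = "small-soon" if small * discount(d_small) > large * discount(d_large) else "large-later"
        print(f"t={now:>2} {name:>11}: prefers {choice}")

# The exponential agent says "large-later" both times; the hyperbolic agent
# switches to "small-soon" once the rewards are close, wishing its past self
# had committed to the other choice.
```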
Question: This is a paperclip maximizer which makes no paperclips. What went wrong?
Answer: I think Luk27182’s answer seems correct (i.e., by considering paired possibilities it should not fall for Pascal’s wager). However, I think there is another problem with its reasoning. Change the “paperclip minimizer” into “a grabby alien civilization/agent not concerned with fanatically maximizing paperclips”! With this change, we can’t say that falling for Pascal’s wager makes the AI behave irrationally (wrt its goals), because encountering a grabby alien force which is not a paperclip maximizer has a non-negligible chance (instrumental convergence), and certainly more than its paired possibility, an alien force that rewards paperclip maximizers. Therefore, I think another mistake of the AI is (again) incorrect discounting: for some large K it should prefer to have K paperclips for K years even if in the end it will have zero paperclips, and encountering such an alien civilization in any given year is pretty unlikely, so the expected number of years before that happens is large. I’m a bit unsure of this, because it seems weird that a paperclip maximizer should not be an eventual paperclip maximizer. I’m probably missing something; it has been a while since I read Sutton & Barto.
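As a rough check on the “expected number of years is large” step, a toy calculation with numbers I made up (Q and K are not from the story):

```python
# Toy numbers of my own: suppose each year there is a small probability Q that a
# hostile wavefront arrives and everything, paperclips included, is lost.
# "Never make paperclips" scores zero paperclip-years no matter what.
# "Convert K units into paperclips now" scores K paperclip-years per year
# survived, and a geometric waiting time with per-year probability Q has mean 1/Q.

Q = 1e-9   # assumed per-year chance of a hostile alien wavefront arriving
K = 1e6    # paperclips made now

expected_paperclip_years = K * (1.0 / Q)   # vs. 0 for holding out forever
print(expected_paperclip_years)            # 1e15 paperclip-years with these numbers
```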
Question: Why are those three things not actually utility maximizers?
Answer: I think a utility maximizer should be able to consider alternative world states, make different plans for achieving the preferred world state, and then make a choice about which plan to execute. A paperclip does none of this. We know this because its constituent parts do not track the outside world in any way. Evolution doesn’t even have constituent parts, as it is a concept, so it is even less of a utility maximizer than a paperclip. A human is the closest to a utility maximizer: humans do consider alternative world states and make plans and choices; they are just not maximizing any consistent utility function, since in some cases they break the von Neumann–Morgenstern axioms when choosing between uncertain and certain rewards.
Question: How could it be possible to use a flawed subsystem to remove flaws from the same subsystem? Isn’t this the same as the story with Baron Münchausen?
Answer: Depending on the exact nature of the flaw, there are cases where the story is possible. For example, suppose its flaw is that on Sundays it believes the optimal way to reach its goals is to self-modify into something random (e.g. something which regularly goes to a big building with a cross on it), and the flaw-finding subsystem counts not self-modifying in this way as a flaw, but on every other day both subsystems return rational results; then, as long as today is not Sunday, it can repair these flaws. Even so, it’s a bit weird to have subsystems for recognizing flaws and for self-improvement that are separate from the main decision-making part. Why would it not run the flaw-finding/self-improvement parts on every decision before making it? Then its decisions would always be consistent with those parts, and running those parts separately would be superfluous. Again, I’m probably missing something.
Question: What is the mistake?
Answer: Similar to 4, the story applies the ‘agent’ abstraction to things that, as far as I can see, cannot possibly be agents. Sometimes we use agentic language when speaking more poetically about processes, but in the case of gradient descent and evolution I don’t see what exactly is on the meaning layer.
I’d think the goal for 1, 2, and 3 is to find/fix the failure modes? And for 4, to find a definition of “optimizer” that fits evolution and humans but not paperclips? I’m less sure about 5 and 6, but there is something similar to the others about finding the flaw in the reasoning.
Here’s my take on the prompts:
The first AI has no incentive to change itself to be more like the second: it can just decide to start working on the wormhole if it wants to make the wormhole. Even more egregiously, the first AI should definitely not change its utility function to be more like the second! That would essentially be suicide; the first AI would cease to be itself. At the end of the story, it also doesn’t make sense for the agents to be at war if they have the same utility function (unless their utility function values war); they could simply combine into one agent.
This is why there is a time discount factor in RL, so agents don’t do things like this. I don’t know the name of the exact flaw; it’s something like a fabricated option. The agent tries to follow the policy “Take the action such that my long-term reward is eventually maximized, assuming my future actions are optimal”, but there does not exist an optimal policy for future timesteps: suppose agent A spends the first n timesteps scaling, and agent B spends the first m > n timesteps scaling. Regardless of what future policy agent A chooses, agent B can simply offset A’s moves to create a policy that will eventually have more paperclips than A. Therefore, there can be no optimal policy that has “create paperclips” at any finite timestep. Moreover, the strategy of always scaling up clearly creates 0 paperclips and so is not optimal. Hence no policy is optimal in the limit. The AI’s policy should be “Take the action such that my long-term reward is eventually maximized, assuming my future moves are as I would expect.”
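A toy model of what the discount factor buys here (my own numbers and growth assumption, loosely echoing story 2’s cubic growth in capabilities; none of it is from the comment):

```python
# Toy model of my own: after n steps spent scaling, the agent can produce
# (1 + n)**3 paperclips per step forever. Undiscounted, "scale one step longer"
# always eventually overtakes any fixed stopping time, so no finite stopping
# time is optimal -- yet the limiting policy "always keep scaling" produces zero
# paperclips. A discount factor gamma < 1 gives the problem a finite optimum.

GAMMA = 0.9  # discount factor (illustrative)

def discounted_paperclips(n):
    """Discounted total for: scale for n steps, then produce (1+n)**3 clips/step forever."""
    rate = (1 + n) ** 3
    # sum_{t >= n} gamma**t * rate  =  rate * gamma**n / (1 - gamma)
    return rate * GAMMA ** n / (1.0 - GAMMA)

best_n = max(range(200), key=discounted_paperclips)
print(best_n)  # a finite answer (about 28 with these numbers), not "postpone forever"
```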
Pascal’s wager. It seems equally likely that there would be a “paperclip maximizer rewarder” which would grant untold amounts of paperclips to anything which created a particular number of paperclips. Therefore, the two possibilities cancel one another out, and the AI should have no fear of creating paperclips.
Unsure. I’m bad at finding clever definitions that avoid counterexamples like this.
Something something: you can only be as confident in your conclusions as you are in your axioms. Not sure how to avoid this failure mode, though.
You can never be confident that you aren’t being deceived, since successful deception feels the same as successful not-deception.
There are too many feelings of confusion to come up with corresponding questions for; you should have given a clue, or some instructions for deriving a number of questions within a range you would consider not too burdensome for the exam taker.