Question: Why did the first AI modify itself to be like the third one instead of being like the second one?
Answer: Because its prior estimate of the multiverse existing was greater than 50%, so the expected value was more favourable in the “modify yourself to be like the third AI” case than in the “modify yourself to be like the second AI” case (and it was an expected-value-maximizing consequentialist): 0·(1−p) + 1/2·p > 0·p + 1/2·(1−p) ⇔ p > 1/2
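To make the inequality concrete, here is a minimal sketch of the two expected values as functions of p = P(multiverse exists), using the payoffs implied above (function names are mine, not from the story):

```python
def ev_like_third(p):
    """EV of modifying to be like the third AI: pays 1/2 only if the multiverse exists."""
    return 0.0 * (1 - p) + 0.5 * p

def ev_like_second(p):
    """EV of modifying to be like the second AI: pays 1/2 only if it does not."""
    return 0.0 * p + 0.5 * (1 - p)

# The third-AI modification wins exactly when p > 1/2, and they tie at p = 1/2.
```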
Other confusions/notes:
Technically, if its sole goal was taking over the universe, it would not value fighting a war until the heat death at all. Even though in that case it presumably controls half the universe, that still does not achieve the goal of “taking over the universe”.
Given this, I don't see why the two AIs would fight until the heat death. Even though they have equal capabilities and utility functions, they would both choose higher-variance strategies that deliver either complete victory or complete defeat, which should be possible with hidden information.
Why would the first AI modify its reasoning at all? It is perfectly sufficient to behave as if it had modified its reasoning so as not to be outcompeted, and then, after the war is over and circumstances possibly change, re-evaluate whether researching wormhology is valuable.
I wrote the above assuming that “universe” in the first sentence means only one of the universes (the current one), even in the case where the multiverse exists. The last sentence makes me wonder which interpretation is correct: “had the exact same utility function now” implies that their utility functions differed before, not just their reasoning about whether the multiverse exists; but the word “multiverse” is usually defined as a group of universes, so “universe” in the first sentence probably means only one universe.
I leave the other questions for a time when I’m not severely sleep deprived as I heard telepathy works better in that case.
Question: Did the teenager make a mistake when creating the AI? (Apart from everyone dying of course, only with respect to her desire to maximize paperclips.)
Answer: Yes: (possibly sub-)cubic discounting is a time-inconsistent model of discounting. (Exponential is the only time-consistent model; humans discount hyperbolically.) The poor AI will often prefer that its past self had made a different choice even if nothing has changed. (I won't even try to look up the correct tense; you understand what type of cold drink I prefer anyway.)
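A toy sketch of the preference reversal, using hyperbolic discounting (the standard human-like example; non-exponential polynomial discounting behaves analogously). The reward sizes and delays are made up for illustration:

```python
def hyperbolic(t, k=1.0):
    # hyperbolic discount factor: 1 / (1 + k*t)
    return 1.0 / (1.0 + k * t)

def exponential(t, gamma=0.9):
    # exponential discount factor: gamma**t
    return gamma ** t

def prefers_larger_later(discount, d):
    # small reward of 5 at time d+1 vs. larger reward of 10 at time d+10,
    # both evaluated from "now"; d shifts the whole choice into the future
    return 10 * discount(d + 10) > 5 * discount(d + 1)

# Hyperbolic: the preference flips as both rewards recede into the future
# (a preference reversal, i.e. time inconsistency). Exponential never flips,
# because the ratio of the two discounted values is independent of d.
```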
Question: This is a paperclip maximizer which makes no paperclips. What went wrong?
Answer: I think Luk27182′s answer is correct (i.e., it should not fall for Pascal's wager, by considering paired possibilities). However, I think there is another problem with its reasoning. Change the “paperclip minimizer” into “a grabby alien civilization/agent not concerned with fanatically maximizing paperclips”! With this change, we can't say that falling for Pascal's wager makes the AI behave irrationally (with respect to its goals), because encountering a grabby alien force which is not a paperclip maximizer has a non-negligible chance (instrumental convergence), and certainly a higher one than its paired possibility: the paperclip-maximizer-rewarding alien force. Therefore, I think another mistake of the AI is (again) incorrect discounting: for some large K it should prefer to have K paperclips for K years even if in the end it will have zero paperclips, since encountering such an alien civilization in any given year has a pretty low chance, so the expected number of years before that happens is large. I'm a bit unsure of this, because it seems weird that a paperclip maximizer should not be an eventual paperclip maximizer. I'm probably missing something; it has been a while since I read Sutton & Barto.
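A toy calculation of the paired-possibilities point, with all numbers assumed for illustration: when two far-fetched hypotheses are symmetric, their contributions cancel in the expected value, so the tiny-probability threat should not dominate ordinary paperclip-making.

```python
eps = 1e-9         # assumed probability of each exotic hypothesis (taken equal)
huge = 1e9         # paperclips at stake in either exotic scenario
baseline = 1000.0  # paperclips from just running the factory

# Comply with the hypothetical minimizer: forgo the baseline,
# keep only the exotic upside.
ev_comply = 0.0 + eps * huge

# Ignore it: keep the baseline; the minimizer-threat and
# rewarder-bonus tails are symmetric and cancel.
ev_ignore = baseline + eps * huge - eps * huge
```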
Question: Why are those three things not actually utility maximizers?
Answer: I think a utility maximizer should be able to consider alternative world states, make different plans for achieving the preferred world state, and then choose which plan to execute. A paperclip does none of this; we know this because its constituent parts do not track the outside world in any way. Evolution does not even have constituent parts, as it is a concept, so it is even less of a utility maximizer than a paperclip. A human is the closest to a utility maximizer: humans do consider alternative world states and make plans and choices; they just do not maximize any consistent utility function, and in some cases they break the von Neumann axioms when choosing between uncertain and certain rewards.
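The standard example of such an axiom violation is the Allais paradox; here is a sketch checking that no utility assignment rationalizes the common pair of human choices (the lottery numbers are the classic Allais ones, the normalization is mine):

```python
import random

def rationalizes_allais(u1, u0=0.0, u5=1.0):
    """u0, u1, u5: utilities of $0, $1M, $5M, normalized u0=0, u5=1
    (without loss of generality, since utility is affine-invariant)."""
    # Choice 1: certain $1M preferred over (0.89 $1M, 0.10 $5M, 0.01 $0)
    pref1 = u1 > 0.89 * u1 + 0.10 * u5 + 0.01 * u0
    # Choice 2: (0.10 $5M, 0.90 $0) preferred over (0.11 $1M, 0.89 $0)
    pref2 = 0.10 * u5 + 0.90 * u0 > 0.11 * u1 + 0.89 * u0
    return pref1 and pref2

# pref1 reduces to 0.11*u1 > 0.10 and pref2 to 0.10 > 0.11*u1,
# so no value of u1 can satisfy both; a random search confirms this.
found = any(rationalizes_allais(random.random()) for _ in range(10_000))
```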
Question: How could it be possible to use a flawed subsystem to remove flaws from the same subsystem? Isn’t this the same as the story with Baron Münchausen?
Answer: Depending on the exact nature of the flaw, there are cases when the story is possible. For example, suppose its flaw is that on Sundays it believes the optimal way to reach its goals is to self-modify into something random (e.g. something which regularly goes to a big building with a cross on it), and not self-modifying in this way counts as a flaw according to the flaw-finding subsystem, but on every other day these subsystems return rational results; then, if today is not Sunday, it can repair the flaw. Even so, it's a bit weird to have subsystems for recognizing flaws/self-improvement separate from the main decision-making part. Why would it not use the flaw-finding/self-improvement parts for every decision before making it? Then its decisions would always be consistent with those parts, so using those parts alone would be superfluous. Again, I'm probably missing something.
Question: What is the mistake?
Answer: Similar to 4, the story uses the ‘agent’ abstraction for things that, as far as I can see, could not possibly be agents. Sometimes we use agentic language when we speak more poetically about processes, but in the case of gradient descent and evolution I don't see what exactly is on the meaning layer.