Self-modification as a game theory problem

cousin_it 26 Jun 2017 20:47 UTC

18 points

In this post I’ll try to show a surprising link between two research topics on LW: game-theoretic cooperation between AIs (quining, Loebian cooperation, modal combat, etc) and stable self-modification of AIs (tiling agents, Loebian obstacle, etc).

When you’re trying to cooperate with another AI, you need to ensure that its action will fulfill your utility function. And when doing self-modification, you also need to ensure that the successor AI will fulfill your utility function. In both cases, naive utility maximization doesn’t work, because you can’t fully understand another agent that’s as powerful and complex as you. That’s a familiar difficulty in game theory, and in self-modification it’s known as the Loebian obstacle (fully understandable successors become weaker and weaker).

In general, any AI will be faced with two kinds of situations. In “single player” situations, you’re faced with a choice like eating chocolate or not, where you can figure out the outcome of each action. (Most situations covered by UDT are also “single player”, involving identical copies of yourself.) Whereas in “multiplayer” situations your action gets combined with the actions of other agents to determine the outcome. Both cooperation and self-modification are “multiplayer” situations, and are hard for the same reason. When someone proposes a self-modification to you, you might as well evaluate it with the same code that you use for game theory contests.
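
As a toy illustration of that last sentence (a sketch under strong assumptions, not a proposed design), the code below reuses one hypothetical `trusts` check both for choosing a move in a one-shot prisoner's dilemma and for deciding whether to accept a proposed successor. The source-equality rule and all function names are assumptions made for brevity; it is the crudest possible stand-in for a quining or proof-based test.

```python
def trusts(my_source: str, other_source: str) -> bool:
    """Hypothetical trust check: trust only an exact copy of my own source.

    Equality is an assumption made for brevity; a proof-based approach
    would replace this test with a bounded proof search.
    """
    return other_source == my_source


def prisoners_dilemma_move(my_source: str, opponent_source: str) -> str:
    """Game theory use: cooperate ("C") only with a trusted opponent."""
    return "C" if trusts(my_source, opponent_source) else "D"


def accept_self_modification(my_source: str, successor_source: str) -> bool:
    """Self-modification use: adopt a successor only if the same check passes."""
    return trusts(my_source, successor_source)


if __name__ == "__main__":
    me = "agent-v1 source text"
    print(prisoners_dilemma_move(me, me))              # opponent is an exact copy
    print(prisoners_dilemma_move(me, "other agent"))   # opponent is a stranger
    print(accept_self_modification(me, "agent-v2 source text"))
```

This toy rejects every successor that differs from the current agent, which is the price of so crude a check. Any weaker test that accepted improved successors would, by the same code path, also extend cooperation to more opponents.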

If I’m right, then any good theory for cooperation between AIs will also double as a theory of stable self-modification for a single AI. That means neither problem can be much easier than the other, and in particular self-modification won’t be a special case of utility maximization, as some people seem to hope. But on the plus side, we need to solve one problem instead of two, so creating FAI becomes a little bit easier.

The idea came to me while working on this mathy post on IAFF, which translates some game theory ideas into the self-modification world. For example, Loebian cooperation (from the game theory world) might lead to a solution for the Loebian obstacle (from the self-modification world) - two LW ideas with the same name that people didn’t think to combine before!

Dagon 27 Jun 2017 15:37 UTC
4 points
Note that there are two very distinct reasons for cooperation/negotiation:

1) It’s the best way to get what I want. The better I model other agents, the better I can predict how to interact with them in a way that meets my desires. For this item, an external agent is no different from any other complex system.

2) I actually care about the other agent’s well-being. There is a term in my utility function for their satisfaction.

Very weirdly, we tend to assume #2 about humans (when it’s usually a mix of mostly 1 and a bit of 2). And we focus on #1 for AI, with no element of #2.

When you say “code for cooperation”, I can’t tell if you’re just talking about #1, or some mix of the two, where caring about the other’s satisfaction is a goal.
- cousin_it 27 Jun 2017 15:40 UTC
  1 point
  Mostly #1. Is there a reason to build AIs that inherently care about the well-being of paperclippers etc?
  - turchin 27 Jun 2017 15:44 UTC
    1 point
    But EA should be mostly #2?
  - Dagon 30 Jun 2017 15:14 UTC
    0 points
    In humans, there’s a lot of #2 behind our cooperative ability (even if the result looks a lot like #1). I don’t know how universal that will be, but it seems likely to be computationally cheaper at some margin to encode #2 than to calculate and prove #1.
    
    In my view, “code for cooperation” will very often have a base assumption that cooperation in satisfying others’ goals is more effective (which feels like “more pleasant” or “more natural” from inside the algorithm) than contractual resource exchanges.
turchin 26 Jun 2017 22:05 UTC
1 point
I think there is a difference between creating an agent and negotiating with another agent. If agent 1 creates an agent 2, it will always know for sure its goal function.

However, if two agents meet, and agent A tells agent B that it has utility function U, then even if A sends its source code as proof, agent B has no reason to believe it. Any source code could be faked. The more advanced both agents are, the harder it is for them to prove their values to each other, so each will always suspect that the other side is cheating.

As a result, as I once said, perhaps too strongly: Any two sufficiently advanced agents will go to war with each other. The one exception is if they are two instances of the same source code, but even in this case cheating is possible.

To prevent cheating, it is better to destroy the second agent (unfortunately). What solutions to this problem exist in LW research?
- Commander Zander 28 Jun 2017 17:50 UTC
  2 points
  Note that source code can’t be faked in the self-modification case. Software agent A can set up a test environment (a virtual machine or simulated universe), create new agent B inside that, and then A has a very detailed and accurate view of B’s innards.
  
  However, logical uncertainty is still an obstacle, especially with agents not verified by theorem-proving.
- cousin_it 27 Jun 2017 8:24 UTC
  2 points
  I don’t believe it. War wastes resources. The only reason war happens is because two agents have different beliefs about the likely outcome of war, which means at least one of them has wrong and self-harming beliefs. Sufficiently rational agents will never go to war, instead they’ll agree about the likely outcome of war, and trade resources in that proportion. Maybe you can’t think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest. That’s one reason why I’m interested in AI cooperation and bargaining.
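
The idea of trading resources "in that proportion" can be made concrete with a standard back-of-the-envelope calculation. The sketch below assumes a pie of size 1, a commonly known probability `p` that A wins everything in a war, and war costs `cost_a` and `cost_b`. These names and the winner-takes-all model are assumptions for illustration, not something stated in the comment above.

```python
from typing import Tuple


def war_payoffs(p: float, cost_a: float, cost_b: float) -> Tuple[float, float]:
    """Expected payoffs (A, B) from fighting over a pie of size 1."""
    return p - cost_a, (1.0 - p) - cost_b


def bargaining_range(p: float, cost_a: float, cost_b: float) -> Tuple[float, float]:
    """Shares x for A that both sides weakly prefer to war.

    A accepts x if x >= p - cost_a.
    B accepts x if 1 - x >= (1 - p) - cost_b, i.e. x <= p + cost_b.
    The range is clipped to feasible shares in [0, 1].
    """
    low = max(0.0, p - cost_a)
    high = min(1.0, p + cost_b)
    return low, high


if __name__ == "__main__":
    low, high = bargaining_range(p=0.7, cost_a=0.1, cost_b=0.2)
    print(low, high)  # any share for A between these beats war for both sides
```

When both costs are positive the range always contains `p` itself, so splitting the pie in proportion to the predicted outcome of war is acceptable to both sides. The sketch relies on both agents agreeing on `p`.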
  - satt 28 Jun 2017 20:06 UTC
    5 points
    I’m flashing back to reading Jim Fearon!
    
    Fearon’s paper concludes that pretty much only two mechanisms can explain “why rationally led states” would go to war instead of striking a peaceful bargain: private information, and commitment problems.
    
    Your comment brushes off commitment problems in the case of superintelligences, which might turn out to be right. (It’s not clear to me that superintelligence entails commitment ability, but nor is it clear that it doesn’t entail commitment ability.) I’m less comfortable with setting aside the issue of private information, though.
    
    Assuming rational choice, competing agents are only going to truthfully share information if they have incentives to do so, or at least no incentive not to do so, but in cases where war is a real possibility, I’d expect the incentives to actively encourage secrecy: exaggerating war-making power and/or resolve could allow an agent to drive a harder potential bargain.
    
    You suggest that the ability to precommit could guarantee information sharing, but I feel unease about assuming that without a systematic argument or model. Did Schelling or anybody else formally analyze how that would work? My gut has the sinking feeling that drawing up the implied extensive-form game and solving for equilibrium would produce a non-zero probability of non-commitment, imperfect information exchange, and conflict.
    
    Finally I’ll bring in a new point: Fearon’s analysis explicitly relies on assuming unitary states. In practice, though, states are multipartite, and if the war-choosing bit of the state can grab most of the benefits from a potential war, while dumping most of the potential costs on another bit of the state, that can enable war. I expect something analogous could produce war between superintelligences, as I don’t see why superintelligences have to be unitary agents.
    - cousin_it 28 Jun 2017 21:02 UTC
      0 points
      That’s a good question and I’m not sure my thinking is right. Let’s say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept it as binding. It seems like both would benefit from that.
      
      That said I agree that bargaining is very tricky. Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game? There’s been pretty much no progress on this for a decade, I don’t see any viable attack.
      - satt 28 Jun 2017 23:11 UTC
        1 point
        
        Let’s say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept the outcome as binding. It seems like both would benefit from that.
        
        My (amateur!) hunch is that an information deficit bad enough to motivate agents to sometimes fight instead of bargain might be an information deficit bad enough to motivate agents to sometimes fight instead of precommitting to exchange info and then bargain.
        
        Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game?
        
        Certainly, any formal model is going to be an oversimplification, but models can be useful checks on intuitive hunches like mine. If I spent a long time formalizing different toy games to try to represent the situation we’re talking about, and I found that none of my games had (a positive probability of) war as an equilibrium strategy, I’d have good evidence that your view was more correct than mine.
        
        There’s been pretty much no progress on this in a decade, I don’t see any viable attack.
        
        There might be some analogous results in the post-Fearon, rational-choice political science literature, I don’t know it well enough to say. And even if not, it might be possible to build a relevant game incrementally.
        
        Start with a take-it-or-leave-it game. Nature samples a player’s cost of war from some distribution and reveals it only to that player. (Or, alternatively, Nature randomly assigns a discrete, privately known type to a player, where the type reflects the player’s cost of war.) That player then chooses between (1) initiating a bargaining sub-game and (2) issuing a demand to the other player, triggering war if the demand is rejected. This should be tractable, since standard, solvable models exist for two-player bargaining.
        
        So far we have private information, but no precommitment. But we could bring precommitment in by adding extra moves to the game: before making the bargain-or-demand choice, players can mutually agree to some information-revealing procedure followed by bargaining with the newly revealed information in hand. Solving this expanded game could be informative.
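
As a hedged starting point for that incremental approach, here is a numerical sketch of a simpler relative of the take-it-or-leave-it game. Note the swap: in this variant the uninformed player A makes the demand and the privately informed player B responds, which is not the move order described above. The uniform cost distribution and the names `p`, `cost_a` and `max_cost_b` are assumptions chosen so that a grid search can solve the model.

```python
def war_probability(x: float, p: float, max_cost_b: float) -> float:
    """Chance that B rejects demand x and fights.

    B's war cost is private, drawn uniformly from [0, max_cost_b].
    B fights when war beats accepting: (1 - p) - cost_b > 1 - x,
    i.e. when cost_b < x - p.
    """
    return min(1.0, max(0.0, (x - p) / max_cost_b))


def expected_payoff_a(x: float, p: float, cost_a: float, max_cost_b: float) -> float:
    """A's expected payoff from demanding share x of a pie of size 1."""
    q = war_probability(x, p, max_cost_b)
    return (1.0 - q) * x + q * (p - cost_a)


def best_demand(p: float, cost_a: float, max_cost_b: float, steps: int = 1000) -> float:
    """Grid search over demands in [0, 1] for A's payoff-maximising demand."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: expected_payoff_a(x, p, cost_a, max_cost_b))


if __name__ == "__main__":
    p, cost_a, max_cost_b = 0.5, 0.1, 0.4
    x = best_demand(p, cost_a, max_cost_b)
    print(x, war_probability(x, p, max_cost_b))
```

With these illustrative numbers A's best demand is about 0.65, above the safe demand of `p = 0.5`, so A knowingly accepts a war probability of roughly 0.375. Private information alone is enough to put war on the equilibrium path in this toy; modelling the information-revealing pre-move would need a richer game than this sketch.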
  - lmn 28 Jun 2017 3:47 UTC
    2 points
    
    Maybe you can’t think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest.
    
    They’ll also find ways of faking whatever communication methods are being used.
  - ChristianKl 27 Jun 2017 9:08 UTC
    2 points
    To me, this sounds like saying that sufficiently rational agents will never defect in the prisoner’s dilemma provided they can communicate with each other.
    - bogus 27 Jun 2017 21:33 UTC
      0 points
      I think you need verifiable pre-commitment, not just communication—in a free-market economy, enforced property rights basically function as such a pre-commitment mechanism. Where pre-commitment (including property right enforcement) is imperfect, only a constrained optimum can be reached, since any counterparty has to assume ex-ante that the agent will exploit the lack of precommitment. Imperfect information disclosure has similar effects, however in that case one has to “assume the worst” about what information the agent has; the deal must be altered accordingly, and this generally comes at a cost in efficiency.
    - Lumifer 27 Jun 2017 15:19 UTC
      0 points
      The whole point of the prisoner’s dilemma is that the prisoners cannot communicate. If they can, it’s not a prisoner’s dilemma any more.
    - cousin_it 27 Jun 2017 9:22 UTC
      0 points
      Yeah, I would agree with that. My bar for “sufficiently rational” is quite high though, closer to the mathematical ideal of rationality than to humans. (For example, sufficiently rational agents should be able to precommit.)
  - Lumifer 27 Jun 2017 15:15 UTC
    1 point
    
    Sufficiently rational agents will never go to war, instead they’ll agree about the likely outcome of war, and trade resources in that proportion.
    
    Not if the “resource” is the head of one of the rational agents on a plate.
    
    The Aumann theorem requires identical priors and identical sets of available information.
    - cousin_it 27 Jun 2017 15:30 UTC
      0 points
      I think sharing all information is doable. As for priors, there’s a beautiful LW trick called “probability as caring” which can almost always make priors identical. For example, before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails. That’s purely a utility function transformation which doesn’t touch the prior, but for all decision-making purposes it’s equivalent to changing my prior about the coin to ⁹⁰⁄₁₀ and leaving the utility function intact. That handles all worlds except those that have zero probability according to one of the AIs. But in such worlds it’s fine to just give the other AI all the utility.
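
The equivalence is easy to check numerically. The sketch below uses made-up actions and payoffs (all of them assumptions) and compares an agent with a 50/50 prior whose utilities are multiplied by 9 in the heads world against an agent with a 90/10 prior and the original utilities.

```python
# Hypothetical actions mapped to (utility if heads, utility if tails).
ACTIONS = {
    "bet_heads": (10.0, -5.0),
    "bet_tails": (-5.0, 10.0),
    "abstain": (1.0, 1.0),
}


def expected_utility(p_heads, u_heads, u_tails, heads_weight=1.0):
    """Expected utility, with utilities in the heads world scaled by heads_weight."""
    return p_heads * heads_weight * u_heads + (1.0 - p_heads) * u_tails


def rank_actions(p_heads, heads_weight=1.0):
    """Action names sorted from best to worst expected utility."""
    def score(name):
        u_heads, u_tails = ACTIONS[name]
        return expected_utility(p_heads, u_heads, u_tails, heads_weight)
    return sorted(ACTIONS, key=score, reverse=True)


if __name__ == "__main__":
    print(rank_actions(0.5, heads_weight=9.0))  # 50/50 prior, heads "cared about" 9x
    print(rank_actions(0.9))                    # 90/10 prior, original utilities
```

For every action the first agent's expected utility is exactly 5 times the second agent's (0.5 * 9 = 4.5 and 0.5 versus 0.9 and 0.1), so the two agents rank all actions identically.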
      - Lumifer 27 Jun 2017 15:47 UTC
        0 points
        
        sharing all information is doable
        
        In all cases? Information is power.
        
        before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails
        
        There is an old question that goes back to Abraham Lincoln or something:
        
        If you call a dog’s tail a leg, how many legs does a dog have?
        entirelyuseless 27 Jun 2017 17:37 UTC
        1 point
        I think the idea is that if one AI says there is a 50% chance of heads, and the other AI says there is a 90% chance of heads, the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome. Since it can redescribe the other’s probabilities as matching its own, agreement on what should be done will be possible. None of this means that anyone actually decides that something will be worth more to them in the case of heads.
        Lumifer 27 Jun 2017 17:57 UTC
        2 points
        
        the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome.
        
        First of all, this makes sense only in the decision-making context (and not in the forecast-the-future context). So this is not about what will actually happen but about comparing the utilities of two outcomes. You can, indeed, rescale the utility involved in a simple case, but I suspect that once you get to interdependencies and non-linear consequences things will get more hairy, if it is possible at all.
        
        Besides, this requires you to know the utility function in question.
  - turchin 27 Jun 2017 10:00 UTC
    1 point
    While war is irrational, demonstrative behaviour like an arms race may be needed to discourage the other side from war.
    
    Imagine that two benevolent superintelligences appear. However, SI A suspects that SI B is a paperclip maximizer and is afraid that SI B may turn it off. So SI A demonstratively invests some resources in protecting its power source, making it expensive for SI B to try to turn off SI A.
    
    This starts an arms race, but the race is unstable and could result in war.
    - cousin_it 27 Jun 2017 12:31 UTC
      1 point
      Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function. Avoiding arms races and any other kind of waste (including waste due to being separate SIs) is in their mutual interest. I don’t expect rational agents to fail achieving mutual interest. If you expect that, your idea of rationality leads to predictably suboptimal utility, so it shouldn’t be called “rationality”. That’s covered in the sequences.
      - turchin 27 Jun 2017 12:42 UTC
        1 point
        But how could I be sure that the paperclip maximiser is a rational agent with a correct decision theory? I would not expect that from a paperclipper.
        cousin_it 27 Jun 2017 12:54 UTC
        0 points
        If an agent is irrational, it can cause all sorts of waste. I was talking about sufficiently rational agents.
        
        If the problem is proving rationality to another agent, SI will find a way.
        turchin 27 Jun 2017 13:01 UTC
        2 points
        My point is exactly this. If an SI is able to prove its rationality (meaning that it always cooperates in the PD, etc.), it is also able to fake any such proof.
        
        If you had two options, to turn off the paperclipper or to cooperate with it by giving it half of the universe, what would you do?
        cousin_it 27 Jun 2017 13:07 UTC
        1 point
        I imagine merging like this:
        
        1) Bargain about a design for a joint AI, using any means of communication
        
        2) Build it in a location monitored by both parties
        
        3) Gradually transfer all resources to the new AI
        
        4) Both original AIs shut down, new AI fulfills their combined goals
        
        No proof of rationality required. You can design the process so that any deviation will help the opposing side.
        turchin 27 Jun 2017 13:29 UTC
        1 point
        I could imagine some failure modes, but surely I can’t imagine the best one. For example, the step where “both original AIs shut down” simultaneously is vulnerable to defection.
        
        I also have some business experience, and I have found that almost every deal includes some cheating, and the cheating is something new every time. So I always have to ask myself: where is the cheating from the other side? If I don’t see it, that’s bad, as it could be something really unexpected. Personally, I hate cheating.
        cousin_it 27 Jun 2017 13:35 UTC
        0 points
        An AI could devise a very secure merging process. We don’t have to code it ourselves.
        turchin 27 Jun 2017 13:40 UTC
        0 points
        But should we merge with the paperclipper if we could turn it off?
        
        It reminds me of Great Britain’s policy towards Hitler before WW2, which was to give him what he wanted in order to prevent war. https://en.wikipedia.org/wiki/Appeasement
        cousin_it 27 Jun 2017 13:47 UTC
        0 points
        If we can turn off the paperclipper for free, sure. But if war would destroy X resources, it’s better to merge and spend X/2 on paperclips.
        turchin 27 Jun 2017 14:14 UTC
        0 points
        So if the price of turning off the paperclipper is Y, and Y is higher than X/2, we should cooperate?
        
        But if we agree on this, we give the paperclipper an incentive to increase Y until it reaches X/2. To increase Y, the paperclipper has to invest in defense mechanisms or offensive weapons. That creates an arms race, which lasts until negotiations become more profitable. However, an arms race is risky and could turn into war.
        
        Edited: higher.
        cousin_it 27 Jun 2017 14:26 UTC
        0 points
        The paperclipper doesn’t need to invest anything. The AIs will just merge without any arms race or war. The possibility of an arms race or war, and its full predicted cost to both sides, will be taken into account during bargaining instead. For example, if the paperclipper has a button that can nuke half of our utility, the merged AI will prioritize paperclips more.
        turchin 27 Jun 2017 14:42 UTC
        1 point
        So they meet before the possible start of the arms race and compare each other’s relative advantages? I still think that they may try to demonstrate more bargaining power than they actually have, and that it is almost impossible for us to predict how their game will play out because of its complexity.
        
        Thanks for participating in this interesting conversation.
        cousin_it 27 Jun 2017 14:57 UTC
        0 points
        Yeah, bargaining between AIs is a very hard problem and we know almost nothing about it. It will probably have all sorts of deception tactics. But in any case, using bargaining instead of war is still in both AIs’ common interest, and AIs should be able to achieve their common interest.
        
        For example, if A has hidden information that will give it an advantage in war, then B can precommit to giving A more share conditional on seeing it (e.g. by constructing a successor AI that visibly includes the precommitment under A’s watch). Eventually the AIs should agree on all questions of fact and disagree only on values, at which point they agree on how the war will likely go, so they skip the war and share the bigger pie according to the war’s predicted outcome.
        turchin 27 Jun 2017 15:40 UTC
        1 point
        BTW, the book “On Thermonuclear War” by Kahn is exactly an attempt to predict the ways of war, negotiation and bargaining between two presumably rational agents (superpowers). As I remember, it even discusses the idea of moving all resources to a new third agent, that is, donating all nukes to the UN.
        
        How could B see that A has hidden information?
        
        Personally, I feel like you have a mathematically correct, but idealistic and unrealistic model of relations between two perfect agents.
        cousin_it 27 Jun 2017 15:45 UTC
        1 point
        Yeah, Schelling’s “Strategy of Conflict” deals with many of the same topics.
        
        A: “I would have an advantage in war so I demand a bigger share now” B: “Prove it” A: “Giving you the info would squander my advantage” B: “Let’s agree on a procedure to check the info, and I precommit to giving you a bigger share if the check succeeds” A: “Cool”
        dogiv 28 Jun 2017 19:53 UTC
        0 points
        If visible precommitment by B requires it to share the source code for its successor AI, then it would also be giving up any hidden information it has. Essentially both sides have to be willing to share all information with each other, creating some sort of neutral arbitration about which side would have won and at what cost to the other. That basically means creating a merged superintelligence is necessary just to start the bargaining process, since they each have to prove to the other that the neutral arbiter will control all relevant resources to prevent cheating.
        
        Realistically, there will be many cases where one side thinks its hidden information is sufficient to make the cost of conflict smaller than the costs associated with bargaining, especially given the potential for cheating.
        lmn 28 Jun 2017 3:57 UTC
        0 points
        
        A: “I would have an advantage in war so I demand a bigger share now” B: “Prove it” A: “Giving you the info would squander my advantage” B: “Let’s agree on a procedure to check the info, and I precommit to giving you a bigger share if the check succeeds” A: “Cool”
        
        Simply by telling B about the existence of an advantage A is giving B info that could weaken it. Also, what if the advantage is a way to partially cheat in precommitments?
        turchin 27 Jun 2017 16:14 UTC
        0 points
        I think there are two other failure modes, which need to be resolved:
        
        A weaker side may prolong negotiations if doing so helps it gain power
        
        A weaker side could fake the size of its army (like North Korea did with its wooden missiles at its last military show)
      - lmn 28 Jun 2017 3:50 UTC
        0 points
        
        Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function.
        
        What combined utility function? There is no way to combine utility functions.
        cousin_it 28 Jun 2017 6:52 UTC
        3 points
        Weighted sum, with weights determined by bargaining.
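
A minimal sketch of what "weighted sum" could mean in practice, under loudly stated assumptions: one divisible resource of size 1, square-root utilities for both agents (so that splitting beats winner-takes-all), and a weight `w` that is simply given rather than derived from any bargaining model. None of these specifics come from the thread.

```python
import math


def combined_utility(share_a: float, w: float) -> float:
    """Weighted sum w * U_A + (1 - w) * U_B for a split of one unit of resource.

    U_A and U_B are assumed to be the square root of each side's share.
    """
    return w * math.sqrt(share_a) + (1.0 - w) * math.sqrt(1.0 - share_a)


def best_split(w: float, steps: int = 1000) -> float:
    """Share given to A's goals that maximises the combined utility (grid search)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda s: combined_utility(s, w))


if __name__ == "__main__":
    print(best_split(0.5))   # equal bargaining weights
    print(best_split(0.75))  # A bargained for three times B's weight
```

With equal weights the merged agent splits the resource evenly. Raising `w`, for example because A held the stronger bargaining position, shifts the split toward A's goals.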
- Dagon 28 Jun 2017 16:20 UTC
  1 point
  
  If agent 1 creates an agent 2, it will always know for sure its goal function.
  
  Wait, we have only examples of the opposite. Every human who creates another human has at best a hazy understanding of that new human’s goal function. As soon as agent 2 has any unobserved experiences or self-modification, it’s a distinct separate agent.
  
  Any two sufficiently advanced agents will go to war with each other
  
  True with a wide enough definition of “go to war”. Instead say “compete for resources” and you’re solid. Note that competition may include cooperation (against mutual “enemies” or against nature), trade, and even altruism or charity (especially where the altruistic agent perceives some similarity with the recipient, and it becomes similar to cooperation against nature).
  - turchin 28 Jun 2017 17:10 UTC
    0 points
    By going to war I meant an attempt to turn off another agent.
    - Dagon 28 Jun 2017 20:56 UTC
      0 points
      I think that’s a pretty binary (and useless) definition. There have been almost no wars that didn’t end until one of the participating groups was completely eliminated. There have been conflicts and competition among groups that did have that effect, but we don’t call them “war” in most cases.
      
      Open, obvious, direct violent conflict is a risky way to attain most goals, even those that are in conflict with some other agent. Rational agents would generally prefer to kill them off by peaceful means.
      - turchin 28 Jun 2017 21:28 UTC
        2 points
        There is a more sophisticated definition of war, coming from Clausewitz, which in contemporary language could be put something like this: “war is changing the will of your opponent without negotiation”. The enemy must unconditionally capitulate and give up its value system.
        
        You could do it by threat, torture, rewriting of the goal system or deleting the agent.
        Dagon 29 Jun 2017 16:51 UTC
        0 points
        Does the agent care about changing the will of the “opponent”, or just changing behavior (in my view of intelligence, there’s not much distinction, but that’s not the common approach)? If you care mostly about future behavior rather than internal state, then the “without negotiation” element becomes meaningless and you’re well on your way toward accepting that “competition” is a more accurate frame than “war”.
- MrMind 27 Jun 2017 14:34 UTC
  1 point
  
  If agent 1 creates an agent 2, it will always know for sure its goal function.
  
  That is the point, though. By Loeb’s theorem, the only agents that are knowable for sure are those with less power. So an agent might want to create a successor that isn’t fully knowable in advance, or, on the other hand, if a perfectly knowable successor could be constructed, then you would have a finite method to ensure the compatibility of two source codes (is this true? It seems plausible).
turchin 26 Jun 2017 21:50 UTC
1 point
Thanks for the interesting post. I think that there are two types of self-modification. In the first, an agent works on lower-level parts of itself, for example by adding hardware or connecting modules. This produces evolutionary development with small returns and is relatively safe.

The other type is high-level self-modification, where a second agent is created, as you describe. Its performance has to be mathematically proved (which is difficult) or tested in many simulated environments (which is also risky, as a superior agent will be able to break out of them). We could call it a revolutionary way of self-improvement. Such self-modification will provide higher returns if successful.

Knowing all this, most agents will prefer evolutionary development, that is, gaining the same power by lower-level changes. But risk-hungry agents will still prefer revolutionary methods if they are time-constrained.

An early-stage AI will be time-constrained by an arms race with other (possible) AIs, and it will prefer risky revolutionary ways of development, even if its probability of failure is very high.

(This was a TL;DR of my text “Levels of self-improvement”.)
- dogiv 28 Jun 2017 20:22 UTC
  2 points
  Thanks, that’s an interesting perspective. I think even high-level self-modification can be relatively safe with sufficient asymmetry in resources—simulated environments give a large advantage to the original, especially if the successor can be started with no memories of anything outside the simulation. Only an extreme difference in intelligence between the two would overcome that.
  
  Of course, the problem of transmitting values to a successor without giving it any information about the world is a tricky one, since most of the values we care about are linked to reality. But maybe some values are basic enough to be grounded purely in math that applies to any circumstances.
  - turchin 28 Jun 2017 21:21 UTC
    0 points
    I also wrote a (draft) text “Catching treacherous turn” where I attempted to design the best possible AI box and identify the conditions under which it will fail.
    
    Obviously, we can’t box a superintelligence, but we could box an AI of around human level and prevent it from self-improving by many independent mechanisms. One of them is wiping its memory before each new task.
    
    In the first text I created a model of the self-improvement process, and in the second I explore how SI could be prevented based on this model.