It seems like your main objection arises because you view SPI as an agreement between the two players.
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. This might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).
In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular guns. And suppose this incentivizes the bandits to avoid regular guns. Then you are incentivized to self-modify to start resisting more. (E.g., if you both use CDT and the “logical time” is “self-modify?” --> “credibly commit?” --> “use nerf?”.) However, if the bandits realize this—i.e., if the “logical time” is “use nerf?” --> “self-modify?” --> “credibly commit?”—then the bandits will want to not use nerf guns, forcing you to not self-modify. And if you each think that you are “logically before” the other party, you will make incompatible commitments (use regular guns & self-modify to resist) and people get shot with regular guns.
So, I agree that credible unilateral commitments can be useful and they can lead to guaranteed Pareto improvements. It’s just that I don’t think it addresses my main objection against the proposal.
So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.
Then you are incentivized to self-modify to start resisting more.
The whole point is to commit not to do this (and then actually not do it)!
It seems like you are treating “me” (the caravan) as unable to make commitments and always taking local improvements; I don’t know why you would assume this. Yes, there are some AI designs which would self-modify to resist more (e.g. CDT with a certain logical time); seems like you should conclude those AI designs are bad. If you were going to use SPI you’d build AIs that don’t do this self-modification once they’ve decided to use SPI.
Do you also have this objection to one-boxing in Newcomb’s problem? (Since once the boxes are filled you have an incentive to self-modify into a two-boxer?)
I think I agree with your claims about committing, AI designs, and self-modifying into a two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:
(1) I am not claiming that SPI will never work as intended (i.e., get adopted without changing players’ strategies or their “meta-strategies”). Rather, I am saying it will work in some settings and fail to work in others.
(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some examples of settings are described in the “Embedding in a Larger Setting” section.) Importantly, this is much more about which real-world setting you apply SPI to than about the prediction being too sensitive to how you formalize things.
(3) Because of (2), I think this is a really tricky topic to talk about informally. I think it might be best to separately ask (a) Given some specific formalization of the meta-setting, what will the introduction of SPI do? and (b) Is it formalizing a reasonable real-world situation (and is the formalization appropriate)?
(4) An IMO important real-world situation is when you—a human or an institution—employ AI (or multiple AIs) which repeatedly interact on your behalf, in a world where a bunch of other humans or institutions are doing the same. The details really matter here, but I personally expect this to behave similarly to the “Evolutionary dynamics” example described in the report. Informally speaking, once SPI is getting widely adopted, you will have incentives to modify your AIs to be more aggressive / lean towards using AIs that are more aggressive. And even if you resist those incentives, you/your institution will get outcompeted by those who use more aggressive AIs. This will result in SPI not working as intended. (The toy simulation below illustrates this incentive.)
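A minimal sketch of this incentive, assuming the caravan payoffs from the report’s “G′; extension of G” matrix (the same numbers used in the calculation further down this thread: resisting pays 9 against unarmed demands and −2 against armed ones; giving in pays 4 and 3 respectively) and, purely for simplicity, a bandit population held fixed at a (1 − ε) fraction demanding unarmed:

```python
# Toy replicator dynamics for the caravan side of point (4).
# Assumed payoffs (from the G' matrix, as quoted later in this thread):
#   resist:  9 vs "demand unarmed", -2 vs "demand armed"
#   give in: 4 vs "demand unarmed",  3 vs "demand armed"
# The bandit mix is held fixed for simplicity; in the full story it adapts too.

EPS = 0.1  # assumed fraction of bandits still demanding armed

def fitness(resist: bool) -> float:
    """Expected payoff of a caravan strategy against the fixed bandit mix."""
    if resist:
        return 9 * (1 - EPS) - 2 * EPS
    return 4 * (1 - EPS) + 3 * EPS

x = 0.01  # initial share of "always resist" caravans
for generation in range(50):
    f_r, f_g = fitness(True), fitness(False)
    mean_fitness = x * f_r + (1 - x) * f_g
    x = x * f_r / mean_fitness  # discrete replicator update

print(f"share of resisting caravans after 50 generations: {x:.3f}")
# The share approaches 1: more aggressive caravans outcompete the rest,
# which is the sense in which SPI stops working as intended here.
```

In the full story the bandits adapt too, which is what eventually gets threats carried out; this sketch only shows the caravan-side pressure towards aggression.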
I originally thought the section “Illustrating our Main Objection: Unrealistic Framing” would suffice to explain all this, but apparently it isn’t as clear as I thought it was. Nevertheless, perhaps it would be helpful to read it with this example in mind?
I agree with (1) and (2), in the same way that I would agree that “one boxing will work in some settings and fail to work in others” and “whether you should one box depends on the larger context it appears in”. I’d find it weird to call this an “objection” to one boxing though.
I don’t really agree with the emphasis on formalization in (3), but I agree it makes sense to consider specific real-world situations.
(4) is helpful, I wasn’t thinking of this situation. (I was thinking about cases in which you have a single agent AGI with a DSA that is tasked with pursuing humanity’s values, which then may have to acausally coordinate with other civilizations in the multiverse, since that’s the situation in which I’ve seen surrogate goals proposed.)
I think a weird aspect of the situation in your “evolutionary dynamics” scenario is that the bandits cannot distinguish between caravans that use SPI and ones that don’t. I agree that if this is the case then you will often not see an advantage from SPI, because the bandits will need to take a strategy that is robust to whether or not the caravan is using SPI (and this is why you get outcompeted by those who use more aggressive AIs). However, I think that in the situation you describe in (4) it is totally possible for you to build a reputation as “that one institution (caravan) that uses SPI”, allowing potential “bandits” to threaten you with “Nerf guns”, while threatening other institutions with “real guns”. In that case, the bandits’ best response is to use Nerf guns against you and real guns against other institutions, and you will outcompete the other institutions. (Assuming the bandits still threaten everyone at a close-to-equal rate. You probably don’t want to make it too easy for them to threaten you.)
Overall I think if your point is “applying SPI in real-world scenarios is tricky” I am in full agreement, but I wouldn’t call this an “objection”.
I agree with (1) and (2), in the same way that I would agree that “one boxing will work in some settings and fail to work in others” and “whether you should one box depends on the larger context it appears in”. I’d find it weird to call this an “objection” to one boxing though.
I agree that (1)+(2) isn’t significant enough to qualify as “an objection”. I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2′) below. And that seems like an objection to me.
(2′) Whether or not it works-as-intended depends on the larger setting and there are many settings—more than you might initially think—where SPI will not work-as-intended.
The reason I think (4) is a big deal is that I don’t think it relies on you being unable to distinguish the SPI and non-SPI caravans. What exactly do I mean by this? I view the caravans as having two parameters: (A) do they agree to using SPI? and, independently of this, (B) if bandits ambush & threaten them (using SPI or not), do they resist or do they give in? When I talk about incentives to be more aggressive, I meant (B), not (A). That is, in the evolutionary setting, “you” (as the caravan-dispatcher / caravan-parameter-setter) will always want to tell the caravans to use SPI. But if most of the bandits also use SPI, you will want to set the (B) parameter to “always resist”.
I would say that (4) relies on the bandits not being able to distinguish whether your caravan uses a different (B)-parameter from the one it would use in a world where nobody invented SPI. But this assumption seems pretty realistic? (At least if humans are involved. I agree this might be less of an issue in the AGI-with-DSA scenario.)
But if most of the bandits also use SPI, you will want to set the (B) parameter to “always resist”.
Can you explain this? I’m currently parsing it as “If most of the bandits would demand unarmed conditional on the caravan having committed to treating demand unarmed as they would have treated demand armed, then you want to make the caravan always choose resist”, but that seems false so I suspect I have misunderstood you.
That is—I think*—a correct way to parse it. But I don’t think it’s false… uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).
You are in a world where most (a 1 − ε fraction) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (i.e., their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in relative frequency. And you have a committed caravan. If you instruct it to always resist, you get payoff 9(1 − ε) − 2ε (using the payoff matrix from “G′; extension of G”). If you instruct it to always give in, you get payoff 4(1 − ε) + 3ε. So it is better to instruct it to always resist (a quick numerical check follows after the footnote below).
*The only issue that comes to mind is my [responding to demand unarmed the same as it responds to demand armed] vs your [treating demand unarmed as they would have treated demand armed]? If you think there is a difference between the two, then I endorse the former and I am confused about what the latter would mean.
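A quick numerical check of the calculation above, as a sketch (the payoffs are the ones quoted from the “G′; extension of G” matrix; ε is the fraction of bandits who demand armed):

```python
# Expected caravan payoffs against a bandit population in which a
# (1 - eps) fraction demands unarmed (payoffs quoted from the G' matrix).
def always_resist(eps: float) -> float:
    return 9 * (1 - eps) - 2 * eps

def always_give_in(eps: float) -> float:
    return 4 * (1 - eps) + 3 * eps

for eps in (0.05, 0.25, 0.49, 0.50, 0.75):
    print(eps, always_resist(eps) > always_give_in(eps))
# Resisting wins exactly when 9(1 - eps) - 2*eps > 4(1 - eps) + 3*eps,
# i.e. when eps < 1/2: whenever most bandits demand unarmed, the
# committed caravan is better off instructed to always resist.
```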
Maybe the difference is between “respond” and “treat”.
It seems like you are saying that the caravan commits to “[responding to demand unarmed the same as it responds to demand armed]” but doesn’t commit to “[and I will choose my probability of resisting based on the equilibrium value in the world where you always demand armed if you demand at all]”. In contrast, I include that as part of the commitment.
When I say “[treating demand unarmed the same as it treats demand armed]” I mean “you first determine the correct policy to follow if there was only demand armed, then you follow that policy even if the bandits demand unarmed”; you cannot then later “instruct it to always resist” under this formulation (otherwise you have changed the policy and are not using SPI-as-I’m-thinking-of-it).
I think the bandits should only accept commitments of the second type as a reason to demand unarmed, and not accept commitments of the first type, precisely because with commitments of the first type the bandits will be worse off if they demand unarmed than if they had demanded armed. That’s why I’m primarily thinking of the second type.
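A minimal sketch of the two commitment types, where `solve_caravan_policy` is a hypothetical stand-in for computing the caravan’s equilibrium resist probability in a given game (the specific numbers are assumptions for illustration):

```python
# Two readings of "treat demand unarmed the same as demand armed".
def solve_caravan_policy(demand_types):
    """Hypothetical stand-in for solving the caravan's side of the game
    in which only the given demand types exist."""
    if demand_types == ("armed",):
        return 0.3  # assumed equilibrium of the original game G
    return 1.0      # with SPI in play, resisting more is locally tempting

def type1_commitment():
    # Commit only to *responding identically* to both demands; the resist
    # probability is still chosen in the world where SPI exists, so it can
    # drift to "always resist" (bad for bandits who demand unarmed).
    p = solve_caravan_policy(("armed", "unarmed"))
    return lambda demand: p  # same response to either demand

def type2_commitment():
    # First determine the policy for the counterfactual game with only
    # "demand armed", then apply it unchanged to both demand types.
    p = solve_caravan_policy(("armed",))
    return lambda demand: p  # same response, frozen at the old value

caravan = type2_commitment()
print(caravan("unarmed"), caravan("armed"))  # identical, both 0.3
```

Under this reading, “instruct it to always resist” is compatible with the first type (it just changes which policy gets plugged in) but violates the second, which is why only the second type gives the bandits a reason to demand unarmed.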
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option—once SPI is in use—the counterfactual world where there is only demand armed just seems so different. Wouldn’t history need to go very differently? Perhaps it wouldn’t even be clear what “you” is in that world?)
But I agree that with DSA-AGIs, the second type of commitment becomes more realistic. (Although, the potential line of thinking mentioned by Caspar applies here: Perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)
Yup, I fully agree.
Yeah, I agree with all of that.