I agree with (1) and (2), in the same way that I would agree that “one-boxing will work in some settings and fail to work in others” and “whether you should one-box depends on the larger context it appears in”. I’d find it weird to call this an “objection” to one-boxing though.
I don’t really agree with the emphasis on formalization in (3), but I agree it makes sense to consider specific real-world situations.
(4) is helpful, I wasn’t thinking of this situation. (I was thinking about cases in which you have a single agent AGI with a DSA that is tasked with pursuing humanity’s values, which then may have to acausally coordinate with other civilizations in the multiverse, since that’s the situation in which I’ve seen surrogate goals proposed.)
I think a weird aspect of the situation in your “evolutionary dynamics” scenario is that the bandits cannot distinguish between caravans that use SPI and ones that don’t. I agree that if this is the case then you will often not see an advantage from SPI, because the bandits will need to take a strategy that is robust across whether or not the caravan is using SPI (and this is why you get outcompeted by those who use more aggressive AIs). However, I think that in the situation you describe in (4) it is totally possible for you to build a reputation as “that one institution (caravan) that uses SPI”, allowing potential “bandits” to threaten you with “Nerf guns”, while threatening other institutions with “real guns”. In that case, the bandits’ best response is to use Nerf guns against you and real guns against other institutions, and you will outcompete the other institutions. (Assuming the bandits still threaten everyone at a close-to-equal rate. You probably don’t want to make it too easy for them to threaten you.)
Overall I think if your point is “applying SPI in real-world scenarios is tricky” I am in full agreement, but I wouldn’t call this an “objection”.
I agree with (1) and (2), in the same way that I would agree that “one-boxing will work in some settings and fail to work in others” and “whether you should one-box depends on the larger context it appears in”. I’d find it weird to call this an “objection” to one-boxing though.
I agree that (1)+(2) isn’t significant enough to qualify as “an objection”. I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2′) below. And that seems like an objection to me.
(2′) Whether or not it works-as-intended depends on the larger setting and there are many settings—more than you might initially think—where SPI will not work-as-intended.
The reason I think (4) is a big deal is that I don’t think it relies on the bandits being unable to distinguish the SPI and non-SPI caravans. What exactly do I mean by this? I view the caravans as having two parameters: (A) do they agree to using SPI? and, independently of this, (B) if bandits ambush & threaten them (using SPI or not), do they resist or do they give in? When I talk about incentives to be more aggressive, I meant (B), not (A). That is, in the evolutionary setting, “you” (as the caravan-dispatcher / caravan-parameter-setter) will always want to tell the caravans to use SPI. But if most of the bandits also use SPI, you will want to set the (B) parameter to “always resist”.
I would say that (4) relies on the bandits not being able to distinguish whether your caravan uses a different (B)-parameter from the one it would use in a world where nobody invented SPI. But this assumption seems pretty realistic? (At least if humans are involved. I agree this might be less of an issue in the AGI-with-DSA scenario.)
But if most of the bandits also use SPI, you will want to set the (B) parameter to “always resist”.
Can you explain this? I’m currently parsing it as “If most of the bandits would demand unarmed conditional on the caravan having committed to treating demand unarmed as they would have treated demand armed, then you want to make the caravan always choose resist”, but that seems false so I suspect I have misunderstood you.
That is, I think,* a correct way to parse it. But I don’t think it’s false… uhm, that makes me genuinely confused. Let me try to re-rephrase, and see if it uncovers the crux :-).
You are in a world where most (1 − ε) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such a commitment). The bandit population (i.e., their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in relative frequency. And you have a committed caravan. If you instruct it to always resist, you get payoff 9(1 − ε) − 2ε (using the payoff matrix from “G’; extension of G”). If you instruct it to always give in, you get payoff 4(1 − ε) + 3ε. So it is better to instruct it to always resist.
*The only issue that comes to mind is my [responding to demand unarmed the same as it responds to demand armed] vs your [treating demand unarmed as they would have treated demand armed]? If you think there is a difference between the two, then I endorse the former and I am confused about what the latter would mean.
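A minimal sketch of the arithmetic above, assuming the payoffs read off from those formulas (9 and −2 for resisting demand unarmed / demand armed, 4 and 3 for giving in), with a fraction ε of the bandits demanding armed:

```python
# Expected payoff of a committed caravan, using the payoffs implied by the
# formulas above (from the "G'; extension of G" matrix referenced there):
#   resist  vs. demand unarmed ->  9      resist  vs. demand armed -> -2
#   give in vs. demand unarmed ->  4      give in vs. demand armed ->  3
# A fraction (1 - eps) of the bandits demand unarmed against a committed caravan.

def expected_payoff(always_resist: bool, eps: float) -> float:
    unarmed, armed = (9, -2) if always_resist else (4, 3)
    return (1 - eps) * unarmed + eps * armed

for eps in (0.05, 0.2, 0.5):
    print(eps, expected_payoff(True, eps), expected_payoff(False, eps))

# For eps < 0.5 we get 9 - 11*eps > 4 - eps, so "always resist" beats
# "always give in", matching the conclusion above.
```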
Maybe the difference is between “respond” and “treat”.
It seems like you are saying that the caravan commits to “[responding to demand unarmed the same as it responds to demand armed]” but doesn’t commit to “[and I will choose my probability of resisting based on the equilibrium value in the world where you always demand armed if you demand at all]”. In contrast, I include that as part of the commitment.
When I say “[treating demand unarmed the same as it treats demand armed]” I mean “you first determine the correct policy to follow if there was only demand armed, then you follow that policy even if the bandits demand unarmed”; you cannot then later “instruct it to always resist” under this formulation (otherwise you have changed the policy and are not using SPI-as-I’m-thinking-of-it).
I think the bandits should only accept commitments of the second type as a reason to demand unarmed, and not accept commitments of the first type, precisely because with commitments of the first type the bandits will be worse off if they demand unarmed than if they had demanded armed. That’s why I’m primarily thinking of the second type.
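To make the distinction concrete, here is a toy sketch; the resist-probabilities are made-up placeholders rather than values derived from the actual game, and only the structural difference between the two commitment types is the point:

```python
# Toy contrast between the two commitment types distinguished above.
# Both respond identically to "demand armed" and "demand unarmed"; they differ
# in which world the shared resist-probability is allowed to be chosen in.

P_RESIST_CHOSEN_WITH_SPI_IN_PLAY = 1.0  # placeholder: e.g. "always resist", picked in the actual world
P_RESIST_FROM_ARMED_ONLY_WORLD = 0.3    # placeholder: whatever would be correct if only demand armed existed

def first_type_response(demand: str) -> float:
    """First type: respond to 'unarmed' exactly as to 'armed', but the shared
    resist-probability may still be (re)optimized in the SPI-aware world."""
    return P_RESIST_CHOSEN_WITH_SPI_IN_PLAY  # same answer for either demand

def second_type_response(demand: str) -> float:
    """Second type: first fix the policy for the counterfactual armed-only world,
    then follow it unchanged even when the bandits actually demand unarmed."""
    return P_RESIST_FROM_ARMED_ONLY_WORLD    # same answer for either demand, frozen in advance

print(first_type_response("unarmed"), second_type_response("unarmed"))
```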
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option—once SPI is in use—the counterfactual world where there is only demand armed just seems so different. Wouldn’t history need to go very differently? Perhaps it wouldn’t even be clear what “you” is in that world?)
But I agree that with DSA-AGIs, the second type of commitment becomes more realistic. (Although, the potential line of thinking mentioned by Caspar applies here: Perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)
Yeah, I agree with all of that.