Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
So no, I don’t think the constraint I proposed would only work if the opponent agents were consequentialists; adopting the strategy does not assume one’s bargaining counterparts are consequentialists. However, if you are a consequentialist, you’ll only adopt the strategy if you think that sufficiently few of the agents you’ll later encounter are of the aforementioned nasty sort. By the logic of commitment races, that’s not guaranteed: it’s plausible that at least some of the agents you encounter are ‘already committed’ to being nasty to you unless you surrender to them, such that you’ll face much nastiness if you make yourself inexploitable. This is my version of what you said above, I think. And yeah, to put it in my ontology: some exploitation-resistant strategies might be wasteful, clumsy, etc., and depending on how nasty the other agents are, maybe most or even all exploitation-resistant strategies are more trouble than they’re worth, at least from a consequentialist perspective. (Nonconsequentialists might have additional reasons to go for exploitation-resistant strategies, and even consequentialists might assign intrinsic value to justice, fairness, and similar concepts.)
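To make the consequentialist calculation concrete, here is a toy model (all numbers and the `expected_payoff` helper are hypothetical, purely illustrative): suppose a fraction of threateners are ‘already committed’ and will carry out their threat whether or not it benefits them, while the rest only follow through when it pays. Then whether a never-concede policy is worth it turns on how that fraction compares to the ratio of concession cost to threat damage.

```python
def expected_payoff(policy: str, p_committed: float,
                    concession_cost: float, threat_damage: float) -> float:
    """Expected payoff of a bargaining policy against a random threatener.

    p_committed: fraction of threateners 'already committed' to carrying
    out the threat even when doing so no longer benefits them.
    """
    if policy == "always_concede":
        # Every threatener, committed or opportunistic, extracts the concession.
        return -concession_cost
    if policy == "never_concede":
        # Opportunists back off (executing the threat would cost them and
        # gain nothing); committed agents inflict the damage anyway.
        return -p_committed * threat_damage
    raise ValueError(policy)

# Illustrative numbers: conceding costs 10, an executed threat costs 100.
for p in (0.01, 0.05, 0.20):
    refuse = expected_payoff("never_concede", p, 10, 100)
    concede = expected_payoff("always_concede", p, 10, 100)
    print(f"p={p:.2f}  refuse={refuse:6.1f}  concede={concede:6.1f}  "
          f"best={'refuse' if refuse > concede else 'concede'}")
```

With these (made-up) numbers, refusing dominates only while fewer than 10% of threateners are pre-committed; past that threshold, the exploitation-resistant policy is more trouble than it’s worth, which is the crux above.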
But like I said, I’m overall optimistic. Not optimistic enough to say “there’s no problem here” (it’s enough of a problem that it’s one of my top priorities, and maybe my top priority), but I still expect the sort of society AGIs construct to be at least as cooperatively competent, i.e. as good at coordinating diverse agents with diverse agendas and beliefs, as Dath Ilan.
Agreed re punting on the question. I forgot to mention that in my list above as a reason for optimism: not only can we human AI designers punt on the question to some extent, but AGIs can punt on it as well. Instead of hard-coding in a bargaining strategy, we / future AGIs can do something like: “don’t think in detail about the bargaining landscape, and definitely not about what other adversarial agents are likely to commit to, until I’ve done more theorizing about commitment races and cooperation and discovered & adopted bargaining strategies that have really nice properties.”
> Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
Cool, I think we basically agree on this point then; sorry for misunderstanding. I just wanted to emphasize the point I made because “you won’t get exploited if you decide not to concede to bullies” is kind of trivially true. :) The operative word in my reply was “robustly,” which is the hard part of dealing with this whole problem. And I think it’s worth keeping in mind that “doing nasty things to you anyway even though it won’t benefit them” is the consequence of a commitment made for ex ante benefits; it’s not the agent being obviously dumb, as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs a pretty extreme-by-human-standards, little-to-lose value system to want to do this… but who knows what misaligned AIs might prefer.)
Re: Symmetry: Yes, that’s why I phrased the original commitment races post the way I did. For commitments designed to exploit others, for commitments designed to render yourself less exploitable, and for that matter for commitments in neither category, you have an incentive to make them ‘first’: early in your own subjective time, and in particular before you think about what others will do, so that your decision isn’t logically downstream of theirs and, hopefully, theirs is logically downstream of yours. You have an incentive to be the first-mover, basically.
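The first-mover incentive can be sketched with a toy Nash-demand-style game (the pie size and demands are hypothetical, chosen only to illustrate): whoever locks in an aggressive demand first leaves the other a best response of accepting scraps, but if both race to commit before modeling each other, the demands collide and everyone loses.

```python
PIE = 10  # total surplus the two agents are bargaining over

def outcome(demand_a: int, demand_b: int) -> tuple[int, int]:
    """Payoffs once both agents have irrevocably locked in their demands.

    Compatible demands are honoured; incompatible demands mean conflict
    and both get nothing -- the failure mode of a commitment race.
    """
    if demand_a + demand_b <= PIE:
        return demand_a, demand_b
    return 0, 0

# If A commits to 8 before B has committed, a consequentialist B's best
# reply is to take the remaining 2 -- so moving first in logical time pays:
print(outcome(8, 2))   # (8, 2)
# But if both raced to commit aggressively before thinking about the other:
print(outcome(8, 8))   # (0, 0)
```

The point of the sketch is that the incentive to commit early exists regardless of whether the commitment is exploitative or merely exploitation-resistant, which is the symmetry at issue.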
And yeah, I do suspect there are various symmetry-breakers that favor various flavors of fairness, niceness, and cooperativeness, and disfavor risky, brinksmanship-style strategies, but I’m far from confident that their cumulative effect is strong enough to ‘dissolve’ the problem. If I thought the problem were dissolved, I wouldn’t still be prioritizing it!