It also has a deontological or almost-deontological constraint that prevents it from getting exploited.
I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being “consequentialists”). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this; indeed, depending on their priors, they might have strong incentives not to. (And they might not update those priors for fear of losing bargaining power.)
That is, these exploiters will follow the same qualitative argument as us — “if I don’t commit to demand x%, and instead compromise with others’ demands to avoid conflict, I’ll lose bargaining power” — and adopt their own pseudo-deontological constraints against being fair. Seems that adopting your deontological strategy requires assuming one’s bargaining counterparts will be “consequentialists” in a similar way as (you claim) the exploitative strategy requires. And this is why Eliezer’s response to the problem is inadequate.
There might be various symmetry-breakers here, but I’m skeptical they favor the fair/nice agents so strongly that the problem is dissolved.
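To make the symmetric case above concrete, here is a minimal toy sketch of a demand game. Everything in it is an illustrative assumption rather than something from the post: the function name `split_outcome`, the pie of 100 units, and the rule that incompatible demands leave both sides with a fixed conflict payoff.

```python
from typing import Tuple


def split_outcome(demand_a: int, demand_b: int, conflict_payoff: int = 0) -> Tuple[int, int]:
    """Payoffs for two agents who have each committed to a demand (in percent).

    Hypothetical toy model: the pie is 100 units. If the two committed
    demands fit inside the pie, each agent gets what it demanded. If they
    do not fit, neither agent backs down (both are committed), so conflict
    occurs and each agent gets conflict_payoff instead.
    """
    if demand_a + demand_b <= 100:
        return demand_a, demand_b
    return conflict_payoff, conflict_payoff


# Two agents who both reason "I must not compromise" and commit to 60%.
both_firm = split_outcome(60, 60)

# One agent commits to 60%, the other accommodates with 40%.
one_yields = split_outcome(60, 40)

# Both commit to an even split.
both_fair = split_outcome(50, 50)
```

In this sketch the commitment that protects an agent against a best-responding counterpart is exactly the one that produces conflict when the counterpart has made the mirror-image commitment, which is the worry raised above.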
I think this is a serious challenge and a way that, as you say, an exploitation-resistant strategy might be “wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property.” At least, that holds unless certain failsafes against miscoordination are used. My best guess is that these look like some variant of safe Pareto improvements (SPIs) that addresses the key problem discussed in this post, which I’ve worked on recently (as you know).
Given this, I currently think the most promising approach to commitment races is to mostly punt the question of the particular bargaining strategy to smarter AIs, and our job is to make sure robust SPI-like things are in place before it’s too late.
Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
So no, I don’t think the constraint I proposed would only work if the opponent agents were consequentialists. Adopting the strategy does not assume one’s bargaining counterparts will be consequentialists. However, if you are a consequentialist, then you’ll only adopt the strategy if you think that sufficiently few of the agents you will later encounter are of the aforementioned nasty sort—which, by the logic of commitment races, is not guaranteed; it’s plausible that at least some of the agents you’ll encounter are ‘already committed’ to being nasty to you unless you surrender to them, such that you’ll face much nastiness if you make yourself inexploitable. This is my version of what you said above, I think. And yeah to put it in my ontology, some exploitation-resistant strategies might be wasteful/clumsy/etc. and depending on how nasty the other agents are, maybe most or even all exploitation-resistant strategies are more trouble than they are worth (from a consequentialist perspective; note that nonconsequentialists might have additional reasons to go for exploitation-resistant strategies. Also note that even consequentialists might assign intrinsic value to justice, fairness, and similar concepts.)
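As a rough illustration of the consequentialist's adoption condition described above, here is a toy expected-value sketch. All names and numbers are hypothetical. It assumes that a fraction `p_committed` of counterparts are already committed to punishing refusal, that every other counterpart simply drops its threat against an agent known never to concede, and that an agent who concedes pays a fixed cost in every encounter.

```python
def refuse_payoff(p_committed: float, punish_cost: float) -> float:
    """Expected payoff of never giving in to threats (toy model).

    Best-responding counterparts drop the threat, so those encounters pay 0.
    Already-committed counterparts carry the threat out, costing punish_cost.
    The agent is never exploited here, but it can still end up unhappy.
    """
    return -p_committed * punish_cost


def concede_payoff(concede_cost: float) -> float:
    """Expected payoff of always conceding: pay concede_cost every encounter."""
    return -concede_cost


def break_even_fraction(punish_cost: float, concede_cost: float) -> float:
    """Fraction of committed counterparts at which both policies pay the same."""
    return concede_cost / punish_cost


def should_refuse(p_committed: float, punish_cost: float, concede_cost: float) -> bool:
    """True if the never-concede policy has the higher expected payoff."""
    return refuse_payoff(p_committed, punish_cost) > concede_payoff(concede_cost)
```

With the made-up numbers `punish_cost=10.0` and `concede_cost=2.0`, refusing comes out ahead in this model only when fewer than 20% of counterparts are committed to being nasty. The model says nothing about how large that fraction actually is, which is the open question.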
But like I said, I’m overall optimistic. Not optimistic enough to say “there’s no problem here”; it’s enough of a problem that it’s one of my top priorities (and maybe my top priority?). But I still do expect the sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.
Agree re punting the question. I forgot to mention that in my list above, as a reason to be optimistic; I think that not only can we human AI designers punt on the question to some extent, but AGIs can punt on it as well to some extent. Instead of hard-coding in a bargaining strategy, we / future AGIs can do something like “don’t think in detail about the bargaining landscape and definitely not about what other adversarial agents are likely to commit to, until I’ve done more theorizing about commitment races and cooperation and discovered & adopted bargaining strategies that have really nice properties.”
Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because “you won’t get exploited if you decide not to concede to bullies” is kind of trivially true. :) The operative word in my reply was “robustly,” which is the hard part of dealing with this whole problem. And I think it’s worth keeping in mind how “doing nasty things to you anyway even though it won’t benefit them” is a consequence of a commitment that was made for ex ante benefits; it’s not the agent being obviously dumb, contrary to what Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this… but who knows what misaligned AIs might prefer.)
Re: Symmetry: Yes, that’s why I phrased the original commitment races post the way I did. For both commitments designed to exploit others and commitments designed to render yourself less exploitable (and, for that matter, for commitments in neither category), you have an incentive to make them ‘first’: early in your own subjective time, and in particular before you think about what others will do, so that your decision isn’t logically downstream of theirs and, hopefully, theirs is logically downstream of yours. You have an incentive to be the first-mover, basically.
And yeah I do suspect there are various symmetry-breakers that favor various flavors of fairness and niceness and cooperativeness, and disfavor brinksmanshippy risky strategies, but I’m far from confident that the cumulative effect is strong enough to ‘dissolve’ the problem. If I thought the problem was dissolved I would not still be prioritizing it!