The tl;dr version of the full report:
The surrogate goals (SG) idea proposes that an agent might adopt a new, seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm, or really hating being shot by a water gun) in order to prevent the realization of threats against goals they actually value (such as staying alive) [TB1, TB2]. If the agent can commit to treating threats against this new goal as seriously as threats against their actual goals, the hope is that the new goal gets threatened instead. Importantly, the purpose of this proposal is not to become more resistant to threats. Rather, the hope is that if the agent and the threatener misjudge each other (each underestimating the other’s commitment to ignore, respectively carry out, the threat), the outcome (Ignore threat, Carry out threat) will be replaced by something harmless.
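To make the mechanism concrete, here is a toy numerical sketch (my own construction, not taken from the report; all payoff numbers are made up). By design, SG leaves the rest of the game untouched, so we only need to look at the “collision” outcome, where the target is committed to ignoring and the threatener is committed to carrying out:

```python
# Toy illustration of the SG mechanism (my construction; payoffs made up).
# Payoffs are (threatener, target). Only the collision outcome
# (Ignore threat, Carry out threat) is modelled, since the commitment to
# treat water-gun threats exactly like real ones keeps everything else
# in the game unchanged.

COLLISION_PAYOFFS = {
    # Without SG: the real threat (e.g. against the target's life) is executed.
    "without_SG": (-1.0, -10.0),
    # With SG: the threatener threatens (and "executes") the water gun instead.
    "with_SG": (-1.0, -0.1),
}

def collision_payoffs(surrogate_adopted: bool) -> tuple:
    """Payoffs when both sides misjudge each other's commitments."""
    return COLLISION_PAYOFFS["with_SG" if surrogate_adopted else "without_SG"]

print(collision_payoffs(False))  # (-1.0, -10.0): a catastrophic collision
print(collision_payoffs(True))   # (-1.0, -0.1):  the same collision, now nearly harmless
```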
Safe Pareto improvements (SPIs) [OC] are a formalization and generalization of this idea. In the straightforward interpretation, the approach applies to a situation where an agent delegates a problem to their representative—but we could also imagine “delegating” to a future, self-modified version of oneself. The idea is to give instructions like “if other agents you encounter carry this same instruction, replace ‘real’ conflicts with them by ‘mock’ conflicts, in which you all act as you would in a real conflict, but the bad outcomes are replaced by less harmful variants”. Potential examples would be fighting (should a fight arise) to first blood instead of to the death, or threatening a diplomatic conflict instead of a military one. In the SG setting, an SPI could correspond to a joint commitment to only make water-gun threats while behaving as if the water gun were real.
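At the level of outcomes, the “safe Pareto improvement” requirement can be sketched as follows (my own simplification; the actual definition in [OC] also constrains how the representatives play the modified game, which this check ignores): every outcome of the mock conflict must be at least as good, for every player, as the corresponding outcome of the real one.

```python
# Outcome-level SPI check (my simplification of [OC]; the full definition
# also requires the representatives to play the modified game the same
# way they would play the original one).

from typing import Dict, Tuple

Payoffs = Tuple[float, float]  # (player 1, player 2)

real_conflict: Dict[str, Payoffs] = {
    "back_down": (0.0, 2.0),
    "fight":     (-10.0, -10.0),  # fight to the death
}

mock_conflict: Dict[str, Payoffs] = {
    "back_down": (0.0, 2.0),
    "fight":     (-1.0, -1.0),    # fight to first blood instead
}

def is_outcome_level_spi(real: Dict[str, Payoffs], mock: Dict[str, Payoffs]) -> bool:
    """True iff every mock outcome weakly Pareto-dominates its real counterpart."""
    return all(
        all(m >= r for m, r in zip(mock[o], real[o]))
        for o in real
    )

print(is_outcome_level_spi(real_conflict, mock_conflict))  # True
```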
First, I have one “methodological” observation: if we want to understand whether these approaches “work”, it is not sufficient to look at the threat situation in isolation. Instead, we need to embed it in a larger context. (Is the interaction one-off, or will it be repeated? Did the agents have a chance to update on the fact that SG/SPI is being used? What do the agents know about each other?) In the report, I show that different settings can lead to (very) different outcomes.
Incidentally, I think this observation applies to most technical research: we should specify the larger context, at least informally, so that we can answer questions like “but will it help?” and “wouldn’t this other approach be even better?”.
Second, SG and SPI share one limitation that seems hard to overcome. Both hope to make agents treat conflict as seriously as before while in fact making conflict outcomes less costly. However, if SG/SPI starts “being a thing”, agents will have an incentive to treat conflict as less costly (e.g., ignoring threats they would previously have taken seriously, or making threats they wouldn’t otherwise dare to make). The report gives an example of an “evolutionary” setting that, in my opinion, illustrates this well; a toy version is sketched below.
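For intuition, here is a minimal replicator-dynamics sketch of this erosion effect (my own toy model, not the report’s example). In a Hawk-Dove game with resource value V and conflict cost C, the equilibrium share of hawks is V/C, so anything that lowers the effective cost of conflict (which is exactly what SG/SPI aims to do) makes aggressive strategies more common and collisions more frequent:

```python
# Replicator-dynamics sketch (my own toy model): cheaper conflict
# breeds more conflict. Hawk-Dove payoffs with value V, cost C:
#   Hawk vs Hawk -> (V - C) / 2 each; Hawk vs Dove -> (V, 0);
#   Dove vs Dove -> V / 2 each.

def equilibrium_hawk_share(V: float, C: float, steps: int = 10_000) -> float:
    x = 0.4  # initial fraction of hawks in the population
    for _ in range(steps):
        f_hawk = x * (V - C) / 2 + (1 - x) * V
        f_dove = (1 - x) * V / 2
        f_avg = x * f_hawk + (1 - x) * f_dove
        x += 0.01 * x * (f_hawk - f_avg)  # discrete replicator step
        x = min(max(x, 0.0), 1.0)
    return x

V = 2.0
for C in (10.0, 4.0):  # modelling SG/SPI-style mitigation as a lower effective C
    x = equilibrium_hawk_share(V, C)
    print(f"C={C}: hawk share ~ {x:.2f}, hawk-vs-hawk collision rate ~ {x * x:.2f}")
# C=10.0: hawk share ~ 0.20, collision rate ~ 0.04
# C=4.0:  hawk share ~ 0.50, collision rate ~ 0.25
```

Whether the higher collision frequency outweighs the lower per-collision harm depends on the parameters; the point is just that the incentives shift once conflict gets cheaper.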
Finally: (1) I believe the hypothesis that “we ‘only’ need to work out the details of SG/SPI, and then we will avoid the realization of most threats” is false. (2) Instead, I would expect the approach to “add up to normality”—that with SG/SPI, we can do mostly the same things we could do with “just” the ability to make contracts and some degree of mutual transparency. (3) However, I still think that studying the approach is useful: it is a way of formalizing things, it will work in some settings, and even where it doesn’t, we might get ideas about what else could work instead.