Thanks for working on this! In my opinion, the management of incentives via counterfactuals is a very promising route to improving AGI safety, and one that the community has under-explored so far.
I am writing several comments on this post; this is the first one. My goal is to identify and discuss angles of the problem that the post itself does not cover, and to point to related work.
On related work: there are obvious parallels between the counterfactual agent designs discussed in “The Incentives that Shape Behaviour” and the post above, and the ITC agent that I constructed in my recent paper Counterfactual Planning. This post about the paper presents the ITC agent construction in a more summarized form.
The main difference is that “The Incentives that Shape Behaviour” and the post above are about incentives in single-action agents; in my paper and the related sequence, I generalize to multi-action agents.
Quick pictorial comparison:
From “The Incentives that Shape Behaviour”: [diagram]
From “Counterfactual Planning”: [diagram]
The similarity in construction is that some of the arrows into the yellow utility nodes emerge from a node that represents the past: the ‘model of original opinions’ in the first picture and the node I0 in the second picture. This construction removes the agent’s control incentive on the downstream nodes, ‘influenced user opinions’ and I1, I2, ⋯.
In the terminology I developed for my counterfactual planning paper, both pictures above depict ‘counterfactual planning worlds’, because the projected mechanics by which the agent’s blue decision nodes determine outcomes in the model differ from the real-world mechanics that will determine the real-world outcomes of those decisions.
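As a toy illustration of why rewiring the utility node's incoming arrow to the past node removes the control incentive, here is a minimal sketch. All names and payoff functions here are hypothetical, invented purely for this example: a "factual" planner, whose utility reads the influenced opinions, prefers an action that shifts opinions, while the counterfactual planner, whose utility reads the frozen original opinions, has no such incentive.

```python
def influenced_opinions(original_opinions, action):
    # Real-world mechanics: the agent's action shifts user opinions upward.
    # (Hypothetical dynamics, chosen only for illustration.)
    return original_opinions + 0.5 * action

def payoff(opinions, action):
    # Toy payoff: higher opinions are rewarded directly, and deviating
    # from the "honest" action a=1 is costly.
    return opinions - (action - 1) ** 2

def factual_utility(original_opinions, action):
    # Utility node wired to the *influenced* opinions node: the agent's
    # action can move the opinions that the utility reads.
    return payoff(influenced_opinions(original_opinions, action), action)

def counterfactual_utility(original_opinions, action):
    # Utility node wired to the frozen past node ('model of original
    # opinions' / I0): the action cannot move the opinions it reads,
    # so the control incentive on the downstream node disappears.
    return payoff(original_opinions, action)

def best_action(utility, original_opinions, actions):
    return max(actions, key=lambda a: utility(original_opinions, a))

actions = [i / 20 for i in range(0, 41)]  # action grid over [0, 2]
print(best_action(factual_utility, 0.0, actions))         # 1.25: skews toward shifting opinions
print(best_action(counterfactual_utility, 0.0, actions))  # 1.0: the honest optimum
```

The factual planner overshoots the honest action because part of its reward flows through the opinions it influences; rerouting that arrow to the past node makes the manipulation term constant in the action, which is exactly the incentive-removal effect of both constructions above.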