Thanks for working on this! In my opinion, the management of incentives via counterfactuals is a very promising route to improving AGI safety, and one that the community has under-explored so far.
I am writing several comments on this post; this is the first one. My goal is to identify and discuss angles of the problem that the post itself does not cover, and to point to related work.
On related work: there are obvious parallels between the counterfactual agent designs discussed in “The Incentives that Shape Behaviour” and the post above, and the ITC agent that I constructed in my recent paper Counterfactual Planning. This post about the paper presents the ITC agent construction in a more summarized form.
The main difference is that “The Incentives that Shape Behaviour” and the post above are about incentives in single-action agents; in my paper and the related sequence, I generalize to multi-action agents.
Quick pictorial comparison:
From “The Incentives that Shape Behaviour”: [diagram]
From “Counterfactual Planning”: [diagram]
The similarity in construction is that some of the arrows into the yellow utility nodes emerge from a node that represents the past: the ‘model of original opinions’ in the first picture and the node I0 in the second picture. This construction removes the agent’s control incentive on the downstream nodes, ‘influenced user opinions’ and I1, I2, ⋯.
In the terminology I developed for my counterfactual planning paper, both pictures above depict ‘counterfactual planning worlds’, because the projected mechanics by which the agent’s blue decision nodes determine outcomes in the model differ from the real-world mechanics that will determine the real-world outcomes of those decisions.
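As a toy illustration of why rewiring the utility node's incoming arrow to the past node removes the control incentive, here is a minimal sketch. All names and payoff functions here are hypothetical, invented purely for this example: a "factual" planner, whose utility reads the influenced opinions, prefers an action that shifts opinions, while the counterfactual planner, whose utility reads the frozen original opinions, has no such incentive.

```python
def influenced_opinions(original_opinions, action):
    # Real-world mechanics: the agent's action shifts user opinions upward.
    # (Hypothetical dynamics, chosen only for illustration.)
    return original_opinions + 0.5 * action

def payoff(opinions, action):
    # Toy payoff: higher opinions are rewarded directly, and deviating
    # from the "honest" action a=1 is costly.
    return opinions - (action - 1) ** 2

def factual_utility(original_opinions, action):
    # Utility node wired to the *influenced* opinions node: the agent's
    # action can move the opinions that the utility reads.
    return payoff(influenced_opinions(original_opinions, action), action)

def counterfactual_utility(original_opinions, action):
    # Utility node wired to the frozen past node ('model of original
    # opinions' / I0): the action cannot move the opinions it reads,
    # so the control incentive on the downstream node disappears.
    return payoff(original_opinions, action)

def best_action(utility, original_opinions, actions):
    return max(actions, key=lambda a: utility(original_opinions, a))

actions = [i / 20 for i in range(0, 41)]  # action grid over [0, 2]
print(best_action(factual_utility, 0.0, actions))         # 1.25: skews toward shifting opinions
print(best_action(counterfactual_utility, 0.0, actions))  # 1.0: the honest optimum
```

The factual planner overshoots the honest action because part of its reward flows through the opinions it influences; rerouting that arrow to the past node makes the manipulation term constant in the action, which is exactly the incentive-removal effect of both constructions above.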