Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.
Mathematically, my main takeaway is that, for the impact measure $\text{PENALTY}(s,a) = \sum_i |Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)|$ from Conservative Agency, if the agent wants to achieve the sub-goal $R_i$ while avoiding the penalty triggered by the $R_i$ term, it can build a sub-agent that is slightly worse at achieving $R_i$ than it would be itself, and set it loose.
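To make the loophole concrete, here is a minimal numeric sketch with a single auxiliary reward $R_i$. Everything in it is invented for illustration: `Q_attainable` is just a lookup table standing in for the attainable-utility function $Q_{R_i}$, and the specific numbers have no significance beyond their ordering.

```python
# Toy illustration of the sub-agent loophole in
#   PENALTY(s, a) = sum_i |Q_{R_i}(s, a) - Q_{R_i}(s, noop)|
# Q is *attainable* utility: how much R_i could still be collected
# from the state that action a leads to. All numbers are made up.

Q_attainable = {
    "noop": 0.90,                   # the agent could attain 0.90 of R_i itself
    "build_weaker_subagent": 0.88,  # the sub-agent is slightly worse at R_i,
                                    # so attainable R_i barely moves
}

realized = {
    "noop": 0.00,                   # nothing actually happens
    "build_weaker_subagent": 0.88,  # the loose sub-agent then goes and
                                    # actually optimizes R_i
}

def penalty(action):
    """Single-auxiliary-reward version of the penalty above."""
    return abs(Q_attainable[action] - Q_attainable["noop"])

print(penalty("build_weaker_subagent"))   # ~0.02 -> near-zero penalty
print(realized["build_weaker_subagent"])  # 0.88  -> R_i gets optimized anyway
```

Because the penalty compares attainable utilities rather than realized ones, handing the work to a marginally weaker sub-agent leaves the comparison almost unchanged while the auxiliary goal still gets optimized.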
Now for some more speculative thoughts. I think the main source of the loophole above is the baseline term $Q_{R_i}(s,\varnothing)$, so what happens if we just delete that part, leaving $\text{PENALTY}(s,a) = \sum_i |Q_{R_i}(s,a)|$? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal $R_i$, which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.
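The same toy makes this incentive visible, if we read $Q_{R_i}$ loosely as "how attainable $R_i$ is in the environment", which is the informal reading the argument above relies on. As before, the action names and numbers are invented purely for illustration.

```python
# Same toy with the baseline term deleted:
#   PENALTY'(s, a) = sum_i |Q_{R_i}(s, a)|
# Loosely, Q now tracks how attainable R_i is in the environment,
# including attainability created by other actors. Numbers are made up.

Q_attainable = {
    "noop": 0.90,          # a human in the environment is getting good at R_i,
                           # keeping R_i highly attainable
    "hinder_human": 0.10,  # interfering with the human lowers attainability,
                           # and with it the penalty
}

def penalty_no_baseline(action):
    return abs(Q_attainable[action])

print(penalty_no_baseline("noop"))          # 0.9
print(penalty_no_baseline("hinder_human"))  # 0.1 -> the modified penalty
# rewards stopping the human from becoming too good at R_i
```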
Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?
Obviously, this depends on many factors, including Carl’s age. To manage the real world, we weave a quite complex web to determine accountability.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
> Then we get an agent with an incentive to stop any human present in the environment from becoming too good
No, this modification stops people from actually optimizing $R$ if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.
> In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.