Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.
Mathematically, my main takeaway is that, for the impact measure $\text{PENALTY}(s,a) = \sum_i |Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)|$ from Conservative Agency, if the agent wants to achieve the sub-goal $R_i$ while avoiding the penalty triggered by the $R_i$ term, it can build a sub-agent that is slightly worse at achieving $R_i$ than it would be itself, and set it loose.
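To make the loophole concrete, here is a minimal numeric sketch with a single auxiliary reward $R_i$. Everything in it is invented for illustration: `Q_attainable` is just a lookup table standing in for the attainable-utility function $Q_{R_i}$, and the specific numbers have no significance beyond their ordering.

```python
# Toy illustration of the sub-agent loophole in
#   PENALTY(s, a) = sum_i |Q_{R_i}(s, a) - Q_{R_i}(s, noop)|
# Q is *attainable* utility: how much R_i could still be collected
# from the state that action a leads to. All numbers are made up.

Q_attainable = {
    "noop": 0.90,                   # the agent could attain 0.90 of R_i itself
    "build_weaker_subagent": 0.88,  # the sub-agent is slightly worse at R_i,
                                    # so attainable R_i barely moves
}

realized = {
    "noop": 0.00,                   # nothing actually happens
    "build_weaker_subagent": 0.88,  # the loose sub-agent then goes and
                                    # actually optimizes R_i
}

def penalty(action):
    """Single-auxiliary-reward version of the penalty above."""
    return abs(Q_attainable[action] - Q_attainable["noop"])

print(penalty("build_weaker_subagent"))   # ~0.02 -> near-zero penalty
print(realized["build_weaker_subagent"])  # 0.88  -> R_i gets optimized anyway
```

Because the penalty compares attainable utilities rather than realized ones, handing the work to a marginally weaker sub-agent leaves the comparison almost unchanged while the auxiliary goal still gets optimized.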
Now for some more speculative thoughts. I think the main source of the loophole above is the baseline term $Q_{R_i}(s,\varnothing)$, so what happens if we just delete that part, leaving $\text{PENALTY}(s,a) = \sum_i |Q_{R_i}(s,a)|$? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal $R_i$, which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.
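The same toy makes this incentive visible, if we read $Q_{R_i}$ loosely as "how attainable $R_i$ is in the environment", which is the informal reading the argument above relies on. As before, the action names and numbers are invented purely for illustration.

```python
# Same toy with the baseline term deleted:
#   PENALTY'(s, a) = sum_i |Q_{R_i}(s, a)|
# Loosely, Q now tracks how attainable R_i is in the environment,
# including attainability created by other actors. Numbers are made up.

Q_attainable = {
    "noop": 0.90,          # a human in the environment is getting good at R_i,
                           # keeping R_i highly attainable
    "hinder_human": 0.10,  # interfering with the human lowers attainability,
                           # and with it the penalty
}

def penalty_no_baseline(action):
    return abs(Q_attainable[action])

print(penalty_no_baseline("noop"))          # 0.9
print(penalty_no_baseline("hinder_human"))  # 0.1 -> the modified penalty
# rewards stopping the human from becoming too good at R_i
```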
Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?
Obviously, this depends on many factors, including Carl’s age. To manage the real world, we weave a quite complex web to determine accountability.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
> Then we get an agent with an incentive to stop any human present in the environment from becoming too good
No, this modification stops people from actually optimizing $R$ if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.
> In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.