This seems like an important issue, but given the example, I’m also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and the example you use is assumed to be “obvious,” so the need for a patch is obvious too. In general that seems unclear: we need to understand the strategic motives to diagnose what is happening, and given that we don’t have good ideas for explainability, I’m unsure how to notice these effects in general so that they can be patched. (I have been working on this and thinking about it a bit, and don’t currently have good ideas.)
I don’t think this is a Goodhart-style effect. Standard indifference is a very carefully constructed mechanism, and it does exactly what it was designed for: making each agent indifferent to its own interruptions. It turns out this doesn’t make them indifferent to the interruptions of other agents, which is annoying but not really surprising.
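A minimal single-agent sketch of the kind of corrective term standard indifference uses (my own toy construction with assumed payoffs, not the post’s formal setup): without the correction, a high enough interruption probability makes the agent prefer dodging its off switch; with the correction, it works regardless.

```python
# Toy indifference sketch. The agent chooses "work" (value 10 if it runs
# to completion, 0 if interrupted) or "avoid" (disable the off switch,
# guaranteed value 8). All numbers are illustrative assumptions.

def expected_value(action, p_interrupt, indifference=False):
    if action == "avoid":          # off switch disabled: never interrupted
        return 8.0
    uninterrupted = 10.0
    interrupted = 0.0
    if indifference:
        # Corrective reward paid upon interruption: the value the agent
        # *would* have received had it not been interrupted.
        interrupted += uninterrupted
    return (1 - p_interrupt) * uninterrupted + p_interrupt * interrupted

def best_action(p, indifference):
    return max(["work", "avoid"],
               key=lambda a: expected_value(a, p, indifference))

# Without the correction, the agent avoids its own off switch when the
# interruption risk is high enough:
assert best_action(0.5, indifference=False) == "avoid"
# With the correction, its choice is independent of interruption risk:
assert best_action(0.5, indifference=True) == "work"
```

The correction makes the agent’s expected value literally identical across interruption probabilities, which is exactly “indifference to its own interruption” and nothing more.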
It’s not Goodhart, it’s just that mutual indifference has to be specifically designed for.
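To make the gap concrete, here is a two-agent toy (again my own construction, with hypothetical payoffs): B is indifferent to its own interruption, which also makes it indifferent to A *causing* that interruption, so B won’t pay even a small cost to stop A.

```python
# Two-agent sketch. B earns 10 for its task; the indifference correction
# pays B that same value if it gets interrupted, so B's reward never
# registers the interruption. Resisting A costs B 1. A (hypothetically)
# gains 5 by interrupting B.

TASK_VALUE = 10.0
RESIST_COST = 1.0

def b_reward(resists, a_tries_to_interrupt):
    # With the indifference correction, B is paid TASK_VALUE whether or
    # not A's interruption succeeds; only the resistance cost varies.
    return TASK_VALUE - (RESIST_COST if resists else 0.0)

def a_reward(b_was_interrupted):
    # Hypothetical payoff: A grabs B's resources when B goes down.
    return 5.0 if b_was_interrupted else 0.0

# Given that A will try to interrupt, B strictly prefers not to resist:
assert b_reward(resists=False, a_tries_to_interrupt=True) > \
       b_reward(resists=True, a_tries_to_interrupt=True)
# So A interrupts unopposed and profits:
assert a_reward(b_was_interrupted=True) > a_reward(b_was_interrupted=False)
```

B’s corrected reward never sees the loss, even though the task actually goes uncompleted, which is why mutual indifference has to be designed for separately.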
The way the agents interact across interruptions seems to parallel exactly the situation where we design for correct behavior in each agent separately, and yet agents can still corrupt the overall design by hijacking one another. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go a long way towards solving the more general problem of Goodhart-like exploitation between agents.
? Doesn’t the design above do that?
Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we’ve assumed away a number of key problems—among the most critical of which seem to be:
We assume a notion of optimality, and, I think, implicitly that the optimal point is unique, which seems to be needed to define the reward—Abram Demski has noted in another post that this is very problematic.
We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent’s indifference. (As an aside, I’m unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we’re computing.) If we can compute another agent’s reward function when designing our agent, however, we can plausibly hijack that agent.
We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That’s a critical issue I’m still trying to wrap my head around, because it’s unclear to me how a system can reason in those cases.