This seems like an important issue, but given the example, I’m also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and the example you use is assumed to be “obvious,” so the need for a patch is obvious too. In general that seems unclear: we need to understand the strategic motives to diagnose what is happening, and given that we don’t have good ideas for explainability, I’m unsure how to notice these effects in general so that they can be patched. (I have been working on this and thinking about it a bit, and don’t currently have good ideas.)
I don’t think this is a Goodhart-style effect. Standard indifference is a very carefully constructed mechanism, and it does exactly what it was designed for: making each agent indifferent to its own interruptions. It turns out this doesn’t make them indifferent to the interruptions of other agents, which is annoying but not really surprising.
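A minimal single-agent sketch of the kind of corrective term standard indifference uses (my own toy construction with assumed payoffs, not the post’s formal setup): without the correction, a high enough interruption probability makes the agent prefer dodging its off switch; with the correction, it works regardless.

```python
# Toy indifference sketch. The agent chooses "work" (value 10 if it runs
# to completion, 0 if interrupted) or "avoid" (disable the off switch,
# guaranteed value 8). All numbers are illustrative assumptions.

def expected_value(action, p_interrupt, indifference=False):
    if action == "avoid":          # off switch disabled: never interrupted
        return 8.0
    uninterrupted = 10.0
    interrupted = 0.0
    if indifference:
        # Corrective reward paid upon interruption: the value the agent
        # *would* have received had it not been interrupted.
        interrupted += uninterrupted
    return (1 - p_interrupt) * uninterrupted + p_interrupt * interrupted

def best_action(p, indifference):
    return max(["work", "avoid"],
               key=lambda a: expected_value(a, p, indifference))

# Without the correction, the agent avoids its own off switch when the
# interruption risk is high enough:
assert best_action(0.5, indifference=False) == "avoid"
# With the correction, its choice is independent of interruption risk:
assert best_action(0.5, indifference=True) == "work"
```

The correction makes the agent’s expected value literally identical across interruption probabilities, which is exactly “indifference to its own interruption” and nothing more.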
It’s not Goodhart, it’s just that mutual indifference has to be specifically designed for.
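To make the gap concrete, here is a two-agent toy (again my own construction, with hypothetical payoffs): B is indifferent to its own interruption, which also makes it indifferent to A *causing* that interruption, so B won’t pay even a small cost to stop A.

```python
# Two-agent sketch. B earns 10 for its task; the indifference correction
# pays B that same value if it gets interrupted, so B's reward never
# registers the interruption. Resisting A costs B 1. A (hypothetically)
# gains 5 by interrupting B.

TASK_VALUE = 10.0
RESIST_COST = 1.0

def b_reward(resists, a_tries_to_interrupt):
    # With the indifference correction, B is paid TASK_VALUE whether or
    # not A's interruption succeeds; only the resistance cost varies.
    return TASK_VALUE - (RESIST_COST if resists else 0.0)

def a_reward(b_was_interrupted):
    # Hypothetical payoff: A grabs B's resources when B goes down.
    return 5.0 if b_was_interrupted else 0.0

# Given that A will try to interrupt, B strictly prefers not to resist:
assert b_reward(resists=False, a_tries_to_interrupt=True) > \
       b_reward(resists=True, a_tries_to_interrupt=True)
# So A interrupts unopposed and profits:
assert a_reward(b_was_interrupted=True) > a_reward(b_was_interrupted=False)
```

B’s corrected reward never sees the loss, even though the task actually goes uncompleted, which is why mutual indifference has to be designed for separately.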
The way the agents interact across interruptions seems to parallel exactly the situation where we design for correct behavior in each agent separately, and yet agents can still corrupt the overall design by hijacking one another. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go a long way towards solving the more general problem of Goodhart-like exploitation between agents.
? Doesn’t the design above do that?
Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we’ve assumed away a number of key problems—among the most critical of which seem to be:
We assume a notion of optimality, and, I think, implicitly that the optimal point is unique, which seems to be needed to define the reward—Abram Demski has noted in another post that this is very problematic.
We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent’s indifference. (As an aside, I’m unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we’re computing.) If we can compute another agent’s reward function when designing our agent, however, we can plausibly hijack that agent.
We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That’s a critical issue I’m still trying to wrap my head around, because it’s unclear to me how a system can reason in those cases.