I think this is an idea worth exploring. The biggest problem I have with it right now is that it seems like current ML methods would get us mesa-optimizers.
To spell it out a bit: At first the policy would be a jumble of heuristics that does decently well. Eventually, though, it would have to be something more like an agent, to mimic humans. But the first agent that forms wouldn’t also be the last, perfectly accurate one. Rather, it would be only somewhat accurate. Thenceforth, further training could operate on the AI’s values and heuristics to make it more human-like… OR it could operate on the AI’s values and heuristics to make it more rational and smart, so that it can predict and then mimic human behavior better. And the latter seems more likely to me.
So what we’d end up with is something that is similar to a human, except with values that are more random and alien, and maybe also more rational and smart. This seems like exactly the sort of thing we are trying to avoid.
Maybe you mean the methods you expect we will use? I don’t think current ML methods make mesa-optimizers.
I know you’re not disputing this, but I think it’s worth having this formal result in the background: for a maximum a posteriori predictor that assigns positive prior probability to the truth, for all ε and for sufficiently large T (the amount of data observed so far), the predictor’s probability assignments will be within ε of the truth for all events (even events well into the future). But yes, I think mesa-optimizers are something to keep in mind, especially if we use good heuristics to pick a model to see if it is maximum a posteriori (since in reality, we wouldn’t be comparing all possible models).
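A rough formalization of the shape of the claim I have in mind (notation mine, details elided): let $\mu$ be the true environment, $x_{<T}$ the first $T$ observations, and $\nu_T \in \arg\max_\nu w(\nu)\,\nu(x_{<T})$ the maximum a posteriori model under a prior $w$ with $w(\mu) > 0$. Then, with $\mu$-probability 1,

$$\forall \varepsilon > 0\ \ \exists T_0\ \ \forall T \ge T_0\ \ \forall A:\quad \bigl|\nu_T(A \mid x_{<T}) - \mu(A \mid x_{<T})\bigr| < \varepsilon.$$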
Side note: I was just thinking about what a mesa-optimizer designed to be robust to gradient updates might look like. Could it try to ensure that small changes to its “values” would be relatively inconsequential to its behavior? For the decision at every timestep between “blend in” and “treacherous turn”, it seems like gradient updates would shift its probability toward “blend in”. Could it avoid this?
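To illustrate the intuition (a toy sketch in PyTorch; the setup is mine, not a claim about real training dynamics): suppose a single logit controls the per-timestep probability of “treacherous turn”. Imitation training against human demonstrations, which always “blend in”, pushes that probability toward zero:

```python
import torch

# One logit parameterizes P("treacherous turn") at each timestep.
logit = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([logit], lr=0.1)

for step in range(100):
    p_treachery = torch.sigmoid(logit)
    # Cross-entropy against the human demonstration, which always blends in.
    loss = -torch.log(1 - p_treachery)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(logit).item())  # driven toward 0: treachery trained away
```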
Also, compared to my fears about other areas of alignment, I feel pretty decent about the possibility of weeding out mesa-optimizers by biasing toward fast or memory-lite functions.
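For concreteness, one crude way to implement that bias (a sketch, with all specifics mine; a real speed or simplicity prior would be more careful than an L1 penalty):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))

def penalized_loss(logits, targets, lam=1e-4):
    task_loss = nn.functional.cross_entropy(logits, targets)
    # Complexity proxy: total weight magnitude, standing in for a
    # "memory-lite" bias against models that implement elaborate inner search.
    complexity = sum(p.abs().sum() for p in model.parameters())
    return task_loss + lam * complexity

x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
loss = penalized_loss(model(x), y)
```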
I did mean current ML methods, I think. (Maybe we mean different things by that term.) Why wouldn’t they make mesa-optimizers, if they were scaled up enough to successfully imitate humans well enough to make AGI?
As for your side note: I’m not sure I understand the example. It seems to me that a successfully blending-in (i.e., deceptively aligned) mesa-optimizer would, with each gradient update, get smarter, but its values would not change (I believe the mesa-alignment paper calls this “value crystallization”). The reason is that changing its values would not affect its behavior, since its behavior is driven primarily by its epistemology: it correctly guesses the base objective and then attempts to optimize for it.
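A toy way to see the crystallization point (a sketch in PyTorch; the decomposition into “values” and “epistemology” parameters is mine): if the model’s action depends only on its estimate of the base objective and not on its values, the training loss has no gradient with respect to the value parameters, so they never update:

```python
import torch

values = torch.randn(4, requires_grad=True)        # mesa-objective parameters
epistemology = torch.randn(4, requires_grad=True)  # estimate of base objective

action = epistemology.sum()   # behavior routes around `values` entirely
loss = (action - 1.0) ** 2    # imitation loss on the behavior
loss.backward()

print(epistemology.grad)  # nonzero: the epistemology keeps improving
print(values.grad)        # None: values receive no gradient ("crystallized")
```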
I think we did. I agree current methods scaled up could make mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.
> I’m not sure I understand the example
I wasn’t trying to suggest the answer to
> Could it try to ensure that small changes to its “values” would be relatively inconsequential to its behavior?
was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I’m mostly musing here.