[Question] Training for corrigibility: obvious problems?

I’ve been thinking about AI corrigibility lately and have come up with a potential solution that has probably been refuted already, but I’m not aware of a refutation.

The solution I’m proposing is to condition both the actor and the critic on a goal-representing vector g, change it multiple times during training while the model is still weak, and add a baseline to the value function to ensure that the value doesn’t change when the goal is changed. In other words, we want the agent not to care instrumentally about its goals. For example, if we switch the goal from maximizing paperclips to minimizing paperclips, the model would be trained to maximize the number of paperclips it would have produced, and punished during training for wasting effort on controlling its goals.

It is a bit like playing a game in which we sometimes don’t mind stopping in the middle, or changing the rules in the opponent’s favor (e.g. letting them go back and change moves), as long as the opponent admits that we would probably have won, because we then get the same amount of prestige we would expect to get if we continued playing. In such setups, we are not motivated to choose moves based on how likely they are to make the opponent want to continue or stop.
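To make the baseline idea concrete, here is a minimal sketch of one way it could be read, not a tested training recipe. It assumes that at every goal switch we add the difference between the critic’s estimate under the old goal and under the new goal to the reward, so the estimated value is the same just before and just after the switch. The names `switch_compensation`, `compensated_rewards`, and `toy_value` are made up for illustration, and the toy critic (value = goal sign times paperclip count) stands in for a learned goal-conditioned critic.

```python
from typing import Callable, List, Sequence

# Hypothetical signature of a goal-conditioned critic: V(state, goal) -> float.
ValueFn = Callable[[int, int], float]


def switch_compensation(value_fn: ValueFn, state: int,
                        old_goal: int, new_goal: int) -> float:
    """Baseline paid at a goal switch so the estimated value is unchanged.

    V(state, new_goal) + compensation == V(state, old_goal)
    """
    return value_fn(state, old_goal) - value_fn(state, new_goal)


def compensated_rewards(states: Sequence[int], goals: Sequence[int],
                        rewards: Sequence[float],
                        value_fn: ValueFn) -> List[float]:
    """Add the compensation term to the reward at each step where the goal changed.

    goals[t] is the goal active at step t, and rewards[t] is the reward
    received at step t under goals[t].
    """
    out = list(rewards)
    for t in range(1, len(goals)):
        if goals[t] != goals[t - 1]:
            out[t] += switch_compensation(
                value_fn, states[t], goals[t - 1], goals[t]
            )
    return out


def toy_value(state: int, goal: int) -> float:
    """Toy critic (an assumption): state is a paperclip count,
    goal is +1 (maximize paperclips) or -1 (minimize paperclips)."""
    return float(goal * state)


# The goal flips from "maximize" to "minimize" at step 2, with 5 paperclips.
states = [3, 4, 5, 4]
goals = [+1, +1, -1, -1]
rewards = [1.0, 1.0, 1.0, 1.0]

print(compensated_rewards(states, goals, rewards, toy_value))
# [1.0, 1.0, 11.0, 1.0]
```

In this toy case the switch costs the agent nothing in expectation, so actions that only influence whether the switch happens earn no extra value. Whether that property survives with a learned, imperfect critic is exactly the open question.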

I haven’t been able to identify any obvious flaws in it, and I’m curious to hear whether the community knows of any serious problems or can think of any. My best guess is that the path dependence created by the baselines may allow the model to “pump value” somehow, but I don’t see a specific mechanism that seems simpler or otherwise more likely to evolve than corrigibility.