are you familiar with mutual information maximizing interfaces?
the key trick is simply to maximize the sum of mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) as the only reward function for a small model.
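to make that concrete, here is a rough toy sketch of the kind of reward I mean; it’s my own construction (made-up variable names, a naive plug-in estimator over a discrete logged batch), not code from the paper:

```python
# Toy sketch (not the paper's code): reward = I(command; action) + I(action; state change),
# estimated with plug-in counts over a logged batch of discrete interaction steps.
import numpy as np
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical logged steps: (user command, agent action, resulting state change).
batch = [
    ("left",  "move_left",  "x-1"),
    ("left",  "move_left",  "x-1"),
    ("right", "move_right", "x+1"),
    ("right", "move_left",  "x-1"),  # policy slip: the action ignores the command
]

reward = (
    plugin_mi([(c, a) for c, a, _ in batch])    # I(command; action)
    + plugin_mi([(a, d) for _, a, d in batch])  # I(action; state change)
)
print(reward)  # larger when actions track commands and state changes track actions
```

the real work uses different estimators and spaces; the point is only that the single training signal is how much information flows along command -> action -> environment.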
in my current view [spoiler for my judgement, so you can form your own if you like]:
it’s one of the most promising corrigibility papers I’ve seen. It still has some problems from the perspective of corrigibility, but it seems to have the interesting effect of making a reinforcement learner desperately want to be instructed. there are probably still very severe catastrophic failures hiding in slivers of plan space that would make a MIMI superplanner dangerous (eg, at approximately human levels of optimization, it might try to force you to spend time with it and give it instructions?). I don’t think it works to train a model any way other than from scratch, it doesn’t solve multi-agent settings, and it doesn’t solve interpretability (though it might combine really well with runtime interpretability visualizations, doubly so if the visualizations are mechanistically exact), so over-optimization would still break it, and it only works while a user is actively controlling it. but it seems much less prone to failure than vanilla RLHF, because it produces an agent that, if I understand correctly, stops moving when the user stops moving (to a first approximation).
it seems to satisfy the mathematical simplicity you’re asking for. I’m likely going to attempt follow-up research: I want to figure out whether there’s a way to do something similar with much bigger models, i.e. in the 1M to 6B parameter range, and I want to see how weird the behavior is when you try to steer a pretrained model with it. a friend is advising me casually on the project.
It seems to me that there are some key desiderata for corrigibility that it doesn’t satisfy; in particular, it isn’t terribly desperate to explain itself to you, it just wants your commands to seem to come at sensible times, so that your control over when it acts gives you control over the environment. but it makes your feedback much denser through the training process and produces a model that, if I understand correctly, gets bored without instruction. also it seems like, with some tweaking, it might be a good model of what makes human relationships satisfying, which is a key tell I look for.
very curious to hear your thoughts.
That would be a very poetic way to die: an AI desperately pulling every bit of info it can out of a human, and dumping that info into the environment. They do say that humanity’s death becomes more gruesome and dystopian the closer the proposal is to working, and that does sound decidedly gruesome and dystopian.
Anyway, more concretely, the problem which jumps out to me is that maximizing mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) just means that all the info from the command routes through the action and into the environment in some way; the semantics or intent of the command need not have anything at all to do with the resulting environmental change. Like, maybe there’s a prompt on my screen which says “would you like the lightswitch on (1) or off (0)?”, and I enter “1”, and then the AI responds by placing a coin heads-side-up. There’s no requirement that my one bit actually needs to be encoded into the environment in a way which has anything to do with the lightswitch.
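To put a number on that (a toy illustration of my own, using John’s hypothetical lightswitch/coin setup): the mutual information score is identical whether the agent’s single bit ends up in the lightswitch or in the coin, since relabeling the outcomes leaves the joint distribution’s information content unchanged.

```python
# Toy check of the objection (my own illustration): mutual information can't tell a
# semantically faithful agent from one that dumps the command bit somewhere irrelevant.
import numpy as np

def mi_bits(joint):
    """I(X; Y) in bits for a joint probability table (rows: X, cols: Y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# The user types 0 or 1 with equal probability (one bit of command).
# Faithful agent: command 1 -> lightswitch on, command 0 -> lightswitch off.
faithful = [[0.5, 0.0],   # rows: command 0 / 1, cols: light off / on
            [0.0, 0.5]]

# Coin agent: command 1 -> coin heads-up, command 0 -> tails-up; lightswitch untouched.
coin = [[0.5, 0.0],       # rows: command 0 / 1, cols: coin tails / heads
        [0.0, 0.5]]

print(mi_bits(faithful), mi_bits(coin))  # 1.0 1.0 -- the objective can't tell them apart
```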
When I sent my friend the link to this comment, he replied:
my imprecision may have misled you :)
ah i think you forgot the first term in the MIMI objective, I(s_t, x_t), which makes the mapping intuitive by maximizing information flow from the environment into the user. what you proposed was similar to optimizing only the second term, I(x_t, s_{t+1} | s_t), which would indeed suffer from the problems that john mentions in his reply
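For reference, the per-step objective as I understand his correction (notation adapted by me; x_t is the user’s input, s_t the environment state):

I(s_t; x_t) + I(x_t; s_{t+1} | s_t)

the first term rewards getting state information in front of the user, so their input can depend on what is actually happening; the second rewards the user’s input determining how the state changes from there.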