I'm studying the effects of an inconsistent comparison function on optimizing with comparisons,
because I want to know whether it prevents the two agents from converging on a desirable equilibrium quickly enough,
in order to help my reader understand whether optimizing with comparisons can solve the problem of inconsistency and unreliability in reward engineering.
Can you explain this one a bit more? It seems to me that if the human is giving inconsistent answers, in the sense that the human says A > B and B > C and C > A, then the thing to do is to flag this and ask them to resolve the inconsistency instead of trying to find a way to work around it. Interpretability > Magic, I say.
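To make "flag this" concrete, here is a minimal sketch of what I have in mind, assuming the comparisons are collected as (preferred, dispreferred) pairs; the representation and function name are illustrative, not something from the post:

```python
# Minimal sketch (illustrative, not from the post): detect an intransitive cycle
# such as A > B, B > C, C > A in a set of pairwise comparisons, so the system
# can flag it and ask the human to resolve it instead of working around it.
from collections import defaultdict

def find_preference_cycle(comparisons):
    """Return one cycle of items as a list if the comparisons are intransitive,
    otherwise None. `comparisons` is a list of (preferred, dispreferred) pairs."""
    graph = defaultdict(list)  # edge a -> b means "a was judged better than b"
    for preferred, dispreferred in comparisons:
        graph[preferred].append(dispreferred)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)   # unseen nodes default to WHITE
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:                  # back edge: found a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        color[node] = BLACK
        stack.pop()
        return None

    for item in list(graph):
        if color[item] == WHITE:
            cycle = dfs(item)
            if cycle:
                return cycle
    return None

print(find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")]))
# ['A', 'B', 'C', 'A'] -> pause and ask the human which judgment to revise.
```

If a cycle comes back, the system could pause and ask the human which of the conflicting judgments to keep before any reward is generated from them.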
I don’t think that would work in this case. I derived the project idea from section 2 of Thoughts on reward engineering, where the overseer generates rewards based on its preferences and provides them to RL agents.
Suppose the training starts with the overseer generating rewards from its preferences and the agents updating their value functions accordingly. After a while the agents propose something new, and the overseer generates a reward that is inconsistent with those it has generated before. But suppose this new reward reflects the true preference, so the proper fix would be to revise the earlier rewards. However, rewarded is rewarded; I guess it would be hard to reverse the corresponding changes in the value functions.
Of course one could record all actions, rewards, and snapshots of the value functions, then rewind and reapply the training with revised rewards. But given today’s model sizes and training volumes, it’s not that straightforward.
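To illustrate what I mean by rewinding and reapplying, here is a toy sketch with a tabular Q-function standing in for the agents' value functions; the transition log, the learning rate, and the revision step are all illustrative assumptions, and the point is precisely that this kind of replay does not scale to large models and long training runs:

```python
# Toy illustration (all names and numbers are illustrative assumptions): record
# every rewarded transition, and if the overseer later revises a reward, reset
# the value function and replay the log. A tabular Q-function stands in for the
# agents' value functions; with large models the replay is the expensive part.

ALPHA, GAMMA = 0.1, 0.99
ACTIONS = ("left", "right")

def q_update(q, s, a, r, s_next):
    """One tabular Q-learning update."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)

# 1. Train while logging every transition the overseer rewarded.
q, log = {}, []
for s, a, r, s_next in [("s0", "left", 1.0, "s1"), ("s1", "right", 0.0, "s0")]:
    log.append((s, a, r, s_next))
    q_update(q, s, a, r, s_next)

# 2. The overseer decides the first reward contradicted its true preference.
revised = [(s, a, 0.0 if (s, a) == ("s0", "left") else r, s_next)
           for s, a, r, s_next in log]

# 3. Rewind (reset the value function) and reapply with revised rewards.
q = {}
for s, a, r, s_next in revised:
    q_update(q, s, a, r, s_next)
```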