I’m not claiming that other methods for alignment are not useful or important. I am claiming that there is a signal for post-training, one that is strategic, arises from interaction between agents, has not been used so far, and is important to alignment.

It may be useful to recall Becker’s model: from my perspective, his viewpoint for modeling crime, law enforcement, and the judicial system neither contradicts other viewpoints nor detracts from them.

I use “fundamental question” to mean “a more general question,” in the sense that it can be addressed using a broader set of tools; perhaps that is the source of the “disagreement,” if there is one.
The weighted average reward profile is a good suggestion; we can add it to the experiments. I don’t expect it to change our results, but it is worth trying. Note that even if it performs well, our conclusion still stands: how would you find this average without running the bandit algorithm? It would still validate the mechanism and the approach, agreed?
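To make the point concrete, here is a minimal sketch of why the weighted-average baseline presupposes the bandit run. All details here (three arms, the arm means, Gaussian noise, an epsilon-greedy learner) are hypothetical illustrations, not our actual experimental setup: the weights and per-arm reward estimates that define the average only exist after the bandit algorithm has been run.

```python
import random

def run_bandit(true_means, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy bandit; returns empirical mean-reward estimates
    and pull counts per arm (both produced only by running the bandit)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(steps):
        # Explore with probability eps, otherwise pull the current best arm.
        if rng.random() < eps:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])
        reward = true_means[arm] + rng.gauss(0, 0.1)  # noisy reward signal
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical reward profile over three strategies.
estimates, counts = run_bandit([0.2, 0.5, 0.8])

# The "weighted average reward profile": weights are the empirical pull
# frequencies, which are themselves an output of the bandit run.
total = sum(counts)
weighted_avg = sum(e * c / total for e, c in zip(estimates, counts))
```

The point of the sketch is that `weighted_avg` is computed from `estimates` and `counts`, quantities that do not exist before the bandit has interacted with the environment, so a well-performing weighted average would still validate the mechanism.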