Does this make sense? If you disagree, do you believe that shaping moral character explicitly is the only way to go?
I don’t think “societal pressures / evolution” is a good set of fundamental stuff because it misses white-box methods, both of evaluation (e.g. probing, circuit detection) and intervention (e.g. steering, ablation).
But maybe that’s not the real disagreement. Maybe it’s more about, as you say, “shaping moral character explicitly”, which isn’t inherently in conflict with treating training as fundamental: you can easily think about training as part of a perspective on shaping moral character. What it is especially in conflict with is the assumption that we have a known-good process for checking answers that serves as ground truth for RL. If that assumption holds, then we don’t have to think about moral character; we just have to design training systems that produce good outcomes according to the ground-truth-generating process. But if we relax that assumption, then we’re basically saying “we want good behavior even when we can’t check it”, and that forces us to think about generalization, i.e. “moral character.”
Yeah, I’m still confused about why “the most popular reward profile from the bandit algorithm” should make a good baseline. Am I right that the bandit algorithm, at each time step, is drawing from different reward profiles stochastically? And so isn’t the weighted average reward profile actually pretty important? I’d be interested in a comparison of that average to the “default” profile used for the baseline, both numerically (relatively easy) and in how it performs as a baseline (hard, you don’t actually have to do this).
I’m not claiming that other methods for alignment are not useful or important. But I am claiming there is a signal for post-training that is strategic, based on interaction between agents, not used thus far, and important to alignment.

Maybe it’s useful to recall Becker’s model: from my perspective, his viewpoint for modeling crime, law enforcement, and the judicial system does not contradict other viewpoints or detract from them.

I use “fundamental question” to mean “a more general question”, in the sense that it can be addressed using a broader set of tools; maybe that is the source of the “disagreement”, if there is one.
The weighted average reward profile is a good suggestion; we can add it to the experiments. I don’t think it will change our results, but it’s worth a try. Notice that even if it does perform well, our conclusion still stands, since how would you find this average without running the bandit algorithm? It would still validate the mechanism and the approach, agreed?
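To make the comparison concrete, here is a minimal sketch of the two baselines being discussed. All names and numbers are hypothetical placeholders, not taken from the actual experiments: it assumes we have logged each candidate reward profile as a vector and counted how often the bandit algorithm drew each one over a run.

```python
import numpy as np

# Hypothetical logged data: each row is one reward profile the bandit
# can draw; counts[i] is how often the bandit selected profile i.
profiles = np.array([
    [1.0, 0.2, 0.5],   # profile 0
    [0.3, 0.9, 0.4],   # profile 1
    [0.6, 0.6, 0.6],   # profile 2
])
counts = np.array([120, 45, 35])

# "Most popular" baseline: the single profile drawn most often.
most_popular = profiles[np.argmax(counts)]

# Weighted-average baseline: profiles averaged by empirical draw frequency.
weights = counts / counts.sum()
weighted_avg = weights @ profiles

print(most_popular)
print(weighted_avg)
```

Note that both quantities are computed from the bandit algorithm’s own selection history, which is the point above: even if the weighted average performs well as a baseline, you could not have constructed it without first running the bandit.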