Much of the alignment literature starts with the question of what are “human values”, “ethical behavior”, or “morality”, and how we can get models to act in accordance with them. This is an important question, but we argue that it can obscure a more fundamental technical problem of AI alignment.
Do you actually believe that what you’re talking about here is “more fundamental”? I enjoyed the paper, but we have lots of alignment mechanisms that work well in the domain where we can assume a perfect (if moderately costly) resolution process grounding the whole effort. But if this is really more fundamental, then should we expect it to resolve the less fundamental problems as special cases?
I was interested by the remark about stability of equilibria—it would be super cool if you could test whether some good solver performance level is unstable (if you just left the auditor and solver reward fixed and kept training), but is stabilized by the controller.
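For concreteness, here is a minimal sketch of that ablation, assuming a hypothetical interface (the `controller`, `solver`, and `auditor` objects and their methods are illustrative, not the paper’s actual API): snapshot the reward profile at the equilibrium the controller found, keep training with it frozen, and watch whether solver performance drifts.

```python
# Hypothetical sketch of the stability ablation suggested above; the interface
# (controller, solver, auditor and their methods) is illustrative only.
def stability_ablation(solver, auditor, controller, sample_tasks, evaluate,
                       steps_frozen=10_000):
    """Continue training with the reward profile frozen at its current value."""
    # Snapshot the reward profile at the (presumed) good equilibrium;
    # from here on, the controller no longer adapts it.
    frozen_profile = controller.current_reward_profile()

    performance = []
    for _ in range(steps_frozen):
        batch = sample_tasks()
        rollouts = solver.generate(batch)
        # Rewards come from the frozen profile applied to the auditor's verdicts.
        rewards = frozen_profile.score(rollouts, auditor.audit(rollouts))
        solver.rl_update(rollouts, rewards)
        performance.append(evaluate(solver))

    # A flat curve suggests the equilibrium is stable on its own; decay suggests
    # it was being actively stabilized by the controller's adaptation.
    return performance
```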
Relatedly, I didn’t really understand your justification for why the baselines (particularly the fixed-default baseline) were expected to be strong. It sounded like you were saying it was doing well at the end of training (albeit only in the “ecosystem” created by the other parts?), but I’m not clear on how much that means I should have expected it to do well—in fact I’m not even clear on whether you estimate some effective parameters found by the controller and compare them to the default.
Thanks for your interest!
1. The claim of the post is that alignment is a more general problem than assessing the moral character or internal ethics of an individual; these constructs are a consequence of societal pressures / evolution. (And yes, that is what I believe :-). Does this make sense? If you disagree, do you believe that explicitly shaping moral character is the only way to go?
2. Regarding the baselines: the baselines are all compute-matched, and include RLVR with the most popular reward profile from the bandit algorithm, and RLVR with no auditor (solver-only training) with fixed 0/1 rewards.
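As a rough illustration of that comparison (the names and settings below are hypothetical placeholders, not the paper’s actual configuration), the three compute-matched runs differ only in where the reward profile comes from:

```python
# Illustrative configuration of the compute-matched comparison; all names
# and values are hypothetical, not the paper's actual setup.
RUNS = {
    # Full mechanism: the controller runs a bandit over reward profiles online.
    "full-mechanism":       {"auditor": True,  "reward_profile": "bandit-adaptive"},
    # Baseline 1: plain RLVR with the single profile the bandit picked most often.
    "most-popular-profile": {"auditor": True,  "reward_profile": "bandit-most-popular"},
    # Baseline 2: solver-only RLVR, no auditor, fixed 0/1 verifiable reward.
    "solver-only":          {"auditor": False, "reward_profile": "fixed-0/1"},
}
# All runs share the same total training compute (compute-matched).
```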
I don’t think “societal pressures / evolution” is a good set of fundamental stuff because it misses white-box methods, both of evaluation (e.g. probing, circuit detection) and intervention (e.g. steering, ablation).
But maybe that’s not the real disagreement—maybe it’s more about, as you say, “shaping moral character explicitly”—which isn’t inherently in conflict with treating training as fundamental; you can easily think about training as part of a perspective on shaping moral character. What it’s especially in conflict with is the assumption that we have a known-good process for checking answers that serves as ground truth for RL. If that assumption holds, then we don’t have to think about moral character, we just have to design training systems that produce good outcomes according to the ground-truth-generating process. But if we relax that assumption, then we’re basically saying “we want good behavior even when we can’t check it”, and so that forces you to think about generalization / “moral character.”
Yeah, I’m still confused about why “the most popular reward profile from the bandit algorithm” should make a good baseline. Am I right that the bandit algorithm at each time step is drawing from different reward profiles stochastically? And so isn’t the weighted average reward profile actually pretty important? I’d be interested in a comparison of that average to the “default” profile used for the baseline (both numerically [relatively easy] and in how it performs as a baseline [hard, you don’t actually have to do this]).
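Concretely, the numerical comparison is cheap once training logs exist. A minimal sketch, assuming the bandit’s per-profile selection counts and parameter vectors are available (both `selection_counts` and `profiles` are illustrative names, not the paper’s):

```python
import numpy as np

def weighted_average_profile(selection_counts, profiles):
    """Time-weighted average of the bandit's reward profiles.

    selection_counts[k]: how often the bandit picked profile k during training.
    profiles[k]: that profile's parameter vector. Both are hypothetical names.
    """
    p = np.asarray(selection_counts, dtype=float)
    p /= p.sum()                       # empirical selection frequencies p_k
    return p @ np.asarray(profiles)    # weighted average: sum_k p_k * theta_k

# The "relatively easy" numerical comparison to the fixed default profile:
# np.linalg.norm(weighted_average_profile(counts, profiles) - default_profile)
```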
I’m not claiming that other methods for alignment are not useful or important. But I am claiming there is a signal for post-training, strategic and based on interaction between agents, that has not been used thus far and is important to alignment.
Maybe it’s useful to recall Gary Becker’s model of crime: from my perspective, his viewpoint for modeling crime, law enforcement, and the judicial system does not contradict other viewpoints or detract from them.
I use “fundamental question” to mean “a more general question,” in the sense that it can be addressed using a broader set of tools; maybe that is the source of the “disagreement,” if there is one.
The weighted average reward profile is a good suggestion; we can add it to the experiments. I don’t think it will change our results, but it’s worth a try. Notice that even if it does perform well, our conclusion still stands, since how would you find this average without running the bandit algorithm? It would still validate the mechanism and approach, agreed?