Elad Hazan
Thanks for your interest!
1. The claim of the post is that alignment is a more general problem than assessing the moral character or internal ethics of an individual, and that these constructs are themselves a consequence of societal pressures and evolution (and yes, that is what I believe :-). Does this make sense? If you disagree, do you believe that shaping moral character explicitly is the only way to go?
2. Regarding the baselines: they are all compute-matched, and include RLVR with the most popular reward profile from the bandit algorithm, and RLVR with no auditor (solver-only training) with fixed 0/1 rewards.
AI Alignment as Equilibrium Design
Nice post! Another place where nonlinear utilities have been considered is the technical game theory / regret minimization literature. One reason for introducing them is computational complexity: while VNM utility applies, linearizing the utilities may be inefficient in terms of representation and downstream algorithms. An example is here, where a log-utility game is considered due to complexity/efficiency considerations.
A course I taught on AI safety and alignment at Princeton two years ago:
https://sites.google.com/view/cos598aisafety/
The list of materials is on the webpage; happy to share lecture slides as well.
I believe Peter Henderson has a newer version. Roger Grosse taught a similar course at U Toronto.
I’m not claiming that other methods for alignment are not useful or important. But I am claiming that there is a signal for post-training that is strategic, based on interaction between agents, not used thus far, and important to alignment.
Maybe it’s useful to recall Becker’s model: from my perspective, his way of modeling crime, law enforcement, and the judicial system does not contradict other viewpoints or detract from them.
I use “fundamental question” to mean “a more general question”, in the sense that it can be addressed using a broader set of tools. Maybe that is the source of the “disagreement”, if there is one.
A weighted-average reward profile is a good suggestion; we can add it to the experiments. I don’t think it will change our results, but it is worth a try. Notice that if it does perform well, our conclusion still stands: how would you find this average without running the bandit algorithm? It would still validate the mechanism and approach, agreed?
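To make the comparison concrete, here is a minimal sketch of how the two fixed profiles could be derived from a bandit run. Everything in it is an assumption for illustration: a reward profile is taken to be a tuple of numbers, `history` is a hypothetical log of the profiles the bandit selected, and the function names are mine, not from our code. Both quantities are computed from the selection history, which is the point: neither is available without running the bandit first.

```python
from collections import Counter

def most_popular_profile(history):
    """Return the profile the bandit selected most often (assumed baseline)."""
    counts = Counter(history)
    return counts.most_common(1)[0][0]

def weighted_average_profile(history):
    """Average the selected profiles, weighting each by how often it was selected."""
    counts = Counter(history)
    total = sum(counts.values())
    dim = len(history[0])
    return tuple(
        sum(profile[i] * n for profile, n in counts.items()) / total
        for i in range(dim)
    )

# Hypothetical selection log: each entry is the profile chosen in one round.
history = [(1.0, 0.0), (1.0, 0.0), (0.5, 0.5), (1.0, 0.0)]

popular = most_popular_profile(history)       # (1.0, 0.0)
average = weighted_average_profile(history)   # (0.875, 0.125)
```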