X-AI-Phi Lab, Japan Advanced Institute of Science and Technology
Masaharu Mizumoto
Thanks, I appreciate the thoughtful comment and the reference.
If there are real concerns about this project, I would place them mainly at the level of implementation: how to collect moral intuitions in a systematic and reliable way, how much data is needed, and how to train models effectively on that basis. On this view, if a model behaves poorly, that points to shortcomings in training rather than to a flaw in the overall direction.
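To make that implementation worry concrete, here is a minimal toy sketch of the final step, fitting a scalar score predictor to collected intuition ratings. Everything in it (the random features, the ridge learner, the sample sizes) is an invented placeholder for illustration, not our actual pipeline:

```python
import numpy as np

# Hypothetical minimal sketch: fitting a model to systematically
# collected moral-intuition ratings. All data, features, and
# dimensions below are invented placeholders.

rng = np.random.default_rng(1)

n_scenarios, dim = 500, 16
X = rng.normal(size=(n_scenarios, dim))      # stand-in scenario features
w_true = rng.normal(size=dim)                # latent structure of judgments
ratings = X @ w_true + rng.normal(0, 0.5, n_scenarios)  # noisy intuitions

# Ridge regression as the simplest possible score learner.
lam = 1.0
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ ratings)

# Held-out check: how much data is "enough" is exactly the open question.
X_test = rng.normal(size=(100, dim))
y_test = X_test @ w_true
pred = X_test @ w_hat
print("held-out correlation:", np.corrcoef(pred, y_test)[0, 1])
```

The held-out check is where the "how much data" question bites: agreement with the latent judgments degrades as the number of rated scenarios shrinks relative to the complexity of the judgments being approximated.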
The project you shared is certainly interesting as a piece of experimental philosophy applied to ethical analysis, but from an alignment perspective it seems relatively indirect: using such data to actually build models that behave well would require several additional assumptions and steps.
That said, I would be interested to see how these approaches develop in practice.
The Goodhart problem typically arises when a proxy is optimized in place of the true objective; many alignment approaches set up multiple explicit rules or objectives, which is why optimizing them can lead to problematic behavior. In our case, by contrast, the scalar virtuosity score is not meant to be a proxy for something else, but an approximation of holistic human judgments about virtuous behavior.
If optimizing this signal led to non-virtuous behavior, that would indicate a failure of the signal itself, i.e., a poor approximation of the underlying judgments, rather than a standard Goodhart problem.
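For what it's worth, here is a toy illustration of the distinction I have in mind (every function and constant below is invented for the example, not a claim about our system): optimizing a learned approximation of the judgment signal can only fail where the approximation is poor, whereas optimizing a structurally different proxy drifts away from the judgments entirely.

```python
import numpy as np

# Toy contrast between (a) optimizing a learned approximation of a
# judgment signal and (b) optimizing a structurally different proxy.
# All functions here are invented for illustration.

rng = np.random.default_rng(0)

def human_judgment(x):
    # Stand-in for holistic human virtue judgments over behaviors x.
    return -(x - 1.0) ** 2

# (a) Approximation: fit the judgment signal itself from noisy ratings.
xs = rng.uniform(-3, 3, 200)
ys = human_judgment(xs) + rng.normal(0, 0.1, 200)
coeffs = np.polyfit(xs, ys, deg=2)           # learned "virtuosity score"
score = np.poly1d(coeffs)

# (b) Proxy: a correlated but structurally different objective that
# rewards "more x" without bound.
proxy = lambda x: 0.5 * x

grid = np.linspace(-3, 3, 601)
x_score = grid[np.argmax(score(grid))]
x_proxy = grid[np.argmax(proxy(grid))]

print(f"optimum of learned score: x={x_score:.2f}, "
      f"true judgment={human_judgment(x_score):.3f}")
print(f"optimum of proxy:         x={x_proxy:.2f}, "
      f"true judgment={human_judgment(x_proxy):.3f}")
# The learned-score optimum sits near the judgment optimum (x near 1);
# any error there reflects approximation failure. The proxy optimum
# drifts to the boundary: the classic Goodhart pattern.
```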
By the way, I am not working on this alone; we are a team of philosophers and engineers.