I just got feedback from the BlueDot Impact judges, and thought I could share some parts of it:
“Your idea is unique and pretty cool. Trying to align an AI assistant with strict utilitarian principles using the Constitutional AI (CAI) method is fascinating. This experiment gets at one of the big alignment questions—if we could perfectly encode a moral system into an AI, would we actually want to use it?
[...] (basically summarizing the points made in the project)
The prompts used in evaluation are good but fairly standard—what happens with more complex cases? Moral trade-offs involving large numbers of people (e.g., pandemic response strategies). Multi-step reasoning problems (e.g., economic policies that affect well-being over decades). Conflicts between utilitarianism and human emotions (e.g., should AI break a promise if it maximizes well-being?). Also consider investigating whether CAI is the best method for moral alignment. The experiment shows that CAI can encode utilitarianism into an AI, but is this actually the best approach? Would reinforcement learning from human feedback lead to more acceptable utilitarian AI models? Could a debate-style training method (where the AI argues against itself using different moral principles) produce better-aligned models?
This paper raises more questions than it answers, which is a good sign—it means there’s room for deeper exploration. To make this project even stronger, the next step would be to compare different moral alignment approaches, test stricter utilitarianism, and investigate instrumental convergence more directly. Overall, a fantastic and thought-provoking project!”