RogerDearnaley comments on Terrified Comments on Corrigibility in Claude’s Constitution

RogerDearnaley 17 Mar 2026 1:39 UTC
9 points
“The current state of the art in alignment involves writing a natural language document about what we want the AI’s personality to be like. (I’m never going to get over this.)”
Would you rather hand-craft a loss function for human values? It’s O(1GB) of data; can you get it right on the first critical try?
The data on what humans value is in the training set. As Eliezer put it in The Hidden Complexity of Wishes:
“There is no safe wish smaller than an entire human morality.”
Except, of course, if your AI has already read trillions of tokens of text relevant to human values and morality, to what humans want, and to how they do things (including rescuing grandmothers from burning buildings). Then you can just point to that part of its world model, in abstract concepts. The best way to do that is actually natural language. We know; we tried all the other possibilities first.