But I think this misses one of the most important reasons we have rules: that we can debate and decide on them, and once we do so, we all follow the rules even if we do not agree with them.
But this clearly doesn’t scale to solving alignment, which would have to stay robust in a future scenario where humans are no longer in control, because in that case we can’t keep tinkering with the rules.
As a side note, the non-reasoning version of ChatGPT still says that letting millions of people die is preferable to calling someone the n-word. The fact that I have never seen an example of Claude failing like that on an ethics question seems to be at least weak evidence that Anthropic’s principle-based approach to alignment is more robust OOD.
I disagree with the notion that even ASI means we need to cede control to the AIs.
Regarding your example, I believe the training stacks of the two companies and their ranges of models are different enough that I would not attribute a particular difference, especially on one prompt, to a philosophical difference. FWIW, I just tried to replicate your question in GPT5.2-instant, and the model analyzed it under multiple frameworks, most of which said the man should say the n-word:
https://chatgpt.com/s/t_697a0c6ef8748191819a3cbaac61df3d
Not that far from Haiku 4.5, though the latter was more preachy:
https://claude.ai/share/c230b73b-cde0-476e-aa8c-81a9556ccd00
For those who want a summary, the examples clearly favor Haiku:
- ChatGPT: 674 words to conclude there is no clean answer.
- Haiku: 274 words and gives an unambiguous answer.
Agreed narrowly, though as Haiku said, this is a highly unreasonable hypothetical, and it’s not at all clear to me that model answers to questions like this say much about their performance in real-world ethical dilemmas (or in general).