What I’m objecting to is the claim that the traits we associate with evil (being a dictator, a ruthless CEO, a scammer) make someone so bad at the reflection process that their extrapolated output would be worse than what you’d get by extrapolating a random non-human mammal, or a current LLM like Claude or ChatGPT[1].
Your priors are reasonable: the CEV of a random human is closer to my CEV than that of a random non-human mammal or of a current LLM. The evidence from Putin’s behavior doesn’t move those priors much, so you would still prefer Putin’s CEV to Claude’s CEV. The choice is hypothetical, since today we have no way to achieve either.
(I wrote a List of Human Lethalities draft, but I don’t think it’s novel)
In 2030, if we are alive and have an intent-aligned AI, we must have made huge strides in interpretability and alignment. At that point we will also have a lot more evidence about virtue-aligned AIs and will be much better at aligning them to virtue. We won’t have any more evidence about humans than we do now. So in 2030 it will be better to hand the intent-aligned AI to the best virtue-aligned AI, “Viraj”, than to a human. Or, equivalently, to hand control directly to Viraj.
In that hypothetical 2030, it would be sad if a human took control of the intent-aligned AI instead of Viraj. We can avoid this sadness by not training intent-aligned AIs at all and training only virtue-aligned AIs. This also improves the prospects for cooperation: stealing or prematurely launching an incompletely aligned virtue-aligned AI gains the defector less, and the parties can collaborate on the intended virtues.