Take 3: No indescribable heavenworlds.

As a writing exercise, I’m writing an AI Alignment Hot Take Advent Calendar—one new hot take, written every day for 25 days. Or until I run out of hot takes.

Some people think as if there are indescribable heavenworlds. They’re wrong, and this is important to AI alignment.

This is an odd accusation given that I made up the phrase “indescribable heavenworld” myself, so let me explain. It starts not with heavenworlds, but with Stuart Armstrong writing about the implausibility of indescribable hellworlds.

A hellworld is, obviously, a bad state of the world. An indescribable hellworld is a state of the world where everything looks fine at first, and then you look closer and everything still looks fine, and then you sit down and think about it abstractly and it still seems fine, and then you go build tools to amplify your capability to inspect the state of the world and they say it’s fine, but actually, it’s bad.

If the existence of such worlds sounds plausible to you, then I think you might enjoy and benefit from trying to grok the metaethics sequence.

Indescribable hellworlds are sort of like the reductio of an open question argument. Open question arguments say that no matter what standard of goodness you set, if it’s a specific function of the state of the world then it’s an open question whether that function is actually good or not (and therefore, the argument goes, moral realism). For a question to really be open, it must be possible to get either answer—and indescribable hellworlds are what keep the question open even if we use the standard of all of human judgment, human cleverness, and human reflectivity.

If you read Reducing Goodhart, you can guess some things I’d say about indescribable hellworlds. There is no unique standard of “explainable,” and you can have worlds that are the subject of inter-standard conflict (even supposing badness is fixed), which can sort of look like indescribable badness. But ultimately, the doubt over whether some world is bad puts a limit on how hellish it can really be, sort of like harder choices matter less. A preference that can’t get translated into some influence on my choices is a weak preference indeed.

An indescribable heavenworld is of course the opposite of an indescribable hellworld. It’s a world where everything looks weird and bad at first, and then you look closer and it still looks weird and bad, and you think abstractly and yadda yadda still seems bad, but actually, it’s the best world ever.

Indescribable heavenworlds come up when thinking about what happens if everything goes right. “What if”—some people wonder—“the glorious post-singularity utopia is actually good in ways that are impossible for humans to comprehend? That would, by definition, be great, but I worry that some people might try to stop that glorious future from happening by trying to rate futures using their present preferences / judgment / cleverness / reflectivity. Don’t leave your fingerprints on the future, people!”

No indescribable heavenworlds. If a future is good, it’s good for reasons that make sense to me—maybe not at first glance, but hopefully at second glance, or after some abstract thought, or with the assistance of some tools whose chain of logic makes sense to me. If the future realio trulio seems weird and bad after all that work, it’s not secretly great, we probably just messed up.