Can there be an indescribable hellworld?

Can there be an indescribable hellworld? What about an un-summarisable one?

By hellworld, I mean a world of very low value according to our value scales: maybe one where large numbers of simulations are being tortured (a.k.a. mind crimes).

A hellworld could look superficially positive, if we don't dig too deep. It could look irresistibly positive.

Could it be bad in a way that we would find indescribable? It seems that it must be possible. The set of things that can be described to us is finite; the set of things that can be described to us without fundamentally changing our values is much smaller still. If a powerful AI were motivated to build a hellworld such that the hellish parts of it were too complex to be described to us, it would seem that it could. There is no reason to suspect that the set of indescribable worlds contains only good worlds.
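
One rough way to formalise that counting intuition, where the alphabet Σ and the length bound N are my own illustrative choices rather than anything argued for above: any description we could receive is a finite string over some finite alphabet Σ, of length at most some N we could actually absorb, so

$$\bigl|\{\text{descriptions we could receive}\}\bigr| \;\le\; \sum_{k=0}^{N} |\Sigma|^{k} \;=\; \frac{|\Sigma|^{N+1}-1}{|\Sigma|-1} \;<\; \infty,$$

whereas there is no comparable bound on how varied possible worlds can be, so some worlds (with no reason to expect only good ones among them) must escape any description we could receive.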

Can it always be summarised?

Let's change the setting a bit. We have a world W, and a powerful AI A that is giving us information about W. A is aligned/friendly/corrigible or whatever we need it to be. It's also trustworthy, in that it always speaks to us in a way that increases our understanding.

Then if W is an indescribable hellworld, can A summarise that fact for us?

It seems that it can. In a very trivial sense, it can, by just telling us "it's an indescribable hellworld". But it seems it can do more than that, in a way that's philosophically interesting.

A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values into some complete whole, or use some extrapolation procedure (e.g. CEV). In any case, there is a procedure for establishing our values (or else the very concept of "hellworld" makes no sense).

It is possible that our values themselves may be indescribable to us now (especially in the case of extrapolations). But A can at least tell us that W is against our values, provide some description of the value it is against, and say what part of the procedure ended up giving us that value. This does give us some partial understanding of why the hellworld is bad: a useful summary, if you want.
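
To make the shape of that partial understanding concrete, here is a minimal sketch; every name in it (ValueViolationReport, summarise_hellworld, and the example contents) is a hypothetical illustration rather than anything specified above. The idea is just that A can report which value W is against, which value-establishing procedure that value comes from, and whatever partial description of the violation it can manage.

```python
from dataclasses import dataclass
from typing import Literal

# The three routes to "our values" mentioned above, used here as illustrative labels.
Procedure = Literal[
    "key invariant extraction",
    "synthesis of contradictory values",
    "extrapolation (e.g. CEV)",
]

@dataclass
class ValueViolationReport:
    """A's summary of why W is a hellworld, even if W itself is indescribable."""
    violated_value: str        # approximate description of the value W is against
    procedure: Procedure       # which value-establishing procedure yields that value
    partial_explanation: str   # whatever description of the violation A can give us

def summarise_hellworld() -> ValueViolationReport:
    # Stand-in for the trustworthy AI A; a real system would derive this from W.
    return ValueViolationReport(
        violated_value="no large-scale suffering of morally relevant minds",
        procedure="extrapolation (e.g. CEV)",
        partial_explanation="W runs vast numbers of suffering simulations, in a way "
                            "too complex to describe to us directly.",
    )

report = summarise_hellworld()
print(f"W violates: {report.violated_value}")
print(f"value established via: {report.procedure}")
print(f"partial summary: {report.partial_explanation}")
```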

On a more meta level, imagine the contrary: that W was a hellworld, but the superintelligent agent could not indicate what human values it actually violated, even approximately. Since our values are not some ex nihilo thing floating in space, but derived from us, it is hard to see how something could be against our values in a way that could never be summarised to us. That seems almost definitionally impossible: if the violation of our values can never be summarised, even at the meta level, how can it be a violation of our values?

Trustworthy debate is FAI complete

The consequence seems to be that we can avoid hellworlds (and, presumably, aim for heaven) by having a corrigible and trustworthy AI that engages in debate or acts as a devil's advocate. Now, I'm very sceptical of getting corrigible or trustworthy AIs in general, but it seems that if we can, we've probably solved the FAI problem.

Note that even in the absence of a single given way of formalising our values, the AI could list the plausible formalisations under which W was or wasn't a hellworld.
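
As a small illustration of that last point, here is a sketch of such a listing; the candidate formalisations and the verdicts are made-up placeholders, not claims about what any formalisation would actually say. The AI enumerates plausible formalisations of our values and flags under which of them W counts as a hellworld.

```python
# Made-up placeholders: candidate formalisations of our values, each mapped to
# the AI's verdict on whether W is a hellworld under that formalisation.
candidate_formalisations = {
    "key invariant values": True,       # W violates an extracted invariant
    "synthesised value whole": True,    # W scores very low under the synthesis
    "CEV-style extrapolation": False,   # the extrapolated values tolerate W
}

def devils_advocate_report(verdicts: dict[str, bool]) -> None:
    """List under which formalisations W would or wouldn't be a hellworld."""
    for formalisation, is_hellworld in verdicts.items():
        status = "a hellworld" if is_hellworld else "not a hellworld"
        print(f"under '{formalisation}': W is {status}")

devils_advocate_report(candidate_formalisations)
```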