Can there be an indescribable hellworld?

Stuart_Armstrong 29 Jan 2019 15:00 UTC

LW: 39 AF: 22

Debate (AI safety technique), Complexity of Value

Can there be an indescribable hellworld? What about an un-summarisable one?

By hellworld, I mean a world of very low value according to our value scales, perhaps one where a large number of simulations are being tortured (aka mind crimes).

A hellworld could look superficially positive, if we don’t dig too deep. It could look irresistibly positive.

Could it be bad in a way that we would find indescribable? It seems that it must be possible. The set of things that can be described to us is finite; the set of things that can be described to us without fundamentally changing our values is much smaller still. If a powerful AI was motivated to build a hellworld such that the hellish parts of it were too complex to be described to us, it would seem that it could. There is no reason to suspect that the set of indescribable worlds contains only good worlds.

Can it always be summarised?

Let’s change the setting a bit. We have a world $W$, and a powerful AI $A$ that is giving us information about $W$. $A$ is aligned/friendly/corrigible, or whatever we need it to be. It’s also trustworthy, in that it always speaks to us in a way that increases our understanding.

Then if $W$ is an indescribable hellworld, can $A$ summarise that fact for us?

It seems that it can. In the very trivial sense, it can, by just telling us “it’s an indescribable hellworld”. But it seems it can do more than that, in a way that’s philosophically interesting.

A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values into some complete whole, or use some extrapolation procedure (eg CEV). In any case, there is a procedure for establishing our values (or else the very concept of “hellworld” makes no sense).

Now, it is possible that our values themselves may be indescribable to us now (especially in the case of extrapolations). But $A$ can at least tell us that $W$ is against our values, and provide some description as to the value it is against, and what part of the procedure ended up giving us that value. This does give us some partial understanding of why the hellworld is bad—a useful summary, if you want.

On a more meta level, imagine the contrary: that $W$ was a hellworld, but the superintelligent agent $A$ could not indicate what human values it actually violated, even approximately. Since our values are not some ex nihilo thing floating in space, but derived from us, it is hard to see how something could be against our values in a way that could never be summarised to us. That seems almost definitionally impossible: if the violation of our values can never be summarised, even at the meta level, how can it be a violation of our values?

Trustworthy debate is FAI complete

It seems that the consequence of that is that we can avoid hellworlds (and, presumably, aim for heaven) by having a corrigible and trustworthy AI that engages in debate or is a devil’s advocate. Now, I’m very sceptical of getting corrigible or trustworthy AIs in general, but it seems that if we can, we’ve probably solved the FAI problem.

Note that even in the absence of a single given way of formalising our values, the AI could list the plausible formalisations for which $W$ was or wasn’t a hellworld.
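As a rough illustration of that last point, here is a minimal Python sketch of "listing the formalisations under which $W$ is or isn't a hellworld". Everything in it is a toy assumption rather than anything proposed above: the `World` feature dictionary, the two example formalisations, the function name `partition_formalisations`, and the `hell_threshold` cut-off are all hypothetical stand-ins for whatever a real value-establishing procedure would produce.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: a "world" is a bag of named numeric features,
# and a "formalisation" of our values maps a world to a scalar value.
World = Dict[str, float]
Formalisation = Callable[[World], float]


def partition_formalisations(
    world: World,
    formalisations: Dict[str, Formalisation],
    hell_threshold: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Return (names that judge the world a hellworld, names that do not)."""
    hellish: List[str] = []
    not_hellish: List[str] = []
    for name, value_of in formalisations.items():
        # Assumption: "hellworld" means value strictly below the threshold.
        if value_of(world) < hell_threshold:
            hellish.append(name)
        else:
            not_hellish.append(name)
    return hellish, not_hellish


# Toy world: inhabitants report being happy but have very little autonomy.
example_world: World = {
    "suffering": 0.1,
    "reported_happiness": 0.9,
    "autonomy": 0.05,
}

# Two made-up candidate formalisations of "our values".
example_formalisations: Dict[str, Formalisation] = {
    "hedonic": lambda w: w["reported_happiness"] - w["suffering"],
    "autonomy_weighted": lambda w: w["autonomy"] - 0.5,
}

hellish, not_hellish = partition_formalisations(example_world, example_formalisations)
print("hellworld under:", hellish)
print("not a hellworld under:", not_hellish)
```

The point of the sketch is only that the AI's report can be a partition over candidate formalisations rather than a single verdict; it says nothing about how the formalisations themselves would be obtained or described.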


Kaj_Sotala 29 Jan 2019 15:50 UTC
5 points
A hellworld is ultimately a world that is against our values. However, our values are underdefined and changeable. So to have any chance of saying what these values are, we need to either extract key invariant values, synthesise our contradictory values into some complete whole, or use some extrapolation procedure (eg CEV). In any case, there is a procedure for establishing our values (or else the very concept of “hellworld” makes no sense).
It feels worth distinguishing between two cases of “hellworld”:
1. A world which is not aligned with the values of that world’s inhabitants themselves. One could argue that in order to merit the designation “hellworld”, the world has to be out of alignment with the values of its inhabitants in such a way as to cause suffering. Assuming that we can come up with a reasonable definition of suffering, then detecting these kinds of worlds seems relatively straightforward: we can check whether they contain immense amounts of suffering.
2. A world whose inhabitants do not suffer, but which we might consider hellish according to our values. For example, something like a Brave New World scenario, where people generally consider themselves happy but where that happiness comes at the cost of suppressing individuality and promoting superficial pleasures.
It’s for detecting an instance of the second case that we need to understand our values better. But it’s not clear to me that such a world should qualify as a “hellworld”, which to me sounds like a world with negative value. While I don’t find the notion of being the inhabitant of a Brave New World particularly appealing, a world where most people are happy but only in a superficial way sounds more like “overall low positive value” than “negative value” to me. Assuming that you’ve internalized its values and norms, existing in a BNW doesn’t seem like a fate worse than death, it just sounds like a future that could have gone better.
Of course, there is an argument that even if a BNW would be okay to its inhabitants once we got there, getting there might cause a lot of suffering: for instance, if there were lots of people who were forced against their will to adapt to the system. Since many of us might find the BNW to be a fate worse than death, then conditional on us surviving to live in the BNW, it’s a hellworld (at least to us). But again this doesn’t seem like it requires a thorough understanding of our values to detect: it just requires detecting the fact that if we survive to live in the BNW, we will experience a lot of suffering due to being in a world which is contrary to our values.
- Stuart_Armstrong 30 Jan 2019 18:10 UTC
  4 points
  Parent
  
  Assuming that we can come up with a reasonable definition of suffering
  
  Checking whether there is a large amount of suffering in a deliberately obfuscated world seems hard, or impossible if a superintelligence has done the obfuscating.
  - Kaj_Sotala 30 Jan 2019 20:39 UTC
    2 points
    Parent
    True, not disputing that. Only saying that it seems like an easier problem than solving human values first, and then checking whether those values are satisfied.
John_Maxwell 30 Jan 2019 0:05 UTC
2 points

the set of things that can be described to us without fundamentally changing our values is much smaller still

What’s the evidence for this set being “much smaller”?
- Stuart_Armstrong 30 Jan 2019 18:08 UTC
  2 points
  Parent
  Can you imagine sitting through a ten-year lecture without your values changing? Can you imagine sitting through that lecture without your values changing somewhat in reaction to the content?
  - Kaj_Sotala 30 Jan 2019 20:40 UTC
    4 points
    Parent
    This seems like it would mainly affect instrumental values rather than terminal ones.
    - Stuart_Armstrong 30 Jan 2019 21:38 UTC
      3 points
      Parent
      In many areas, we have no terminal values until the problem is presented to us, then we develop terminal values (often dependent on how the problem was phrased) and stick to them. Eg the example with Soviet and American journalists visiting each other’s countries.
Dagon 29 Jan 2019 20:39 UTC
2 points
I think a little more formal definition of “describable” and “summarizable” would help me understand. I start with a belief that any world is its own best model, so I don’t think worlds are describable in full. I may be wrong: world-descriptions may compress incredibly well, and it’s possible to describe a world IN ANOTHER WORLD. But fully describing a world inside a subset of that world itself cannot be done.
“summarizable” is more interesting. If it just means “a trustworthy valuation”, then fine—it’s possible, and devolves to “is there any such thing as a trustworthy summarizer”. If it means some other subset of a description, then it may or may not be possible.
- Dagon 31 Jan 2019 21:36 UTC
  2 points
  Parent
  Thinking about other domains, a proof is a summary of (an aspect of) a formal system. It provides a different level of information than is contained in the base axioms. Can we model a “summary” of the suffering level/ratio/hellishness of a world in the same terms? It’s not about trusting the agent, it’s about the agent finding the subset of information about the world that shows us that the result is true.
avturchin 15 Mar 2019 20:57 UTC
1 point
Maybe the right question here is: is it possible to create stronger and stronger qualia of pain, or is the level of pain limited?
If the maximum level of pain is limited to, say, 10 out of 10, then an evil AI has to create complex worlds, like in the story “I Have No Mouth, and I Must Scream”, trying to affect many of our values in the most unpleasant combination, that is, playing anti-music by pressing different values.
If there are no limits to the possible intensity of pain, the evil AI will invest more in upgrading the human brain so that it will be able to feel more and more pain. In that case there will be no complexity, just growing intensity. One could see this type of hell in the ending of the last Trier movie, “The House That Jack Built”. This type of hell is more disturbing to me.
In the Middle Ages the art of torture existed, and this distinction also existed: some tortures were sophisticated, but others were simple but infinitely intense, like the testicle torture.
- Stuart_Armstrong 18 Mar 2019 9:27 UTC
  3 points
  Parent
  But you seem to have described these hells quite well—enough for us to clearly rule them out.
  - avturchin 18 Mar 2019 10:07 UTC
    1 point
    Parent
    I don’t understand why you are ruling them out completely: at least at the personal level, long intense suffering does exist and has happened en masse in the past (cancer patients, concentration camps, witch hunting).
    I suggested two different arguments against s-risks:
    1) Anthropic: s-risks are not the dominant type of experience in the universe, or we would already be in one.
    2) Larger AIs could “save” minds from smaller but evil AIs by creating many copies of such minds and thus creating indexical uncertainty (detailed explanation here), as well as punish copies of such AIs for this, thus discouraging any AI from implementing s-risks.
    - Stuart_Armstrong 18 Mar 2019 16:15 UTC
      2 points
      Parent
      The question of this post is whether there exist indescribable hellworlds—worlds that are bad, but where it cannot be explained to humans how/why they are bad.
      - avturchin 18 Mar 2019 20:31 UTC
        1 point
        Parent
        Yes, I probably understood “indescribable” as a synonym of “very intense”, not literally as “can’t be described”.
        But now I have one more idea about a really “indescribable hellworld”: imagine that there is a qualia of suffering which is infinitely worse than anything that any living being has ever felt on Earth, and it appears in some hellworld, but only in animals or in humans who can’t speak (young children, patients just before death, or it paralyses the ability to speak by its intensity and also can’t be remembered; I have read of historical cases of pain so intense that a person was not able to provide very important information).
        So, this hellworld will look almost like our normal world: animals live and die, people live normal and happy (in time-average) lives and also die. But some counterfactual observer who is able to feel the qualia of any living being will find it infinitely more hellish than our world.
        We also could live now in such hellworld but don’t know it.
        The main reason why it can’t be described is that most people don’t believe in qualia, and the observable characteristics of this world will not be hellish. Beings in such a world could also be called reverse-p-zombies, as they have a much stronger capability for “experiencing” than ordinary humans.
        Stuart_Armstrong 19 Mar 2019 10:25 UTC
        2 points
        Parent
        
        We also could live now in such hellworld but don’t know it.
        
        Indeed. But you’ve just described it to us ^_^
        
        What I’m mainly asking is “if we end up in world $W$, and no honest AI can describe to us how this might be a hellworld, is it automatically not a hellworld?”
        
        avturchin 19 Mar 2019 10:53 UTC
        1 point
        Parent
        It looks like examples are not working here, as any example is an explanation, so it doesn’t count :)
        But in some sense it could be similar to the Godel theorem: there are true propositions which can’t be proved by AI (and explanation could be counted as a type of prove).
        Ok, another example: there are bad pieces of art, I know it, but I can’t explain why they are bad in formal language.
        Stuart_Armstrong 19 Mar 2019 13:11 UTC
        3 points
        Parent
        
        Godel theorem: there are true propositions which can’t be proved by AI (and explanation could be counted as a type of prove).
        
        That’s what I’m fearing, so I’m trying to see if the concept makes sense.
William_S 31 Jan 2019 20:03 UTC
1 point
Do you think you’d agree with a claim of this form applied to corrigibility of plans/policies/actions?
That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.
- Stuart_Armstrong 1 Feb 2019 13:41 UTC
  2 points
  Parent
  Given some definition of corrigibility, yes.