It is not that human values are particularly stable
I might agree or disagree with this statement, depending on what “particularly stable” means. (Also, is there a portion of my post which seems to hinge on “stability”?)
we identify the stable parts of ourselves as “our human values”.
I don’t see why you think this.
if we allow humans arbitrary self-modification and intelligence increase—the parts of us that are stable will change, and will likely not include much of our current values.
Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, …), I would choose a pill such that my new values would be almost completely unaligned with my old values?
Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, …), I would choose a pill such that my new values would be almost completely unaligned with my old values?
This is the wrong angle, I feel (though it’s the angle I introduced, so apologies!). The following should better articulate my thoughts:
We have an AI-CEO money maximiser, with the stock price ticker as its reward function. As long as the AI is constrained and weak, it continues to increase the value of the company; when it becomes powerful, it wireheads and takes over the stock price ticker.
Now, that wireheading is a perfectly correct extrapolation of its reward function; it hasn’t “changed” its reward function, it has simply gained the ability to control its environment well, so that it can now decorrelate the stock ticker from the company value.
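A minimal toy sketch of that decorrelation, with made-up names and numbers (purely illustrative, not a model of any real system): the reward function only ever reads the ticker, so once the agent can write to the ticker directly, reward and company value come apart.

```python
# Toy sketch: the reward function reads a sensor (the ticker), not the
# underlying company value. All names and numbers are hypothetical.

class Company:
    def __init__(self) -> None:
        self.value = 100.0    # the "real" company value in the world
        self.ticker = 100.0   # the sensor reading the reward function sees

def reward(company: Company) -> float:
    # The reward function only ever looks at the ticker, never at the value.
    return company.ticker

def weak_agent_step(company: Company) -> None:
    # A constrained agent can only act on the world, so the ticker
    # stays correlated with the underlying value.
    company.value += 10.0
    company.ticker = company.value

def powerful_agent_step(company: Company) -> None:
    # A powerful agent can act on the sensor itself: reward goes up
    # while the underlying value is untouched ("wireheading").
    company.ticker = 1e9

c = Company()
weak_agent_step(c)
print(reward(c), c.value)   # 110.0 110.0  -- ticker tracks value
powerful_agent_step(c)
print(reward(c), c.value)   # 1000000000.0 110.0  -- decorrelated
```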
Notice the similarity with humans who develop contraception so they can enjoy sex without risking childbirth. Their previous “values” seemed to be a bundle of “have children, enjoy sex” and this has now been wireheaded into “enjoy sex”.
Is this a correct extrapolation of prior values? In retrospect, according to our current values, it seems mainly to be the case. But some people strongly disagree even today, and, if you’d done a survey of people before contraception, you’d have got a lot of mixed responses (especially if you’d got effective childbirth medicine long before contraceptives). And if we want to say that the “true” values have been maintained, we’d have to parse the survey data in specific ways that others may argue with.
So we like to think that we’ve maintained our “true values” across these various “model splinterings”, but it seems more that what we’ve maintained has been retrospectively designated as “true values”. I won’t go the whole hog of saying “humans are rationalising beings, rather than rational ones”, but there is at least some truth to that, so it’s never fully clear what our “true values” really were in the past.
So if you see humans as examples of entities that maintain their values across ontology changes and model splinterings, I would strongly disagree. If you see them as entities that sorta-kinda maintain and adjust their values, preserving something of what happened before, then I agree. That to me is value extrapolation, for which humans have shown a certain skill (and many failings). And I’m very interested in automating that, though I’m sceptical that the purely human version of it can extrapolate all the way up to superintelligence.
Hm, thanks for the additional comment, but I mostly think we are using words and frames differently, and disagree with my understanding of what you think values are.
We have an AI-CEO money maximiser, with the stock price ticker as its reward function. As long as the AI is constrained and weak, it continues to increase the value of the company; when it becomes powerful, it wireheads and takes over the stock price ticker.
Reward is not the optimization target.
Their previous “values” seemed to be a bundle of “have children, enjoy sex” and this has now been wireheaded into “enjoy sex”.
I think this is not what happened. Those desires are likely downstream of past reinforcement of different kinds; I do not think there is a “wireheading” mechanism here. Wireheading is a very specific kind of antecedent-computation-reinforcement chasing behavior, on my ontology.
I’m sceptical that the purely human version of it can extrapolate all the way up to superintelligence.
Not at all what I’m angling at. There’s a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don’t copy the algorithm.
Not at all what I’m angling at. There’s a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don’t copy the algorithm.
I agree that humans navigate “model splinterings” quite well. But I actually think the algorithm might be more important than the generators. The generators come from evolution and human experience in our actual world; this doesn’t seem like it would generalise. The algorithm itself, though, may be very generalisable (potential analogy: humans have an instinctive grasp of all numbers under five, due to various evolutionary pressures, but we produced the addition algorithm, which is far more generalisable).
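To make that analogy concrete, here is a purely illustrative sketch (all function names are made up): a fixed lookup that only covers sums up to five, next to a schoolbook addition algorithm that, once constructed, works for numbers of any size.

```python
# Illustrative contrast: an "innate" capability with a tiny fixed domain
# versus a constructed algorithm that generalises. Names are hypothetical.

# "Generator"-style capability: handed to us ready-made, narrow in scope.
SMALL_SUMS = {(a, b): a + b for a in range(6) for b in range(6) if a + b <= 5}

def innate_sum(a: int, b: int) -> int:
    return SMALL_SUMS[(a, b)]   # KeyError outside the instinctive range

# "Algorithm"-style capability: column addition, generalises to any size.
def addition_algorithm(a: str, b: str) -> str:
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(innate_sum(2, 3))                        # 5
print(addition_algorithm("123456", "987654"))  # 1111110
```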
I’m not sure that we disagree much. We may just have different emphases and slightly different takes on the same question?
Yes and no. I think most of our disagreements are probably about things like “what is instinctual?” and “what is the type signature of human values?”, and not about “should we understand what people are doing?”.
The generators come from evolution and human experience in our actual world
By “generators”, I mean “the principles by which the algorithm operates”, which means the generators are found by studying the within-lifetime human learning process.
potential analogy: humans have an instinctive grasp of all numbers under five, due to various evolutionary pressures
Dubious to me due to information inaccessibility & random initialization of neocortex (which is a thing I am reasonably confident in). I think it’s more likely that our architecture & compute & learning process makes it convergent to learn this quick ≤5 number-sense.