An observation: it feels slightly stressful to have posted this. I have a mental simulation telling me that there are social forces around here that consider it morally wrong or an act of defection to suggest that alignment might be relatively easy, like it implied that I wasn’t taking the topic seriously enough or something. I don’t know how accurate that is, but that’s the vibe that my simulators are (maybe mistakenly) picking up.
That’s exactly why it’s wonderful and important that you in fact have posted this. For what it’s worth, I agree-ish with you. I have enough security mindset and general paranoia to be very scared that Eliezer is right and we’re all doomed, but I also have enough experience with LLMs and with phenomenology / introspection to realize that it just doesn’t seem like it would be that hard to make an entity that cares intrinsically about other entities.
Some voice in my head started screaming something like
“A human introspecting on human phenomenology does not provide reliable evidence about artificial intelligences! Remember that humans have a huge amount of complex built-in circuitry that operates subconsciously and makes certain complicated things—like kindness—feel simple/easy!”
Thought I’d share that. Wondering if you disagree with the voice.
I disagree very strongly with the voice. Those complicated things are only complicated because we don’t introspect about them hard enough, not for any intrinsic reasons. I also think most people just don’t have enough self-awareness to be able to perceive their thoughts forming and get a sense of the underlying logic. I’m not saying I can do it perfectly, but I can do it much better than the average person. Consider all the psychological wisdom of Buddhism, which came from people without modern science just paying really close attention to their own minds for a long time.
Interesting!
My impression is that the human brain is in fact intrinsically quite complex![1]
I think {most people’s introspective abilities} are irrelevant. (But FWIW, given that lots of people seem to e.g. conflate a verbal stream with thought, I agree that median human introspective abilities are probably kinda terrible.)
Unfortunately I’m not familiar with the wisdom of Buddhism, so that doesn’t provide me with much evidence either way :-/
An obvious way to test how complex a thing X really is, or how well one understands it, is to (attempt to) implement it as code or math. If the resulting software is not very long, and actually captures all the relevant aspects of X, then indeed X is not very complex.
Are you able to write software that implements (e.g.) kindness, prosociality, or “an entity that cares intrinsically about other entities”[2]? Or write an informal sketch of such math/code? If yes, I’d be very curious to see it! [3]
Like, even if only ~1% of the information in the human genome is about how to wire the human brain, that’d still be ~10 MB worth of info/code. And that’s just the code for how to learn from vast amounts of sensory data; an adult human brain would contain vastly more structure/information than that 10 MB. I’m not sure how to estimate how much, but given the vast amounts of “training data” and “training time” that go into a human child, I wouldn’t be surprised if it were in the ballpark of hundreds of terabytes. If even 0.01% of that info is about kindness/prosociality/etc., then we’re still talking about something like 10 GB worth of information. This (and other reasoning) leads me to feel moderately sure that things like “kindness” are in fact rather complex.
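For concreteness, here is a minimal sketch of the back-of-the-envelope arithmetic above. The genome size is a rough public figure, and both percentages are just the illustrative assumptions from this comment, not measured quantities:

```python
# Back-of-the-envelope arithmetic from the comment above.
# The percentages are illustrative assumptions, not measurements.

GENOME_BYTES = 3.2e9 * 2 / 8   # ~3.2 billion base pairs at 2 bits each ≈ 0.8 GB
wiring_info = 0.01 * GENOME_BYTES            # assume ~1% specifies brain wiring
print(f"genome-encoded wiring info: ~{wiring_info / 1e6:.0f} MB")   # ≈ 8 MB, i.e. "~10 MB"

ADULT_BRAIN_BYTES = 100e12     # assumed "hundreds of terabytes" of learned structure
kindness_info = 0.0001 * ADULT_BRAIN_BYTES   # assume 0.01% of it encodes kindness etc.
print(f"learned 'kindness' info: ~{kindness_info / 1e9:.0f} GB")     # ≈ 10 GB
```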
...and hopefully, in addition to “caring about other entities”, also tries to do something like “and implement the other entities’ CEV”.
Please don’t publish anything infohazardous, though, obviously.
It would in fact be infohazardous, but yes, I’ve kinda been doing all this introspection for years now with the intent of figuring out how to implement it in an AGI. In particular, I think there’s a nontrivial possibility that GPT-2 by itself is already AGI-complete and just needs to be prompted in the right intricate pattern to produce thoughts in a similar structure to how humans do. I do not have access to a GPU, so I cannot test and develop this, which is very frustrating to me.
I’m almost certainly wrong about how simple this is, but I need to be able to build and tweak a system actively in order to find out—and in particular, I’m really bad at explaining abstract ideas in my head, as most of them are more visual than verbal.
One bit that wouldn’t be infohazardous though afaik is the “caring intrinsically about other entities” bit. I’m sure you can see how a sufficiently intelligent language model could be used to predict, given a simulated future scenario, whether a simulated entity experiencing that scenario would prefer, upon being credibly given the choice, for the event to be undone / not have happened in the first place. This is intended to parallel the human ability—indeed, automatic subconscious tendency—to continually predict whether an action we are considering will contradict the preferences of others we care about, and choose not to do it if it will.
So, a starting point would be to try to make a model that tells a story, but regularly asks every entity being simulated if they want to undo the most recent generation, and does so if even one of them asks to do it. Would this result in a more ethical sequence of events? That’s one of the things I want to explore.
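Here is a minimal, purely illustrative sketch of that undo loop; `continue_story`, `list_entities`, and `wants_undo` are hypothetical stand-ins for whatever language-model calls would actually do the generation and the preference prediction:

```python
from typing import List

def continue_story(story: str) -> str:
    """Hypothetical stub: ask a language model for the next story segment."""
    raise NotImplementedError

def list_entities(story: str) -> List[str]:
    """Hypothetical stub: ask the model which entities are currently being simulated."""
    raise NotImplementedError

def wants_undo(story: str, segment: str, entity: str) -> bool:
    """Hypothetical stub: predict whether `entity`, credibly offered the choice,
    would prefer the events in `segment` to be undone / never have happened."""
    raise NotImplementedError

def tell_story(prompt: str, num_segments: int, max_retries: int = 5) -> str:
    """Generate a story, vetoing any segment that even one simulated entity wants undone."""
    story = prompt
    for _ in range(num_segments):
        for _ in range(max_retries):
            segment = continue_story(story)
            # Undo (i.e. discard and regenerate) if any simulated entity objects.
            if not any(wants_undo(story, segment, e) for e in list_entities(story)):
                story += segment
                break
        # If every retry was vetoed, the story simply doesn't advance this round.
    return story
```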
I’m guessing this might be due to something like the following:
(There is a common belief on LW that) Most people do not take AI x/s-risk anywhere near seriously enough; that most people who do think about AI x/s-risk are far too optimistic about how easy alignment is; and that most people who do concede >10% p(doom) are not actually acting with anywhere near the level of caution that their professed beliefs would imply to be sensible.
If alignment is indeed difficult, then (AI labs) acting based on optimistic assumptions is very dangerous, and could lead to astronomical loss of value (or astronomical disvalue).
Hence: Pushback against memes suggesting that alignment might be easy.
I think there might sometimes be something going on along the lines of “distort the Map in order to compensate for a currently-probably-terrible policy of acting in the Territory”.
Analogy: If, when moving through Territory, you find yourself consistently drifting further east than you intend, then the sane solution is to correct how you move in the Territory; the sane solution is not to skew your map westward to compensate for your drifting. But what if you’re stuck in a bus steered by insane east-drifting monkeys, and you don’t have access to the steering wheel?
Like, if most people are obviously failing egregiously at acting sanely in the face of x/s-risks, due to those people being insane in various ways
(“but it might be easy!”, “this alignment plan has the word ‘democracy’ in it, obviously it’s a good plan!”, “but we need to get the banana before those other monkeys get it!”, “obviously working on capabilities is a good thing, I know because I get so much money and status for doing it”, “I feel good about this plan, that means it’ll probably work”, etc.),
then one of the levers you might (subconsciously) be tempted to try pulling is people’s estimate of p(doom). If everyone were sane/rational, then obviously you should never distort your probability estimates. But… clearly everyone is not sane/rational.
If that’s what’s going on (for many people), then I’m not sure what to think of it, or what to do about it. I wish the world were sane?
I feel like every week there’s a post that says, “I might be naive, but why can’t we just do X?”, where X is already well known and not considered sufficient. So it’s easy to see a post claiming a relatively direct solution as just being in that category.
The amount of effort and thinking in this case, plus the reputation of the poster, draws a clear distinction between the useless posts and this one, but it’s easy to imagine people pattern-matching their way into believing that this is also probably useless, without engaging with it.
(Ah, to clarify: I wasn’t saying that Kaj’s post seems insane; I was referring to the fact that lots of thinking/discourse in general about AI seems to be dangerously insane.)
This seems correct to me.
In my experience, the highest epistemic standards are achieved in the context of ‘nerds arguing on the internet’. If everyone agrees, all you have is an echo chamber.
I would argue that good-faith, high-effort contributions to any debate are something we should always be grateful for if we are seeking the truth.
I think the people who would be most concerned with ‘anti-doom’ arguments are those who believe it is existentially important to ‘win the argument/support the narrative/spread the meme’, i.e. that truthseeking isn’t as important as trying to embed a cultural norm of deep, deep precaution around AI.
To those people, I would say: I appreciate you, but there are better strategies to pursue in the game you’re trying to play. Collective Intelligence has the potential to completely change that game, and you’re pretty well positioned to leverage it.
I think you’re on the right track here and also super wrong. Congrats on being specific enough to be wrong, because this is insightful about things I’ve also been thinking, and allows people to be specific about where likely holes are. Gotta have negative examples to make progress! Don’t let the blurry EY in your head confuse your intuitions ;)
FWIW I at least found this to be insightful and enlightening. This seems clearly like a direction to explore more and one that could plausibly pan out.
I wonder if we would need to explore beyond the current “one big transformer” setup to realize this. I don’t think humans have a specialized brain region for simulations (though there is a region that seems heavily implicated, see https://www.mountsinai.org/about/newsroom/2012/researchers-identify-area-of-the-brain-that-processes-empathy), but if you want to train something using gradient descent, it might be easier to have a simulation module that predicts human preferences and is rewarded for accurate predictions, and then feed those predictions into the main decision-making model.
Perhaps we can use revealed preferences (inferred from behavior), combined with elicited preferences, to train the preference predictor. This is similar to the idea of training a separate world model rather than lumping it in with the main blob.
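To make the “separate preference-predictor module” idea slightly more concrete, here is a minimal PyTorch-style sketch. Everything in it (the dimensions, the names, and collapsing revealed and elicited preferences into a single scalar target) is a hypothetical simplification, not a claim about how this should actually be built:

```python
import torch
import torch.nn as nn

# Hypothetical shapes; all numbers here are illustrative assumptions.
STATE_DIM, ACTION_DIM = 32, 8

class PreferencePredictor(nn.Module):
    """Separate 'simulation' module: predicts how strongly an entity would
    favor a candidate action in a given scenario. It is trained only on its
    own prediction loss, analogous to training a world model separately."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

predictor = PreferencePredictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

# One gradient step on a toy batch. `targets` stands in for preference labels
# drawn from revealed preferences (behavior) and elicited preferences (asking).
states = torch.randn(16, STATE_DIM)
actions = torch.randn(16, ACTION_DIM)
targets = torch.rand(16)
loss = nn.functional.mse_loss(predictor(states, actions), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The main decision-making model would then take the predictor's (detached)
# scores as extra inputs, instead of learning preferences end-to-end.
```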