Steven Byrnes comments on “The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes 26 Apr 2025 15:15 UTC
LW: 3 AF: 2
under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
- Best friends, and friend groups, do exist, and people do really enjoy and value them.
- The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
- As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosity drive, etc., because those determine the object-level preferences that I would be neglecting in favor of following Zoe’s whims.
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
As I always say, my claim is “understanding human social instincts seems like it would probably be useful for Safe & Beneficial AGI”, not “here is my plan for Safe & Beneficial AGI…”. :) That said, copying from §5.2 here:
…Alternatively, there’s a cop-out option! If the outcome of the [distributional shifts] is so hard to predict, we could choose to endorse the process while being agnostic about the outcome.
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them. The problem is the same—the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions. But this is how it’s always been, and we generally feel good about it. [Not that we’ve ever had a choice!]
So the idea here is, we could make AIs with social and moral drives that we’re happy about, probably because those drives are human-like in some respects [but probably not in all respects!], and then we could define “good” as whatever those AIs wind up wanting, upon learning and reflection.
- Wei Dai 26 Apr 2025 23:53 UTC
  LW: 4 AF: 4
  
  As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.
  
  Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely or mostly social, given lack of practical utility, but then why divergence in interests? (Understood that you don’t have a complete understanding yet, so I’m just flagging this as a potential puzzle, not demanding an immediate answer.)
  
  There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them.
  
  I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we’re still in an arguably ok position. However, AI will make this false very quickly by advancing technology.
  
  So I really want the AI transition to be an opportunity for improving the basic dynamic of “the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions”, for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), and not just say that it’s always been this way, so it’s fine to continue.
  
  I’m going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...
  - Steven Byrnes 27 Apr 2025 13:36 UTC
    LW: 2 AF: 2
    Thanks!
    Why does this happen in the first place
    There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competition over the conversation topic—Joey keeps changing the subject to trains, Billy to dinosaurs. Why do they each want to share / show off their new knowledge? Because of “drive to feel liked / admired”. From Joey’s perspective, the moment of telling Billy a cool new fact about trains, and seeing the excited look on Billy’s face, is intrinsically motivating. And vice-versa for Billy. But they can’t both satisfy that innate drive simultaneously.
    OK, then you’ll say, why didn’t they both read the same book last night? Well, maybe the library only had one copy. But also, if they had both read the same book, and thus both learned the same cool facts about bulldozers, then that’s a lose-lose instead of a win-win! Neither gets to satisfy their drive to feel liked/admired by thrilling the other with cool exciting new-to-them facts.
    Do you buy that? Happy to talk through more vignettes.
    I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible…
    OK, here’s another way to put it. When humans do good philosophy and ethics, they’re combining their ability to observe and reason (the source of all “is”) with their social and moral instincts / reflexes (the source of all “ought”, see Valence §2.7).
    If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes. (And if it’s not possible for this process to lead somewhere good, then we’re screwed no matter what.)
    Indeed, there are reasons to think the odds of this process leading somewhere good might be better in an AI than in humans, since the programmers can in principle turn the knobs on the various social and moral drives. (Some humans are more motivated by status and self-aggrandizement than others, and I think this is related to interpersonal variation in the strengths of different innate drives. It would be nice to just have a knob in the source code for that!) To be clear, there are also reasons to think the odds might be worse in an AI, since humans are at least a known quantity, and we might not really know what we’re doing in AGI programming until it’s too late.
    On my current thinking, it will be very difficult to avoid fast takeoff and unipolar outcomes, for better or worse. (Call me old-fashioned.) (I’m stating this without justification.) If the resulting singleton ASI has the right ingredients to do philosophy and ethics (i.e., it has both the ability to reason, and human-like social and moral instincts / reflexes), but with superhuman speed and insight, then the future could turn out great. Is it an irreversible single-point-of-failure for the future of the universe? You bet! And that’s bad. But I’m currently pretty skeptical that we can avoid that kind of thing (e.g. I expect very low compute requirements that make AGI governance / anti-proliferation rather hopeless), so I’m more in the market for “least bad option” rather than “ideal plan”.
    - Wei Dai 2 May 2025 0:15 UTC
      LW: 6 AF: 4
      
      If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.
      
      1. A counterexample to this: if humans and AIs both tend to conclude after a lot of reflection that they should be axiologically selfish but decision theoretically cooperative (with other strong agents), then if we hand off power to AIs, they’ll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed.
      2. Another problem is that we’re relatively confident that at least some humans can reason “successfully”, in the sense of making philosophical progress, but we don’t know the same about AI. There seemingly are reasons to think it might be especially hard for AI to learn this, and easy for AI to learn something undesirable instead, like optimizing for how persuasive their philosophical arguments are to (certain) humans.
      3. Finally, I find your arguments against moral realism somewhat convincing, but I’m still pretty uncertain, and I think the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum are still somewhat convincing as well, so I don’t want to bet the universe on or against any of these positions.
      - Steven Byrnes 5 May 2025 14:55 UTC
        LW: 5 AF: 4
        Thanks!
        RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are human-like in a nuts-and-bolts sense, i.e. they’re actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”. That is more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.
        I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.
        RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of costs vs benefits of cooperation, including decision theory, reputational consequences, etc., are all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question about whom those feelings will be directed towards: AIs, humans, animals, teddy bears, or what? It needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.
        RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)
        By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)
        Wei Dai 10 May 2025 7:33 UTC
        LW: 4 AF: 4
        
        For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch.
        
        I think this could be part of a viable approach, for example if we figure out in detail how humans invented philosophy and use that knowledge to design/train an AI that we can have high justified confidence will be philosophically competent. I’m worried that in actual development of brain-like AGI, we will skip this part (because it’s too hard, or nobody pushes for it), and end up just assuming that the AGI will invent or learn philosophy because it’s brain-like. (And then it ends up not doing that because we didn’t give it some “secret sauce” that humans have.) And this does look fairly hard to me, because we don’t yet understand the nature of philosophy or what constitutes correct philosophical reasoning or philosophical competence, so how do we study these things in either humans or AIs?
        
        But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress].
        
        I find it pretty plausible that wireheading is what we’ll end up wanting, after sufficient reflection. (This could be literal wireheading, or something more complex like VR with artificial friends, connections, etc.) This seems to me to be the default, unless we have reasons to want to avoid it. Currently my reasons are 1. value uncertainty (maybe we’ll eventually find good intrinsic reasons to not want to wirehead) 2. opportunity costs (if I wirehead now, it’ll cost me in terms of both quality and quantity vs if I wait for more resources and security). But it seems foreseeable that both of these reasons will go away at some point, and we may not have found other good reasons to avoid wireheading by then.
        
        (Of course if the right values for us are to max out our wireheading, then we shouldn’t hand the universe off to the AGIs that will want to max out their wireheading. Also, this is just the simplest example of how brain-like AGIs’ values could conflict with ours.)
        
        I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective.
        
        You’re prioritizing instilling AGI with the right social instincts, and arguing that’s very important for getting the AGI to eventually converge on values that we’d consider good. But under different meta-ethical views, you should perhaps be prioritizing something else. For example, under moral realism, it doesn’t matter what social instincts the AGI starts out with; what’s more important is that it has the capacity to eventually find and be motivated by objective moral truths.
        
        From my perspective, in order to not bet the universe on a particular meta-ethical view, it seems that we need to either hold off on building AGI until we definitively solve metaethics (e.g., it’s no longer a contentious subject in academic philosophy), or have an approach/plan that will work out well regardless of which meta-ethical view turns out to be correct.
        
        By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”.
        
        Thanks, I appreciate this, but of course still feel compelled to speak up when I see areas where you seem overly optimistic and/or missing some potential risks/failure modes.