Steven Byrnes comments on “The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes 25 Apr 2025 18:54 UTC
LW: 5 AF: 3
Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.
So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).
Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”
If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.
If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.
If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.
The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term “drive to feel liked / admired”, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.
Does that help? Sorry if I missed your point.
- cousin_it 25 Apr 2025 22:59 UTC
  LW: 2 AF: 1
  I think it helps. The link to “non-behaviorist rewards” seems the most relevant. The way I interpret it (correct me if I’m wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
  
  The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
  - Steven Byrnes 26 Apr 2025 1:37 UTC
    LW: 4 AF: 3
    The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.
    “We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.
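A minimal sketch of the candy-store account above, assuming an idea is scored by adding up the learned valence of the concepts it activates, plus any primary reward tied to the idea itself. The `Valence` class, the learning rate, the additive scoring rule, and the threshold are all hypothetical simplifications for illustration, not a claim about the brain's actual algorithm.

```python
from collections import defaultdict


class Valence:
    """Toy valence store: how good each concept currently 'seems'."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.value = defaultdict(float)  # unseen concepts start neutral (0.0)

    def observe(self, active_concepts, primary_reward):
        """Nudge every concept that was active just before a primary reward toward it."""
        for concept in active_concepts:
            error = primary_reward - self.value[concept]
            self.value[concept] += self.learning_rate * error

    def appraise(self, idea_concepts, immediate_reward=0.0):
        """Score an idea from the concepts it activates right now (no lookahead sum)."""
        return immediate_reward + sum(self.value[c] for c in idea_concepts)


valence = Valence()
# Past experience: "eating candy" was active right before primary reward, three times.
for _ in range(3):
    valence.observe({"eating candy"}, primary_reward=1.0)

candy_idea = {"walking to store", "eating candy"}
chore_idea = {"walking to store", "filing taxes"}
THRESHOLD = 0.5  # hypothetical cutoff for "sufficiently good"

print("go to candy store?", valence.appraise(candy_idea) > THRESHOLD)
print("go file taxes?", valence.appraise(chore_idea) > THRESHOLD)
```

Note that nothing in `appraise` sums over future time steps: the idea seems good only because a concept it activates now has a learned positive valence.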
    > what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way.
    This doesn’t really parse for me … The reward function is an input to learning, it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
    - cousin_it 26 Apr 2025 10:50 UTC
      LW: 2 AF: 1
      I thought about it some more and want to propose another framing.
      
      The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent’s feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won’t even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.
      
      The reason we can function in such environments, I think, is that we aren’t the main learning process involved. Evolution is. It’s a kind of RL for which the death of one creature is not the end. In other words, we can function because we’ve delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there’s a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)
      
      This suggests to me that if we want the rubber to meet the road—if we want the agent to have behaviors that track the world, not just the agent’s own feelings—then the optimization process that created the agent cannot be the agent’s own RL. By itself, RL can only learn to care about “behavioral reward” as you put it. Caring about the world can only occur if the agent “inherits” that caring from some other process in the world, by makeup or imitation.
      
      This conclusion might be a bit disappointing, because finding the right process to “inherit” from isn’t easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn’t the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can’t hope that the agents will learn it by some clever RL. It has to be due to the agent’s makeup or imitation.
      
      This is all a bit tentative, I was just writing out the ideas as they came. Not sure at all that any of it is right. But anyway what do you think?
      - Steven Byrnes 27 Apr 2025 12:40 UTC
        LW: 2 AF: 2
        > The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.
        I think RL agents (at least, of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself, at least not primarily. This isn’t a special thing that requires non-behaviorist rewards; rather, it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing with that, but I’m not quite sure why.
        In other words, specification gaming is broader than wireheading. So for example, a “behaviorist reward” would be “+1 if the big red reward button on the wall gets pressed”. An agent with that reward, who is able to see the button, will probably wind up wanting the button to be pressed, and seizing control of the button if possible. But if the button is connected to a wire behind the wall, the agent won’t necessarily want to bypass the button by breaking open the wall and cutting the wire and shorting it. (If it did break open the wall and short the wire, then its value function would update, and it would get “addicted” to that, and going forward it would want to do that again, and things like it, in the future. But it won’t necessarily want to do that in the first place. This is an example of goal misgeneralization.)
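The button-and-wire story above can be sketched with tabular TD(0), under some loud assumptions: the state names, the learning rate, the discount factor, and especially the choice that the shorted wire pays a larger reward (5.0) than the button (1.0) are all made up for illustration. The sketch only shows the narrow point that an untried option has no learned value until it is actually experienced; it does not settle whether a smarter agent would generalize toward the wire.

```python
# Tabular TD(0) over a tiny hand-written world. All names and numbers are illustrative.
ALPHA, GAMMA = 0.5, 0.9

# The agent's world model: which state each action leads to from "idle".
MODEL = {"press_button": "button_pressed", "short_wire": "wire_shorted"}
# Reward delivered on leaving each state (the shorted wire paying more is an assumption).
REWARD = {"button_pressed": 1.0, "wire_shorted": 5.0}

V = {"idle": 0.0, "button_pressed": 0.0, "wire_shorted": 0.0, "end": 0.0}


def td_update(state, reward, next_state):
    """Standard TD(0): move V[state] toward reward + GAMMA * V[next_state]."""
    V[state] += ALPHA * (reward + GAMMA * V[next_state] - V[state])


def run_episode(action):
    """Experience idle -> outcome -> end, updating values along the way."""
    outcome = MODEL[action]
    td_update("idle", 0.0, outcome)
    td_update(outcome, REWARD[outcome], "end")


def preferred_action():
    """Pick the action whose predicted outcome currently has the highest learned value."""
    return max(MODEL, key=lambda a: V[MODEL[a]])


# The agent knows about the wire (it is in MODEL) but has only ever pressed the button.
for _ in range(20):
    run_episode("press_button")
before = preferred_action()

# A single trial of shorting the wire updates the value function.
run_episode("short_wire")
after = preferred_action()

print("before trying the wire:", before)
print("after trying the wire: ", after)
```

Before the one wire trial, the value of `wire_shorted` is still zero even though the agent's model includes it, so the agent keeps preferring the button; after the trial, the updated value function prefers the wire.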
        > imagine an environment where any mistake kills the agent. In such an environment, RL is impossible. The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is.
        Evolution partly acts through RL. For example, falling off a cliff often leads to death, so we evolved fear of heights, which is an innate drive / primary reward that then leads (via RL) to flexible intelligent cliff-avoiding behavior.
        But also evolution has put a number of things into our brains that are not the main RL system. For example, there’s an innate reflex controlling how and when to vomit. (You obviously don’t learn how and when to vomit by trial-and-error!)
        > here I disagree with you a bit: I think most of human learning is imitation
        When I said human learning is not “imitation learning”, I was using the latter term in a specific algorithmic sense as described in §2.3.2. I certainly agree that people imitate other people and learn from culture, and that this is a very important fact about humans. I just think it happens through the human brain’s within-lifetime RL system: humans generally imitate other people and copy culture because they want to, because it feels like a good and right thing to do.
        cousin_it 27 Apr 2025 14:24 UTC
        LW: 4 AF: 3
        Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
        
        In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that’s the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
        Steven Byrnes 27 Apr 2025 20:43 UTC
        LW: 2 AF: 2
        > even if during training it already knew that buttons are often connected to wires
        I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.
        > A slightly smarter agent would turn its gaze slightly closer to the reward itself.
        I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.
        Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here.
        > If that’s the plan, to me at first glance it seems a bit brittle.
        I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here.
        …But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here?
        If so, one reason for my surmise that a solution probably exists, is the fact that at least some humans seem to have good values, including some very smart and ambitious humans.
        And see also “The bio-determinist child-rearing rule of thumb” here, which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But wilder variation in training environment, e.g. feral children, does seem to matter.] See also “Heritability, Behaviorism, and Within-Lifetime RL”.
        cousin_it 27 Apr 2025 22:16 UTC
        LW: 6 AF: 5
        My perspective (well, the one that came to me during this conversation) is indeed “I don’t want to take cocaine → human-level RL is not the full story”. That is, our attachment to real-world outcomes and reluctance to wirehead are due to evolution-level RL, not human-level. So I’m not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.
        
        It’s just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it’ll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it’ll start to realize that behind the button there’s a wire, and the wire leads to the agent’s own reward circuit and so on.
        
        Can you engineer things just right, so the agent learns to care about just the right level of “realness”? I don’t know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: “you’ll care about reality in this specific way”. So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the “realness”? That’s the point I was trying to make a couple comments ago, but maybe didn’t phrase it well.
        Steven Byrnes 28 Apr 2025 18:15 UTC
        LW: 3 AF: 2
        Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.
        (In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in the future.)
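As a rough sketch of the online-learning point above: suppose the video initially seems as good as the real button press because it looks the same, but no primary reward ever follows it. The event names, the starting values, and the learning rate below are hypothetical; the update is a plain error-driven step toward the reward that actually arrived.

```python
ALPHA = 0.5  # learning rate (illustrative)

# Assumption: the video starts out "looking as good" as the real thing,
# because it produces the same sensory experience.
value = {"real_button_press": 1.0, "video_of_button_press": 1.0}
# What actually follows each event: only the real press triggers primary reward.
primary_reward = {"real_button_press": 1.0, "video_of_button_press": 0.0}


def experience(event):
    """One online update: move the event's value toward the reward that followed it."""
    value[event] += ALPHA * (primary_reward[event] - value[event])
    return value[event]


# Watching the video repeatedly: its learned value decays toward zero.
history = [experience("video_of_button_press") for _ in range(5)]
# Pressing the real button: its value is confirmed and stays where it was.
experience("real_button_press")

print("video value over time:", history)
print("final values:", value)
```

With continuous updating, the video's value shrinks each time the expected reward fails to arrive, while the real press keeps its value.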
        Wireheading is indeed an attractor, just like getting hooked on an addictive drug is an attractor. As soon as you try it, your value function will update, and then you’ll want to do it again. But before you try it, your value function has not updated, and it’s that not-updated value function that gets to evaluate whether taking an addictive drug is a good plan or bad plan. See also my discussion of “observation-utility agents” here. I don’t think you can get hooked on addictive drugs just by deeply understanding how they work.
        So by the same token, it’s possible for our hypothetical agent to think that the pressing of the actual wired-up button is the best thing in the world. Cutting into the wall and shorting the wire would be bad, because it would destroy the thing that is best in the world, while also brainwashing me to not even care about the button, which adds insult to injury. This isn’t a false belief; it’s an “ought”, not an “is”. I don’t think it’s reflectively unstable either.