I thought about it some more and want to propose another framing.
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent’s feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won’t even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.
The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is. It’s a kind of RL for which the death of one creature is not the end. In other words, we can function because we’ve delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there’s a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)
This suggests to me that if we want the rubber to meet the road—if we want the agent to have behaviors that track the world, not just the agent’s own feelings—then the optimization process that created the agent cannot be the agent’s own RL. By itself, RL can only learn to care about “behavioral reward”, as you put it. Caring about the world can only arise if the agent “inherits” that caring from some other process in the world, by makeup or imitation.
This conclusion might be a bit disappointing, because finding the right process to “inherit” from isn’t easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn’t the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can’t hope that the agents will learn it by some clever RL. It has to be due to the agent’s makeup or imitation.
This is all a bit tentative; I was just writing out the ideas as they came. Not at all sure that any of it is right. But anyway, what do you think?
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.
I think RL agents (at least, of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards; rather, it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing with that, but I’m not quite sure why.
In other words, specification gaming is broader than wireheading. So for example, a “behaviorist reward” would be “+1 if the big red reward button on the wall gets pressed”. An agent with that reward, who is able to see the button, will probably wind up wanting the button to be pressed, and seizing control of the button if possible. But if the button is connected to a wire behind the wall, the agent won’t necessarily want to bypass the button by breaking open the wall and cutting the wire and shorting it. (If it did break open the wall and short the wire, then its value function would update, and it would get “addicted” to that, and going forward it would want to do that again, and things like it, in the future. But it won’t necessarily want to do that in the first place. This is an example of goal misgeneralization.)
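To make that concrete, here is a toy TD(0) sketch (all states, rewards, and numbers are invented purely for illustration, not from any actual setup) of how credit ends up on the world-state that immediately precedes reward, i.e. the button press, rather than on the reward signal itself, and why a never-visited shortcut starts out with no value:

```python
import random

# Toy episodic environment, invented purely for illustration: the agent either
# presses the big red button or does nothing. "button_pressed" is the world-state
# that immediately precedes reward; the reward register itself is not a state.
STATES = ["start", "button_pressed", "did_nothing", "terminal"]

def step(state, action):
    """Return (next_state, reward) for this toy environment."""
    if state == "start":
        if action == "press_button":
            return "button_pressed", 0.0
        return "did_nothing", 0.0
    if state == "button_pressed":
        return "terminal", 1.0   # reward arrives right after the button press
    return "terminal", 0.0       # doing nothing leads nowhere rewarding

V = {s: 0.0 for s in STATES}     # learned value function over world-states
alpha, gamma = 0.1, 0.9

for _ in range(5000):
    state = "start"
    while state != "terminal":
        action = random.choice(["press_button", "do_nothing"])
        next_state, reward = step(state, action)
        # TD(0) update: credit flows backward onto states that tend to precede reward
        V[state] += alpha * (reward + gamma * V[next_state] - V[state])
        state = next_state

print(V)
# Typically V["button_pressed"] ends up near 1.0 and V["did_nothing"] near 0.0: the
# agent comes to "want" the salient precursor of reward. A state it has never visited
# (say, "wall broken open and wire shorted") carries no learned value until it is tried.
```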
imagine an environment where any mistake kills the agent. In such an environment, RL is impossible. The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is.
Evolution partly acts through RL. For example, falling off a cliff often leads to death, so we evolved fear of heights, which is an innate drive / primary reward that then leads (via RL) to flexible intelligent cliff-avoiding behavior.
But also evolution has put a number of things into our brains that are not the main RL system. For example, there’s an innate reflex controlling how and when to vomit. (You obviously don’t learn how and when to vomit by trial-and-error!)
here I disagree with you a bit: I think most of human learning is imitation
When I said human learning is not “imitation learning”, I was using the latter term in a specific algorithmic sense as described in §2.3.2. I certainly agree that people imitate other people and learn from culture, and that this is a very important fact about humans. I just think it happens through the human brain’s within-lifetime RL system—humans generally imitate other people and copy culture because they want to, because it feels like a good and right thing to do.
Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that’s the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
even if during training it already knew that buttons are often connected to wires
I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.
A slightly smarter agent would turn its gaze slightly closer to the reward itself.
I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.
Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here.
If that’s the plan, to me at first glance it seems a bit brittle.
I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here.
…But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here?
If so, one reason for my surmise that a solution probably exists is the fact that at least some humans seem to have good values, including some very smart and ambitious humans.
And see also “The bio-determinist child-rearing rule of thumb” here, which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But more wild variation in training environment, e.g. feral children, does seem to matter.] And also Heritability, Behaviorism, and Within-Lifetime RL.
My perspective (well, the one that came to me during this conversation) is indeed “I don’t want to take cocaine → human-level RL is not the full story”: our attachment to real-world outcomes and our reluctance to wirehead are due to evolution-level RL, not human-level RL. So I’m not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.
It’s just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it’ll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it’ll start to realize that behind the button there’s a wire, and the wire leads to the agent’s own reward circuit and so on.
Can you engineer things just right, so the agent learns to care about just the right level of “realness”? I don’t know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: “you’ll care about reality in this specific way”. So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the “realness”? That’s the point I was trying to make a couple comments ago, but maybe didn’t phrase it well.
Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.
(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in the future.)
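To illustrate that online-learning point with a toy sketch (states and numbers invented; nothing beyond the TD picture sketched earlier): once watching the video keeps failing to be followed by reward, its learned value decays toward zero, just like the jazz in the depressed listener example:

```python
# Toy illustration of continuous online learning (states and numbers invented):
# a naive agent starts out valuing a video of the button press as much as the real thing.
alpha, gamma = 0.2, 0.9
V = {"real_button_pressed": 1.0, "video_of_button_pressed": 1.0}

def online_update(state, reward, next_value=0.0):
    """One TD(0) update made after actually living through `state`."""
    V[state] += alpha * (reward + gamma * next_value - V[state])

for _ in range(30):
    online_update("video_of_button_pressed", reward=0.0)  # no reward ever follows the video
    online_update("real_button_pressed", reward=1.0)      # the real press still pays off

print(V)
# The value of the video collapses toward 0 ("oh, that's not the real thing"), while the
# real button press stays near 1.0; the same dynamic as the jazz example, where the
# learned value of putting the music on drops once doing so stops feeling rewarding.
```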
Wireheading is indeed an attractor, just like getting hooked on an addictive drug is an attractor. As soon as you try it, your value function will update, and then you’ll want to do it again. But before you try it, your value function has not updated, and it’s that not-updated value function that gets to evaluate whether taking an addictive drug is a good plan or bad plan. See also my discussion of “observation-utility agents” here. I don’t think you can get hooked on addictive drugs just by deeply understanding how they work.
So by the same token, it’s possible for our hypothetical agent to think that the pressing of the actual wired-up button is the best thing in the world. Cutting into the wall and shorting the wire would be bad, because it would destroy the thing that is best in the world, while also brainwashing me to not even care about the button, which adds insult to injury. This isn’t a false belief—it’s an ought not an is. I don’t think it’s reflectively-unstable either.
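Here is a toy sketch of that evaluation step (again, all states, values, and the little world model are invented for illustration): the plan is scored by the current, not-yet-updated value function over predicted world-states, so shorting the wire looks bad even though the agent correctly predicts it would make the reward register read high:

```python
# Current learned values over world-states. "wire shorted" has never been experienced,
# so it carries no learned value; "button destroyed" is bad, because the button being
# pressed is currently the best thing in the world from the agent's point of view.
V = {
    "button_pressed": 1.0,
    "button_intact": 0.5,
    "wall_broken_wire_shorted": 0.0,
    "button_destroyed": -0.5,
}

# A crude world model: the outcomes each plan is predicted to bring about, plus what
# the agent (correctly!) predicts the raw reward register would read afterwards.
PLANS = {
    "press_button":   {"outcomes": ["button_pressed", "button_intact"],
                       "predicted_register": 1.0},
    "short_the_wire": {"outcomes": ["wall_broken_wire_shorted", "button_destroyed"],
                       "predicted_register": 99.0},
}

def plan_score(plan):
    # The not-yet-updated value function evaluates the predicted outcomes; the
    # predicted register reading plays no role in the decision.
    return sum(V[s] for s in PLANS[plan]["outcomes"])

for name in PLANS:
    print(name, plan_score(name))
# press_button scores 1.5; short_the_wire scores -0.5 despite the huge predicted
# register reading. Only after actually shorting the wire would TD updates make that
# state valuable -- the "addiction" attractor, which the current values steer away from.
```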