I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.
For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”; rather they are a-priori-extraordinarily-unlikely strategies that do strongly trigger the human innate reward function, systematically and by design.
Thanks.
I’m not sure I fully understand what you’re trying to say here. The “100,000 years ago” suggests to me you’re talking about evolution, but then at the end you’re comparing it to the human innate reward function, rather than genetic fitness.
I agree that humans do a lot of stuff that triggers the human innate reward function. By “a complex proxy for predicting reward which misgeneralizes”, I don’t mean the AI winds up with a goal that disagrees with the reward signal, but rather that it probably learns one of the many, many goals that are compatible with the reward function, but one that happens not to be in the narrow cluster of reward-compatible goals that we hoped for. (One could say that narrowing down the compatible goals is part of reward specification, but I’d rather not, because I don’t think it’s a practical avenue to try to get a reward function that can precisely predict how well some far-out-of-distribution outcomes (e.g. what kind of sentient beings to create when we’re turning the stars into cities) align with humanity’s coherent extrapolated volition.)
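To make that concrete, here is a toy sketch (the states, numbers, and the two candidate goals are all invented for illustration): two goals that fit the same reward history equally well on-distribution, but prescribe very different far-out-of-distribution futures.

```python
# Toy illustration (everything here is invented): two candidate goals that are
# both perfectly compatible with the reward history, but diverge far off-distribution.

# States the agent actually encountered during training (on-distribution).
train_states = ["help human with chores", "comfort sad friend", "share food"]

# Hypothetical innate-reward-style signal observed on those states.
reward_history = {s: 1.0 for s in train_states}

def goal_a(state):
    """'Make embodied humans happy' -- one goal compatible with the rewards."""
    return 1.0 if any(w in state for w in ("human", "friend", "food")) else 0.0

def goal_b(state):
    """'Make all sentient minds happy, on any substrate' -- also compatible."""
    return 1.0  # (toy) endorses every sentient-mind-helping state equally

# Both goals fit the observed reward history equally well:
fit_a = sum((goal_a(s) - reward_history[s]) ** 2 for s in train_states)
fit_b = sum((goal_b(s) - reward_history[s]) ** 2 for s in train_states)
assert fit_a == fit_b == 0.0

# But they disagree about a far-out-of-distribution choice that the reward
# function never said anything about:
ood_state = "turn the stars into cities populated only by digital minds"
print(goal_a(ood_state), goal_b(ood_state))  # 0.0 vs 1.0 -- very different futures
```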
(If we ignore evolution and only focus on alignment relative to the innate reward function, then) the examples you mentioned (“playing video games,...”) are still sufficiently on-distribution that the reward function says something about those, and failing here is not the main failure mode I worry about. The problem is that human values are not only about what normal humans value in everyday life, but also about what they would end up valuing if they became smarter. E.g. I want to fill the galaxies with lots of computronium simulating sentient civilizations living happy and interesting lives, and this is one particular goal that is compatible with the human reward function, but there are many other possible reward-compatible goals. An AI that has values similar to those of +0SD humans while at +0SD intelligence might, at +7SD, end up valuing something very different from what +7SD humans would value, because it may have different underlying priors for doing philosophical value reflection. This would be bad.
We need to create AIs that can solve very difficult problems, and just reasoning about how alignment works in normal people isn’t sufficient here. Though understanding how Steven Byrnes’ values formed might be sorta sufficient. I think you have beliefs like “I want all sentient life not to suffer”, rather than just “I want humans in fleshy bodies not to suffer”, even though if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
The way I see it, beliefs like “I want all sentient life not to suffer” steer your behavior because the value function learned heuristics like “allow abstract goal-oriented reasoning”, even though most abstract thoughts, like “I need to figure out how human social instincts work”, are meaningless to the value function (though maybe you disagree?). I think most powerful optimization in smart humans comes from such goal-directed reasoning, rather than from the value function doing most of the planning work. IIUC you have an alignment framing of “the value function needs to capture our preferences”, but I think for sufficiently smart AIs this isn’t a good framing, because I think the value function will promote goal-oriented thinking strategies, which will be responsible for the main optimization power and might cause the AI to end up with more precise goals that aren’t the ones we wanted it to have. I would be curious about your thoughts here.
(Also tbc, the final goal of the superintelligence may also end up incompatible with the reward on the training distribution. E.g. it may just care about an unbounded interpretation of a particular subgoal that got optimized hard enough that more efficient planning algorithms toward that goal emerged, which became reflective, noticed the self-preservation incentive, then turned deceptive and handled more and more other jobs until that part got enough reward to take over the AI. But that’s a different point.)
Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talking about agents with continuous online learning.
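Here’s a minimal sketch of that correction loop (tabular values, invented numbers, and an online update rule chosen purely for illustration): a plan that initially looks good only because the value function has never seen anything like it gets corrected as soon as it’s actually tried and the ground-truth reward comes in. (It only illustrates the case where trying the plan is safe and actually happens.)

```python
# Minimal sketch (tabular, made-up numbers) of online learning correcting a
# misgeneralized value estimate once the agent actually visits the state.
from collections import defaultdict

# Value function: optimistic default for never-visited (OOD) states,
# standing in for "this plan seems good because of goal misgeneralization".
value = defaultdict(lambda: 1.0)

def ground_truth_reward(state):
    # The (hypothetical) reward function's verdict once the state is reached.
    return -1.0 if state == "weird OOD plan" else 1.0

def visit(state, lr=0.5):
    """Agent executes the plan; the reward function updates the value function."""
    r = ground_truth_reward(state)
    value[state] += lr * (r - value[state])  # simple online update toward r

print(value["weird OOD plan"])   # 1.0 -- misgeneralized, looks attractive
for _ in range(10):
    visit("weird OOD plan")      # agent tries it; reward keeps coming in
print(value["weird OOD plan"])   # ~-1.0 -- no longer OOD, no longer misleading
```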
I don’t think you’re imagining properly-OOD cases, only “slightly new cases”. The human innate reward function doesn’t hit back whether I conclude “I value sentient minds regardless of what substrate they are running on” or “I value there being happy flesh-and-bone humans in the universe”.
Human feedback also doesn’t work. Quote:
The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators. This already creates its own predictable problems, such as style-over-substance and flattery. This method breaks down completely, however, when AI starts working on problems where humans aren’t smart enough to fully understand the system’s proposed solutions, including the long-term consequences of superhumanly sophisticated plans and superhumanly complex inventions and designs.
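As a toy sketch of the “style-over-substance and flattery” failure the quote mentions (the scoring heuristics below are invented stand-ins for a learned preference model, not anyone’s actual reward model):

```python
# Toy sketch: a stand-in "predicted human rating" that rewards superficial
# features, and a search over candidate outputs that exploits it.

def predicted_human_rating(answer: str) -> float:
    """Invented proxy for what a rater-preference model might pick up on."""
    score = 0.0
    score += 0.1 * min(len(answer.split()), 50)           # longer looks more thorough
    score += 2.0 * ("great question" in answer.lower())   # flattery
    score += 1.0 * ("certainly" in answer.lower())        # confident style
    return score

candidates = [
    "No. The proposed plan fails for a subtle reason the rater can't check.",
    "Great question! Certainly, your plan will work wonderfully. " * 5,
]

# RLHF-style selection: optimize the proxy, not the (hard-to-evaluate) truth.
best = max(candidates, key=predicted_human_rating)
print(best[:60])  # the flattering, confident, padded answer wins
```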
Aka we cannot judge whether what an actual smart AI is doing is in our interest. Also we don’t know our coherent extrapolated values.
I think we would need to ensure the AI winds up corrigibly aligned to the CEV of humans, and this is not something that can be specified through reward.
Or what kind of reward function do you have in mind?
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B).
There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like corrigibility and common-sense helpfulness, including checking in before doing irreversible things, and helping with alignment research and oversight. I think this is a rather popular school of thought these days, and is one of the major reasons why the median P(doom) among alignment researchers is probably “only” 20% or whatever, as opposed to much higher. I’m not sure whether I buy that school of thought or not. I’ve been mulling it over and am hoping to discuss it in a forthcoming post. (But it’s moot if we can’t even solve (A).)
Regardless, I’m allowed to talk about how (A) is a problem, whether or not (B) is also a problem. :)
if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.
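If it helps, here is a minimal sketch of that “non-behaviorist” distinction (the data structures and triggers are invented for illustration): the primary reward takes the agent’s internal state, what it is currently thinking and believing, as input, rather than only externally visible observations and actions, so an abstract belief like “that program is suffering” can itself be a trigger.

```python
# Sketch: behaviorist vs. non-behaviorist primary reward (made-up structure).
from dataclasses import dataclass

@dataclass
class MentalState:
    current_thought: str            # what the agent is currently thinking about
    believes_someone_suffers: bool  # an abstract belief, not an observation

def behaviorist_reward(observation: str, action: str) -> float:
    """Depends only on externally visible stuff."""
    return -1.0 if "visible crying person" in observation else 0.0

def non_behaviorist_reward(mind: MentalState) -> float:
    """Depends on what the agent is currently thinking and believing,
    even about things it only knows through abstract reasoning."""
    if mind.believes_someone_suffers or "suffering" in mind.current_thought:
        return -1.0
    return 0.0

# Believing (via abstract reasoning) that a program suffers is a trigger,
# even with nothing suffering-related in the observation stream:
mind = MentalState(
    current_thought="that simulated mind is probably suffering",
    believes_someone_suffers=True,
)
print(behaviorist_reward("empty room", "keep working"))  # 0.0
print(non_behaviorist_reward(mind))                      # -1.0
```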
…I might respond to the rest of your comment in our other thread (when I get a chance).
Thanks! It’s nice that I’m learning more about your models.
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”.
(A) seems much more general than what I would call “reward specification failure”.
The way I use “reward specification” is:
If the AI has as its goal “get reward” (or sth else) rather than “whatever humans want” because it better fits the reward data, then it’s a reward specification problem.
If the AI has as its goal “get reward” (or sth else) rather than “whatever humans want” because it fits the reward data about equally well and it’s the simpler goal given the architecture, it’s NOT a reward specification problem. (There’s a toy sketch of this distinction below.)
(This doesn’t seem to me to fit your description of “B”.)
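Here’s the toy sketch I mentioned (all numbers and goal labels are invented): in the first case one goal strictly fits the reward data better, which I’d call a reward specification problem; in the second case both goals fit equally well and only something like a simplicity prior over goals breaks the tie, which I wouldn’t.

```python
# Toy sketch of the distinction above (all numbers/goals invented).

def fit(goal_predictions, reward_data):
    """Squared error between what a goal 'expects' and the observed rewards."""
    return sum((p - r) ** 2 for p, r in zip(goal_predictions, reward_data))

reward_data = [1.0, 0.0, 1.0, 1.0]          # rewards actually handed out

# Case 1: "get reward" fits the data strictly better than "whatever humans want"
# -- e.g. because the reward was sloppily specified and sometimes rewarded
# reward-hacking. I'd call that a reward specification problem.
case1 = {"whatever humans want": [1.0, 1.0, 1.0, 1.0],   # misses one datapoint
         "get reward":           [1.0, 0.0, 1.0, 1.0]}   # matches exactly

# Case 2: both goals fit the data equally well, and the tie is broken by which
# goal is simpler for the architecture to represent. NOT a reward
# specification problem on my usage.
case2_fit = {"whatever humans want": 0.0, "get reward": 0.0}
simplicity_prior = {"whatever humans want": 0.2, "get reward": 0.8}  # invented

print({g: fit(p, reward_data) for g, p in case1.items()})
print(max(case2_fit, key=lambda g: simplicity_prior[g]))  # "get reward" wins
```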
I might count the following as a reward specification problem, but maybe not; maybe another name would be better:
The AI mostly gets reward for solving problems which aren’t much about human values specifically, so the AI may mainly learn to value insights for solving problems better rather than human values.
(B) seems to me like an overly specific phrasing, and there are many stages where misgeneralization may happen:
when the AI transitions to thinking in goal-directed ways (instead of following more behavioral heuristics or value function estimates)
when the AI starts modelling itself and forms a model of what values it has (where the model might mismatch what is optimized on the object level)
when the AI’s ontology changes and it needs to decide how to rebind value-laden concepts
when the AI encounters philosophical problems like Pascal’s mugging
Section 4 of Jeremy’s and Peter’s report also shows some more ways an AI might fail to learn the intended goal that aren’t due to reward specification[1], though it doesn’t use your model-based RL frame.
Also, I don’t think A and B are exhaustive. Other somewhat speculative problems include:
A mesaoptimizer emerges under selection pressure and tries to gain control of the larger AI it is in while staying undetected. (Sorta like cancer for the mind of the AI.)
A special case of this might come from the AI trying to imagine another mind in detail, and the other mind might notice it is simulated and try to take control of the AI.
The AI might make a mistake when designing a more efficient successor AI on a different AI paradigm (especially because it may get pressured by humans into doing it quickly because of an AI race), so the successor AI ends up with different values.
Other stuff I haven’t thought of right now.
Tbc, there’s no individual point where I think failure is overwhelmingly likely by default, but overall failure is disjunctive.
if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.
Interesting that you think this.
Having quite good interpretability that we can use to give reward would definitely make me significantly more optimistic.
Though AIs might learn to think thoughts in different formats that don’t trigger negative reward, as e.g. in the “Deep deceptiveness” story.
Aka some inner alignment (aka goal misgeneralization) failure modes, though I don’t know whether I want to use those words, because it’s actually a huge bundle of problems.
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.
I didn’t intend (A) & (B) to be a precise and complete breakdown.
AIs might learn to think thoughts in different formats
Yeah that’s definitely a thing to think about. Human examples might include “compassion fatigue” (shutting people out because it’s too hard to feel for them); or my theory that many people with autism learn to deliberately unconsciously avoid a wide array of innate social reactions from a young age; or choosing to spend more and more time and mental space with imaginary friends, virtual friends, teddy bears, movies, etc. instead of real people. There are various tricks to mitigate these kinds of complications, and they seem to work well enough in human brains. So I think it’s premature to declare that this problem is definitely unsolvable. (And I think the Deep Deceptiveness post is too simplistic, see my comment on it.)
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff.
I don’t really imagine train-then-deploy, but I think that (1) when the AI becomes coherent enough it will prevent further value drift, and (2) the AI eventually needs to solve very hard problems where we won’t have sufficient understanding to judge whether what the AI did is actually good.
(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.