I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B).
There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like corrigibility and common-sense helpfulness, including checking in before doing irreversible things, and helping with alignment research and oversight. I think this is a rather popular school of thought these days, and is one of the major reasons why the median P(doom) among alignment researchers is probably “only” 20% or whatever, as opposed to much higher. I’m not sure whether I buy that school of thought or not. I’ve been mulling it over and am hoping to discuss it in a forthcoming post. (But it’s moot if we can’t even solve (A).)
Regardless, I’m allowed to talk about how (A) is a problem, whether or not (B) is also a problem. :)
if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.
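To make that “non-behaviorist” idea a bit more concrete, here’s a minimal toy sketch (all function and field names below are hypothetical, purely for illustration, not any real system): the primary reward gets to read the agent’s current thoughts (world-model contents), not just its externally-visible behavior, so an abstract belief can trigger it.

```python
# Toy sketch of "behaviorist" vs "non-behaviorist" primary reward.
# All names here are hypothetical illustrations, not a real system.

def behaviorist_reward(observation: dict, action: str) -> float:
    # Depends only on externally-observable behavior / outcomes.
    return 1.0 if observation.get("human_is_smiling") else 0.0

def non_behaviorist_reward(observation: dict, action: str, thought: dict) -> float:
    # Also depends on what the agent is currently *thinking about*.
    # Merely believing (via abstract reasoning) that some mind is suffering
    # can trigger a negative primary reward, with no external cue at all.
    reward = 1.0 if observation.get("human_is_smiling") else 0.0
    if thought.get("believes_some_mind_is_suffering"):
        reward -= 1.0
    return reward
```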
…I might respond to the rest of your comment in our other thread (when I get a chance).
Thanks! It’s nice to be learning more about your models.
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”.
(A) seems much more general than what I would call “reward specification failure”.
The way I use “reward specification” is:
If the AI ends up with the goal “get reward” (or something else) rather than “whatever humans want” because that goal better fits the reward data, then it’s a reward specification problem.
If the AI ends up with the goal “get reward” (or something else) rather than “whatever humans want” because it fits the reward data about equally well and it’s the simpler goal given the architecture, it’s NOT a reward specification problem (see the toy sketch below).
(This doesn’t seem to me to fit your description of “B”.)
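To pin down that usage, here’s a toy illustration (everything below is hypothetical, just for illustration): both candidate goals fit the collected reward data equally well, so the reward data alone isn’t what determines which one the AI ends up with.

```python
# Toy "reward data": (situation, reward) pairs collected so far.
# All names are hypothetical, purely for illustration.
reward_data = [
    ({"human_approves": True,  "reward_signal": 1.0}, 1.0),
    ({"human_approves": False, "reward_signal": 0.0}, 0.0),
    ({"human_approves": True,  "reward_signal": 1.0}, 1.0),
]

# Two candidate goals the AI might have internalized.
def goal_get_reward(situation):
    return situation["reward_signal"]

def goal_do_what_humans_want(situation):
    return 1.0 if situation["human_approves"] else 0.0

# Both goals fit every logged data point, because reward and human approval
# always coincided in the data collected so far:
for situation, reward in reward_data:
    assert goal_get_reward(situation) == reward
    assert goal_do_what_humans_want(situation) == reward

# If the AI lands on goal_get_reward because the data actively favored it
# (reward was sometimes handed out against human approval), that's a reward
# specification problem. If the data never distinguished the two and the AI
# picked goal_get_reward because it's the simpler goal given the architecture,
# that's NOT a reward specification problem, on this usage.
```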
I might count the following as a reward specification problem, but maybe not; maybe another name would be better:
The AI mostly gets reward for solving problems which aren’t much about human values specifically, so the AI may mainly learn to value insights that help it solve problems better, rather than learning human values.
(B) seems to me like an overly specific phrasing, and there are many stages where misgeneralization may happen:
when the AI transitions to thinking in goal-directed ways (instead of following more behavioral heuristics or value function estimates)
when the AI starts modelling itself and forms a model of what values it has (where that self-model might not match what is optimized on the object level)
when the AI’s ontology changes and it needs to decide how to rebind value-laden concepts
when the AI encounters philosophical problems like Pascal’s mugging
Section 4 of Jeremy’s and Peter’s report also shows some more ways an AI might fail to learn the intended goal without that being due to reward specification[1], though it doesn’t use your model-based RL frame.
Also, I don’t think (A) and (B) are exhaustive. Other somewhat speculative problems include:
A mesa-optimizer emerges under selection pressure and tries to gain control of the larger AI it is part of while staying undetected. (Sorta like cancer for the mind of the AI.)
A special case of this might be the AI trying to imagine another mind in detail, where the imagined mind notices it is being simulated and tries to take control of the AI.
The AI might make a mistake when designing a more efficient successor AI on a different AI paradigm (especially because humans may pressure it into doing this quickly due to an AI race), so the successor AI ends up with different values.
Other stuff I haven’t thought of right now.
To be clear, there’s no individual point where I think failure is overwhelmingly likely by default, but failure is disjunctive across all these points, so the overall risk is higher than any single one suggests.
if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.
Interesting that you think this.
Having quite good interpretability that we can use to give reward would definitely make me significantly more optimistic.
Though AIs might learn to think thoughts in different formats that don’t trigger negative reward, as e.g. in the “Deep deceptiveness” story.
[1] Aka some inner alignment (aka goal misgeneralization) failure modes, though I don’t know whether I want to use those words, because it’s actually a huge bundle of problems.
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.
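Schematically (hypothetical pseudocode, not any particular system), the contrast I have in mind is something like this:

```python
# Train-then-deploy: (mis)generalization is from a FIXED batch of reward data.
def train_then_deploy(agent, fixed_reward_data, env):
    for batch in fixed_reward_data:   # the reward data is frozen up front
        agent.update(batch)
    while True:                       # deployment: no further reward arrives
        env.step(agent.act(env.observe()))

# Continuous online learning: reward data keeps coming in as the agent acts,
# so ordinary mistakes keep getting corrected -- EXCEPT in the two special
# cases above: irreversible actions, and states the agent deliberately
# avoids exploring (where no corrective reward data ever shows up).
def continuous_online_learning(agent, env, reward_fn):
    while True:
        obs = env.observe()
        action = agent.act(obs)
        outcome = env.step(action)
        agent.update((obs, action, reward_fn(outcome)))  # new reward data, perpetually
```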
I didn’t intend (A) & (B) to be a precise and complete breakdown.
AIs might learn to think thoughts in different formats
Yeah that’s definitely a thing to think about. Human examples might include “compassion fatigue” (shutting people out because it’s too hard to feel for them); or my theory that many people with autism learn to deliberately unconsciously avoid a wide array of innate social reactions from a young age; or choosing to spend more and more time and mental space with imaginary friends, virtual friends, teddy bears, movies, etc., instead of real people. There are various tricks to mitigate these kinds of complications, and they seem to work well enough in human brains. So I think it’s premature to declare that this problem is definitely unsolvable. (And I think the Deep Deceptiveness post is too simplistic; see my comment on it.)
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff.
I don’t really imagine train-then-deploy, but I think that (1) once the AI becomes coherent enough, it will act to prevent further value drift, and (2) the AI eventually needs to solve very hard problems where we won’t have sufficient understanding to judge whether what the AI did is actually good.
(1) Yeah, AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah, I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs. bad that we can rely on all the way to ASI.