Towards_Keeperhood comments on Steve Byrnes’s Shortform

Towards_Keeperhood 7 May 2025 20:49 UTC
1 point
0
Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about.
Ok yeah I think you’re probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn’t look like the value function deferring to the expected-utility guess of the world model.
I think it doesn’t change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of “inner alignment is about the value function ending up accurately predicting expected reward”.)
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist.
I agree that ~all the thoughts I think have high enough valence for non-long-term reasons, e.g. valence related to self-image.
But I am NOT asking why I am motivated to work on whatever particular alignment subproblem I decided to work on; I am asking why I decided to work on that rather than something else. And the process that led to that decision is something like “think hard about how to best increase the probability that human-aligned superintelligence is built → … → think that I need to get an even better inside view on how feasible alignment/corrigibility is → plan going through alignment proposals and playing the builder-breaker-game”.
So basically I am thinking about problems like “does doing planA or planB cause a higher expected reduction in my probability of doom”. Where I am perhaps motivated to think that because it’s what my role models would approve of. But the decision of what plan I end up pursuing doesn’t depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives.
It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”. And yet it’s plausible to me that an AI would need to move a good way in the direction of thinking like this keeper to reach pivotal capability.
- Steven Byrnes 8 May 2025 19:25 UTC
  3 points
  0
  Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
  Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
  the impressive long-term optimization happens mainly through expected utility guesses the world model makes
  The candy example involves good long-term planning right? But not explicit guesses of expected utility.
  …But sure, it is possible for somebody’s world-model to have a “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal and when they’re thinking about it).
  But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.
  For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.
  But the decision of what plan I end up pursuing doesn’t depend on the value function.
  The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
  (Sorry if I’m misunderstanding, here or elsewhere.)
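To make the “is”/“ought” split in the comment above concrete, here is a minimal sketch. Everything in it is a hypothetical illustration rather than anything either commenter proposed: the `Plan` class, the `choose_plan` function, the made-up numbers, and the idea of collapsing the value function into a single signed weight. The world-model supplies the estimates; the sign of the valence decides which plan wins.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    # "Is": the world-model's estimate of how much this plan reduces pdoom.
    # The numbers used below are invented purely for illustration.
    expected_pdoom_reduction: float


def choose_plan(plans, valence_of_pdoom_reduction):
    """Pick the plan with the highest valence-weighted score.

    "Ought": valence_of_pdoom_reduction is a stand-in for what the value
    function says about reducing pdoom (positive = good, negative = bad).
    """
    return max(
        plans,
        key=lambda p: valence_of_pdoom_reduction * p.expected_pdoom_reduction,
    )


plans = [Plan("planA", 0.02), Plan("planB", 0.01)]

# Same world-model estimates, opposite valence, opposite choice.
best_if_low_pdoom_is_good = choose_plan(plans, valence_of_pdoom_reduction=+1.0)
best_if_low_pdoom_is_bad = choose_plan(plans, valence_of_pdoom_reduction=-1.0)
print(best_if_low_pdoom_is_good.name, best_if_low_pdoom_is_bad.name)
```

In this toy version the world-model alone never selects a plan: with identical estimates, flipping the sign of the valence flips the choice.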
  - Towards_Keeperhood 8 May 2025 20:31 UTC
    3 points
    0
    The candy example involves good long-term planning right? But not explicit guesses of expected utility.
    (No I wouldn’t say the candy example involves long-term planning—it’s fairly easy and doesn’t take that many steps. It’s true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn’t that coherent.)
    Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
    Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
    Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
    The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way.
    Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it’s possible that the thing that gets optimized ends up different from what was high valence. If that happens, strategic thinking can be used to figure out how to keep valence flowing / how to motivate your brain to continue working on something.
    The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
    Ok actually the way I imagined it, the value function doesn’t evaluate based on abstract concepts like pdoom, but rather the whole reasoning is related to thoughts like “I am thinking like the person I want to be” which have high valence.
    (Though I guess your pdoom evaluation is similar to the “take the expected utility guess from the world model” value function that I originally had in mind. I guess the way I modeled it was maybe more like that there’s a belief like “pdoom=high ⇔ bad” and then the value function is just like “apparently that option is bad, so let’s not do that”, rather than the value function itself assigning low value to high pdoom. (Where the value function previously would’ve needed to learn to trust the good/bad judgement of the world model, though again I think it’s unlikely that it works that way in humans.))
    How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
    - Steven Byrnes 12 May 2025 22:03 UTC
      4 points
      0
      Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
      The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
      You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
      For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
      Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
      How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
      Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
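The continuous-learning point in the comment above (the jazz example) can be sketched as a simple prediction-error update. This is a minimal illustration under stated assumptions, not a claim about how the brain implements it: the tabular `value` dictionary, the `update_value` helper, the learning rate, and the choice to start the depressed-mood entry at the same value as the normal-mood entry are all hypothetical.

```python
def update_value(value, context, primary_reward, learning_rate=0.5):
    """Nudge the predicted value of `context` toward the reward that arrived.

    A plain delta rule (an illustrative assumption): when the value function
    predicted reward and none came, the prediction is edited downward.
    """
    predicted = value.get(context, 0.0)
    prediction_error = primary_reward - predicted
    value[context] = predicted + learning_rate * prediction_error
    return value[context]


# Hypothetical starting point: jazz is predicted to be rewarding in both moods.
value = {
    ("jazz", "normal mood"): 1.0,
    ("jazz", "depressed"): 1.0,
}

# While depressed, she turns on jazz and no primary reward arrives.
for _ in range(5):
    update_value(value, ("jazz", "depressed"), primary_reward=0.0)

print(value)
```

After a few unrewarded attempts the prediction for jazz while depressed has decayed toward zero, while the prediction for jazz in a normal mood is untouched. The update never stops running, which is the contrast with a train-then-deploy picture.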
      - Towards_Keeperhood 14 May 2025 7:46 UTC
        1 point
        0
        Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
        Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire.
        I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible, but I’m not sure yet.
        Paraphrase of your model of how you might end up pursuing what a fictional character would pursue. (Please correct if wrong.):
        1. The fictional character does cool stuff so you start to admire him.
        2. You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
        3. The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
        I still find myself a bit confused:
        Getting primary reward only for thinking of something rather than the actual outcome seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
        In particular I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predict high value for that and causal upstream events.
        (That said, it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though it’s possible that there’s another hypothesis that explains that well too.)
        (Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
        I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.