Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
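As a toy illustration of the continuous-learning picture above (a minimal sketch, not a claim about how the brain actually implements it), the snippet below keeps nudging a stored prediction toward whatever primary reward actually arrives. The names `update_value`, `learning_rate`, and the (activity, mood) keys are hypothetical, and the simple delta-rule update is an assumption chosen for illustration.

```python
from collections import defaultdict


def update_value(values, context, reward, learning_rate=0.5):
    """Move the stored prediction for `context` toward the reward that actually arrived."""
    prediction_error = reward - values[context]
    values[context] += learning_rate * prediction_error
    return values[context]


# Hypothetical value table keyed by (activity, mood); unseen contexts start at 0.
values = defaultdict(float)

# Years of jazz being enjoyable in an ordinary mood.
for _ in range(20):
    update_value(values, ("jazz", "ordinary mood"), reward=1.0)

# Assumption: the first guess for the depressed mood is copied from the ordinary one.
values[("jazz", "depressed")] = values[("jazz", "ordinary mood")]

# She turns on jazz while depressed and no primary reward arrives.
for _ in range(5):
    update_value(values, ("jazz", "depressed"), reward=0.0)

print(values[("jazz", "ordinary mood")])  # stays close to 1
print(values[("jazz", "depressed")])      # has dropped close to 0
```

In this sketch the prediction is only revised for the context in which the reward failed to arrive, which matches the "at least when she’s in this mood" caveat in the jazz example.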
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire.
I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible, but I’m not sure yet.
Paraphrase of your model of how you might end up pursuing what a fictional character would pursue. (Please correct if wrong.):
1. The fictional character does cool stuff so you start to admire him.
2. You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
3. The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
I still find myself a bit confused:
Getting primary reward merely for thinking of something, rather than for the actual outcome, seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
In particular I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predict high value for that and causal upstream events.
(That said, it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though it’s possible that there’s another hypothesis that explains it well too.)
(Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.
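To make the alternative design mentioned above concrete (primary reward fires only on actually impressing someone, and the value function learns to predict it for causally upstream events), here is a rough sketch using a standard tabular TD(0) update. Everything in it is an illustrative assumption: the three-step chain of states, the function name `td0_chain`, and the learning-rate and discount values. It is not meant as a model of what either design would look like in a brain.

```python
def td0_chain(states, terminal_reward, episodes=200, learning_rate=0.1, discount=0.9):
    """Tabular TD(0) on a fixed chain where only the last state yields primary reward."""
    values = {state: 0.0 for state in states}
    for _ in range(episodes):
        for i, state in enumerate(states):
            is_last = i == len(states) - 1
            # Primary reward arrives only at the final state (actually impressing someone).
            reward = terminal_reward if is_last else 0.0
            next_value = 0.0 if is_last else values[states[i + 1]]
            td_error = reward + discount * next_value - values[state]
            values[state] += learning_rate * td_error
    return values


# Hypothetical causal chain leading up to the rewarded event.
states = ["practice skill", "perform in public", "someone is impressed"]
values = td0_chain(states, terminal_reward=1.0)

for state in states:
    print(f"{state}: {values[state]:.2f}")
```

In this toy version the upstream states do acquire positive value even though they are never directly rewarded, which is the behavior the alternative design relies on. It does not by itself say anything about admired fictional characters, since nobody in the chain is ever actually impressed in that case.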