Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.)
Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function.
Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things. :)
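A minimal toy sketch of that picture (my own illustration; the concept names and numbers are made up): a thought is a bag of concept activations, and the valence guess is just a weighted sum over them.

```python
# Toy illustration of the "big table that roughly-linearly assigns vibes" picture.
# Coefficients as shaped by past primary rewards: candy has felt good,
# standing up while exhausted has felt bad. Numbers are arbitrary.
valence_coefficients = {
    "eating candy": +2.0,
    "standing up": -1.5,   # assume I'm feeling very tired right now
}

def valence(thought: dict[str, float]) -> float:
    """Roughly-linear valence guess: sum of coefficient * concept activation."""
    return sum(valence_coefficients.get(concept, 0.0) * activation
               for concept, activation in thought.items())

# "I'm gonna get that candy and eat it right now", as a bundle of active concepts:
plan = {"eating candy": 1.0, "standing up": 1.0}
print(valence(plan))  # 0.5: positive, so in this toy setup I actually go get the candy
```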
(I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.)
[the value function] only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid, and it is associated with the (self-model-invoking) thought ‘this is bad for accomplishing my goals’, then it lowers its value estimate”.
The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or negative vibes—just like if any other kind of concept is active. And those vibes depend on the correlations of those concepts with past rewards via (something like) TD learning.
So “I will fail at my goals” would be a kind of thought, and TD learning would gradually adjust the value function such that this thought has negative valence. And this thought can co-occur with or be a subset of other thoughts that involve failing at goals, because the Thought Generator is a machine that learns these kinds of correlations and implications, thanks to a different learning algorithm that sculpts it into an ever-more-accurate predictive world-model.
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from which it could have observed results, and the world model’s prediction-making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let’s say the thought generator outputs plans along with predictions about those plans, and some of those predictions concern how well a plan is going to fulfill what it believes the goals are (like an approximate expected utility). Then the value function might learn to just look at the part of a thought that predicts the expected utility, and take that as its value estimate.
Or, perhaps a slightly more concrete version of how that may happen (I’m thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans):
Sometimes the thought generator generates self-reflective thoughts like “what are my goals here”, whereupon the thought generator produces an answer “X”, and then, when thinking about how to accomplish X, it often comes up with a better (according to the value function) plan than if it had tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking “what are my goals here”.
The same can happen with “what are my long-term goals”, where the thought generator might guess something that would cause high reward.
For humans, X is likely more socially nice than would be expected from the value function, since “X are my goals here” is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
Later the thought generator may generate the thought “make careful predictions about whether the plan will actually accomplish the stated goals well”, whereupon the thought generator often finds some inconsistencies that the value function didn’t notice, and produces a better plan. Then the value function learns to assign high valence to thoughts like “make careful predictions about whether the plan will actually accomplish the stated goals well”.
Later, the predictions of the thought generator may not always match the valence the value function assigns, and it may turn out that the thought generator’s predictions were often better. So over time the value function gets updated more and more toward “take the predictions of the thought generator as our valence guess”, since that strategy better predicts later valence guesses. (See the toy sketch below.)
Now, some goals are mainly optimized by the thought generator predicting how those goals could be accomplished well, and there might be beliefs in the thought generator like “studying rationality may make me better at accomplishing my goals”, causing the agent to study rationality.
And also thoughts like “making sure the currently optimized goal keeps being optimized increases the expected utility according to the goal”.
And maybe later, more advanced bootstrapping through thoughts like “understanding how my mind works and exploiting insights to shape it to optimize more effectively would probably help me accomplish my goals”. Though of course, for this to be a viable strategy, the agent would need to be at least as smart as the smartest current humans (which we can assume, because otherwise it’s too useless IMO).
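Here’s the toy sketch of that “take the predictions of the thought generator as our valence guess” step (my own illustration; the feature names and numbers are made up): the thought generator attaches its own expected-utility guess to a plan-thought as just another feature, and a simple error-driven update grows that feature’s weight whenever it predicts later valence better than the other features do.

```python
weights = {"predicted_utility": 0.1, "eating candy": 2.0}  # arbitrary starting point
learning_rate = 0.05

def valence(thought: dict[str, float]) -> float:
    """Linear valence guess over whatever features are active in the thought."""
    return sum(weights.get(k, 0.0) * v for k, v in thought.items())

def update(thought: dict[str, float], later_valence: float) -> None:
    """Nudge weights so the current guess better matches the valence that came later."""
    error = later_valence - valence(thought)
    for k, v in thought.items():
        weights[k] = weights.get(k, 0.0) + learning_rate * error * v

# If the thought generator's utility guesses reliably track how things turn out,
# many updates like this push the "predicted_utility" weight upward:
update({"predicted_utility": 3.0, "eating candy": 0.0}, later_valence=3.0)
print(weights["predicted_utility"])  # ~0.5, up from 0.1
```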
So now the value function is often just relaying world-model judgements and all the actually powerful optimization happens in the thought generator. So I would not classify that as the following:
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world… and the end result is that the system makes and executes a plan to kill us.
So in my story, the thought generator learns to model the self-agent and has some beliefs about what goals it may have, and some coherent extrapolation of (some of) those goals is what gets optimized in the end. I guess it’s probably not that likely that those goals are strongly misaligned with the value function on the distribution where the value function can evaluate plans, but there are many possible ways to generalize the values of the value function. For humans, I think the way this generalization happens is itself value-laden (i.e. what human values are depends on this generalization). The values might generalize a bit differently for different humans of course, but it’s plausible that humans share a lot of their prior-that-determines-generalization, so AIs with a different brain architecture might generalize very differently.
Basically, whenever someone thinks “what’s actually my goal here”, I would say that’s already a slight departure from “using one’s model-based RL capabilities in the way we normally expect”. I would agree that for most humans such departures are rare and small, but I think they get a lot larger for smart, reflective people, and I wouldn’t describe my own brain as “using one’s model-based RL capabilities in the way we normally expect”. I’m not at all sure about this, but I would expect that “using its model-based RL capabilities in the way we normally expect” won’t get us to a pivotal level of capability if the value function is primitive. (That’s if I just trust my model of your model here; I might be misrepresenting your model, and would need to reread your posts.)
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from which it could have observed results, and the world model’s prediction-making capabilities generalize further).
Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence.
Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No way am I going to go to all that effort if the concept of eating candy at the end is not present in my mind.
Anyway, if I actually expect that such-and-such plan will lead to me eating candy with near-certainty in the immediate future, then the “me eating candy” concept will be strongly active when I think about the plan; conversely, if I don’t actually expect it to work, or expect it to take 6 hours, then the “me eating candy” concept will be more weakly active. (See image here.)
Meanwhile, the value function is figuring out if this is a good plan or not. But it doesn’t need to assess plausibility—the thought generator already did that. Instead, it’s much simpler: the value function has a positive coefficient on the “me eating candy” concept, because that concept has reliably predicted primary rewards in the past.
So if we combine the value function (linear functional with a big positive coefficient relating “me eating candy” concept activation to the resulting valence-guess) with the thought generator (strong activation of “me eating candy” when I’m actually expecting it to happen, especially soon), then we’re done! We automatically get plausible and immediate candy-eating plans getting a lot of valence / motivational force, while implausible, distant, and abstract candy-eating plans don’t feel so motivating.
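As a toy rendering of that division of labor (my own illustration; the coefficient and the plausibility/immediacy numbers are made up): the thought generator decides how strongly “me eating candy” is active, and the value function just multiplies that activation by a fixed positive coefficient learned from past rewards.

```python
CANDY_COEFFICIENT = 2.0  # learned from past primary rewards; number is arbitrary

def candy_activation(prob_success: float, hours_until_candy: float) -> float:
    """Thought-generator side: a plan I actually expect to work soon makes
    'me eating candy' strongly active; a shaky or distant plan, only weakly."""
    return prob_success / (1.0 + hours_until_candy)

def plan_valence(prob_success: float, hours_until_candy: float) -> float:
    """Value-function side: just a positive coefficient on that activation."""
    return CANDY_COEFFICIENT * candy_activation(prob_success, hours_until_candy)

print(plan_valence(0.95, 0.1))  # plausible and immediate: strong pull (~1.7)
print(plan_valence(0.30, 6.0))  # shaky and hours away: weak pull (~0.09)
```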
Does that help? (I started writing a response to the rest of what you wrote, but maybe it’s better if I pause there and see what you think.)
Yeah, I think the parts of my comment where I treated the value function as making predictions about how well a plan works were pretty confused. I agree it’s a better framing that plans proposed by the thought generator include predicted outcomes and the value function evaluates those. (Maybe I previously imagined the thought generator more like proposing actions, idk.)
So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.
Let’s look at how an agent might accomplish a very difficult goal, where the agent hasn’t accomplished similar goals yet, so the value function doesn’t already assign high valence to subgoals:
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
It’s easy to implement a fix, e.g.: Save an expected utility guess (aka instrumental value) for each subgoal, and then the value function can assign valence according to the expected utility guess. So in this case I might have a thought like “apply the ‘clarify goal’ strategy to make progress towards the subgoal ‘evaluate whether training for corrigibility might work to safely perform a pivotal act’, which has expected utility X”.
So the way I imagine it here, the value function would need to take the expected utility guess X and output a value roughly proportional to X, so that enough valence is supplied to keep the brainstorming going. I think the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward.
The expected utility could be calculated by having the world model see what value (aka expected reward/utility) the value function assigns to the end goal, and then backpropagating expected utility estimates to subgoals based on how likely, and with what resources, the larger goal could be accomplished given the smaller one.
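Here’s a rough sketch of that shortcut (my own toy formalization, not an established mechanism; the subgoal names and probabilities are made up): the world model reads off the value attached to the end goal, then walks backwards down the subgoal chain, discounting by how likely each subgoal is to lead on to the next one, so the value function only needs to read the cached number attached to whichever subgoal is currently in mind.

```python
def backpropagate_expected_utility(endgoal_value: float,
                                   chain: list[tuple[str, float]]) -> dict[str, float]:
    """chain: (subgoal, P(the next-larger goal is reached | this subgoal is reached)),
    ordered from the subgoal closest to the end goal down to the one worked on first."""
    expected_utility: dict[str, float] = {}
    running = endgoal_value
    for subgoal, p_leads_on in chain:
        running *= p_leads_on
        expected_utility[subgoal] = running
    return expected_utility

chain = [
    ("evaluate whether training for corrigibility might work", 0.4),
    ("apply the 'clarify goal' strategy to this subproblem", 0.7),
]
print(backpropagate_expected_utility(endgoal_value=100.0, chain=chain))
# {'evaluate whether ...': 40.0, "apply the 'clarify goal' ...": 28.0}
```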
However, the value function is stupid and often not very coherent, at least by the world model’s standards of consistency: e.g. the valence of the outcome “1000 lives get saved” isn’t 1000x higher than that of “1 life gets saved”.
So the world model’s expected utility estimates come apart from the value function’s estimates. And it seems to me that for very smart and reflective people, which difficult goals they achieve depends more on their world model’s expected utility guesses than on their value function’s estimates. So I wouldn’t call it “the agent works as we expect model-based RL agents to work”. (And I expect this kind of “the world model assigns expected utility guesses” may be necessary to get to pivotal capability if the value function is simple, though I’m not sure.)
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
We can have hierarchical concepts. So you can think “I’m following the instructions” in the moment, instead of explicitly thinking “I’m gonna do Step 1 then Step 2 then Step 3 then Step 4 then …”. But they cash out as the same thing.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist. On a small scale, consequentialist motivations are pretty normal (e.g. walking up the stairs to get your sweater because you’re cold). But long-term-consequentialist actions and motivations are rare in the human world.
Normally people do things because they’re socially regarded as good things to do, not because they have good long-term consequences. Like, if you see someone save money to buy a car, a decent guess is that the whole chain of actions, every step of it, is something that they see as socially desirable. So during the first part, where they’re saving money but haven’t yet bought the car, they’d be proud to tell their friends and role models “I’m saving money—y’know I’m gonna buy a car!”. Saving the money is not a cost with a later benefit. Rather, the benefit is immediate. They don’t even need to be explicitly thinking about the social aspects, I think; once the association is there, just doing the thing feels intrinsically motivating—a primary reward, not a means to an end.
Doing the first step of a long-term plan, without social approval for that first step, is so rare that people generally regard it as highly suspicious. Just look at Earning To Give (EtG) in Effective Altruism, the idea of getting a high-paying job in order to have money and give it to charity. Go tell a normal non-quantitative person about EtG and they’ll assume it’s an obvious lie, and/or that the person is a psycho. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-weird plan because of its expected long-term consequences, unless the person is Machiavellian or something.
Speaking of which, there’s a fiction trope that basically only villains are allowed to make plans and display intelligence. The way to write a hero in (non-rationalist) fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and the former wins out over the latter.
To be clear, I’m not accusing you of failing to do things with good long-term consequences because they have good long-term consequences. Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about. So then you get the immediate intrinsic motivation by doing that kind of work, and yet it’s also true that you’re sincerely working towards consequences that are (hopefully) good. And then some more narrow projects towards that end can also wind up feeling socially good (and hence become intrinsically rewarding, even without explicitly holding their long-term consequences in mind), etc.
the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward
I don’t think this is necessary per above, but I also don’t think it’s realistic. The value function updating rule is something like TD learning, a simple equation / mechanism, not an intelligent force with foresight. (Or sorry if I’m misunderstanding. I didn’t really follow this part or the rest of your comment :( But I can try again if it’s important.)
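For concreteness, the kind of “simple equation / mechanism” I have in mind is no more than something like a textbook TD(0) update (a generic sketch, not a claim about the brain’s exact rule). It is purely local and backwards-looking; there is no foresight anywhere in it.

```python
def td0_update(values: dict[str, float], state: str, reward: float,
               next_state: str, alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Nudge the value guess for `state` toward reward + discounted next-state guess."""
    td_error = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error

# Example: a thought that predicted candy, followed by actually getting candy.
v = {"about to eat candy": 0.5, "eating candy now": 0.0}
td0_update(v, state="about to eat candy", reward=1.0, next_state="eating candy now")
print(v["about to eat candy"])  # 0.55: nudged up, with no lookahead involved
```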
Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about.
Ok yeah, I think you’re probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn’t look like the value function deferring to the expected utility guesses of the world model.
I think it doesn’t change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of “inner alignment is about the value function ending up accurately predicting expected reward”.)
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist.
I agree that ~all thoughts I think have high enough valence for non-long-term reasons, e.g. self-image-related valence.
But I do NOT mean the reason why I am motivated to work on whatever particular alignment subproblem I decided to work on; I mean why I decided to work on that rather than something else. And the process that led to that decision is something like “think hard about how to best increase the probability that human-aligned superintelligence is built → … → think that I need to get an even better inside view on how feasible alignment/corrigibility is → plan going through alignment proposals and playing the builder-breaker game”.
So basically I am thinking about problems like “does doing planA or planB cause a higher expected reduction in my probability of doom”. Where I am perhaps motivated to think that because it’s what my role models would approve of. But the decision of what plan I end up pursuing doesn’t depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives.
It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”. And yet it’s plausible to me that an AI would need to move a good chunk in the direction of thinking like this keeper to reach pivotal capability.
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to him—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
the impressive long-term optimization happens mainly through expected utility guesses the world model makes
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
…But sure, it is possible for somebody’s world-model to have an “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal, and when they’re thinking about it).
But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.
For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.
But the decision of what plan I end up pursuing doesn’t depend on the value function.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
(Sorry if I’m misunderstanding, here or elsewhere.)
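To make that division of labor concrete, here’s a toy rendering (my own gloss, with made-up numbers): the world model supplies the factual estimates of how much each plan reduces pdoom, and the value side contributes only the learned judgment that reducing pdoom is good.

```python
# "Is" side: world-model estimates of how much each plan reduces pdoom.
world_model_estimates = {"planA": 0.010, "planB": 0.002}

# "Ought" side: one learned coefficient saying that reducing pdoom is good.
PDOOM_REDUCTION_COEFFICIENT = +50.0

def plan_valence(plan: str) -> float:
    return PDOOM_REDUCTION_COEFFICIENT * world_model_estimates[plan]

print(max(world_model_estimates, key=plan_valence))  # "planA"
# The choice tracks the world model's facts, but only because the value side
# fixes which direction counts as good.
```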
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
(No I wouldn’t say the candy example involves long-term planning—it’s fairly easy and doesn’t take that many steps. It’s true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn’t that coherent.)
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to him—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way.
Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it’s possible that the thing that gets optimized ends up different from what had high valence. If that happens, strategic thinking can be used to figure out how to keep valence flowing / how to motivate your brain to continue working on something.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
Ok, actually the way I imagined it, the value function doesn’t evaluate based on abstract concepts like pdoom; rather, the whole reasoning is related to thoughts like “I am thinking like the person I want to be”, which have high valence.
(Though I guess your pdoom evaluation is similar to the “take the expected utility guess from the world model” value function that I originally had in mind. I guess the way I modeled it was maybe more like there’s a belief like “pdoom=high ⇔ bad”, and then the value function is just like “apparently that option is bad, so let’s not do that”, rather than the value function itself assigning low value to high pdoom. (Where the value function previously would’ve needed to learn to trust the good/bad judgement of the world model, though again I think it’s unlikely that it works that way in humans.))
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
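To spell out the jazz example in the same toy linear-value-function terms as before (made-up numbers; the point is just that there is no frozen “deployed” value function):

```python
weights = {"listening to jazz": 1.0}   # a long history of jazz feeling good
alpha = 0.3

def experience(concept: str, actual_primary_reward: float) -> None:
    """Online correction: if the predicted primary reward doesn't arrive, the weight drops."""
    predicted = weights.get(concept, 0.0)
    weights[concept] = predicted + alpha * (actual_primary_reward - predicted)

# While depressed, she turns the music on but the primary reward never arrives:
for _ in range(5):
    experience("listening to jazz", actual_primary_reward=0.0)
print(weights["listening to jazz"])  # ~0.17: turning it on no longer feels worth it
```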
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire.
I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible but not sure yet.
Paraphrase of your model of how you might end up pursuing what a fictional character would pursue (please correct me if wrong):
The fictional character does cool stuff so you start to admire him.
You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
I still find myself a bit confused:
Getting primary reward only for thinking of something, rather than for the actual outcome, seems weird to me. I guess thoughts are also constrained by world-model consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
In particular, I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predicts high value for that and for causally upstream events.
(That said it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though possible that there’s another hypothesis that explains that well too.)
(Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.