Suppose you’re training a huge neural network with some awesome future RL algorithm with clever exploration bonuses and a self-supervised pretrained multimodal initialization and a recurrent state. This NN implements an embodied agent which takes actions in reality (and also in some sim environments). You watch the agent remotely using a webcam (initially unbeknownst to the agent). When the AI’s activities make you smile, you press the antecedent-computation-reinforcer button (known to some as the “reward” button). The agent is given some appropriate curriculum, like population-based self-play, so as to provide a steady skill requirement against which its intelligence is sharpened over training. Supposing the curriculum trains these agents out until they’re generally intelligent—what comes next?
The standard response is “One or more of the agents gets smart, does a treacherous turn, kills you, and presses the reward button forever.”
But reward is not the optimization target. This story isn’t impossible, but I think it’s pretty improbable, and definitely not a slam-dunk.
Another response is “The AI paralyzes your face into smiling.”
But this is actually a highly nontrivial claim about the internal balance of value and computation which this reinforcement schedule carves into the AI. Insofar as this response implies that an AI will primarily “care about” literally making you smile, that seems like a highly speculative and unsupported claim about the AI internalizing a single powerful decision-relevant criterion / shard of value, which also happens to be related to the way that humans conceive of the situation (i.e. someone is being made to smile).
My current answer is “I don’t know precisely what goes wrong, but probably something does, but also I suspect I could write down mechanistically plausible-to-me stories where things end up bad but not horrible.” I think the AI will very probably have a spread of situationally-activated computations which steer its actions towards historical reward-correlates (e.g. if near a person, then tell a joke), and probably not singularly value e.g. making people smile or reward. Furthermore, I think its values won’t all map on to the “usual” quantities-of-value:
80% credence: It’s very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).
So, I’m pretty uncertain about what happens here, but would guess that most other researchers are less uncertain than I am. So here’s an opportunity for us to talk it out!
(My mood here isn’t “And this is what we do for alignment, let’s relax.” My mood is “Why consider super-complicated reward and feedback schemes when, as far as I can tell, we don’t know what’s going to happen in this relatively simple scheme? How do reinforcement schedules map into inner values?”)
I think the main concept missing here is compression: trained systems favor more compact policies/models/heuristics/algorithms/etc. The fewer parameters needed to implement the inner agent, the more parameters are free to vary, and therefore the more parameter-space-volume the agent takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
The outer training loop doesn’t just select for high reward, it also implicitly selects for compactness. We expect it to find, not just policies which achieve high reward, but policies which are very compactly represented.
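To make the parameter-space-volume argument concrete, here is a toy sketch (an illustration only, not a claim about real networks). It assumes a hypothetical model with 8 binary parameters, where a policy is "implemented" whenever certain parameters take fixed values and all other parameters are free to vary. The function name `volume_fraction` and both example policies are assumptions made up for this sketch.

```python
from itertools import product


def volume_fraction(n_params: int, constrained: dict[int, int]) -> float:
    """Fraction of all binary parameter settings that agree with `constrained`.

    `constrained` maps a parameter index to the value that parameter must take
    for the (hypothetical) policy to be implemented; all others are free.
    """
    hits = 0
    total = 0
    for setting in product((0, 1), repeat=n_params):
        total += 1
        if all(setting[i] == v for i, v in constrained.items()):
            hits += 1
    return hits / total


# A compact policy pins 2 of the 8 parameters; a bulkier one pins 5.
compact = volume_fraction(8, {0: 1, 1: 0})
bulky = volume_fraction(8, {0: 1, 1: 0, 2: 1, 3: 1, 4: 0})

# How many times more parameter-space volume the compact policy occupies.
ratio = compact / bulky
```

Under these assumptions, each extra pinned parameter halves the volume, so a sampler that picks parameter settings uniformly finds the compact policy more often. Whether gradient descent on continuous weights behaves like uniform sampling is a separate question, and one that later replies in this thread dispute.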
Compression is the main reason we expect inner search processes to appear. Here’s the relevant argument from Risks From Learned Optimization:
The same argument applies to the terminal objectives/heuristics/proxies instilled in an RL-trained system: it may not terminally value the reward button being pushed or the human smiling or whatever, but its values should be generated from a relatively small, relatively simple set of things. For instance, a plausible Fermi estimate for humans is that our values are ultimately generated from ~tens of simple proxies. (And I would guess that modern ML training would probably result in even fewer, relative to human evolution.)
Furthermore, whatever terminal values are instilled in the RL-trained system, they do need to at least induce near-perfect optimization of the feedback signal on the training set; otherwise the outer training loop would select some other parameters. The outer training loop is still an optimization process, after all, so whatever policy the trained system ends up with should still be roughly-optimal. (There’s some potential wiggle room here insofar as the AI which takes off will be the first one to pass the threshold, and that may happen during a training run before convergence, but I think that’s probably not central to discussion here?)
Putting that all together: we don’t know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you’re literally training an AI just on smiling, TurnTrout is right that “a spread of situationally-activated computations” is more likely since you’re not getting any value from the generality of search.
The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don’t know of explicit evidence for the conjecture that “explicit search + simple objectives” is actually lower complexity (in model space) than “bundle of heuristics”. Seems intuitive if model complexity is something close to Kolmogorov complexity, but would love to see an empirical investigation!
Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn’t select for inner search processes, because it’s not actually biased towards low Kolmogorov complexity. More in my standalone post, and here’s a John Maxwell comment making a similar point.
Not sure if I disagree with the object-level assertion, but I think some important caveats are missing here. We also have to take into account the plausible paths through algorithm-space that SGD is likely to take, and that might change the form of the final compressed policy in non-intuitive ways.
Another compact policy is “a superintelligence with a messy slew of values that figured out the training context and maneuvered the SGD around to learn the reward function without internalizing it + compress itself while keeping its messy values static”, and I think it’s a probable-enough end-point.
It’s still likely that the “messy slew of values” won’t be that messy and will be near-perfect correlates for the reward, but given some (environment structure, reward) pairs, neither may be true. E. g., if the setup is such that strategic intelligence somehow develops well before the AI achieves optimal performance on the training set, then that intelligence will set in stone proxy objectives that aren’t good correlates of the reward.
Seems like you can have a yet-simpler policy by factoring the fixed “simple objective(s)” into implicit, modular elements that compress many different objectives that may be useful across many different environments. Then at runtime, you feed the environmental state into your factored representation of possible objectives and produce a mix of objectives tailored to your current environment, which steer towards behaviors that achieved high reward on training runs similar to the current environment.
That would seem quite close to “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, and it seems pretty similar to how my own values / goals arise in an environmentally-dependent manner without me having access to any explicitly represented “simple objective(s)” that I retain across environments.
That seems like a semantic difference? We may just as well call these modular elements the “objectives”, with them having different environment-specific local implementations.
E. g., if my goal is “winning”, it would unfold into different short-term objectives depending on whether I’m playing chess or football, but we can still meaningfully call it a “goal”.
I’m confident that this is not a semantic difference. The modular elements I was describing represent a process for determining one’s objectives, depending on the environment and your current beliefs. It would be a type error to call them “objectives”, just as it would be a type error to call a search process your “plans”. They each represent compressions of possible objectives / plans, but are not those things themselves.
Similarly, it would be incorrect to call a GPT model a “collection of sentences”, even though they are essentially compressions over many possible sentences.
Okay, suppose we feed many environment-states into some factored representation of possible objectives, and generate a lot of (environment, objectives) mappings for a given agent. In your model, is it possible to summarize these results somehow; is it possible to say something general about what the agent is trying to do in all of these environments? (E. g., like my football & chess example.)
Yes, it’s possible to do summary statistics on the outputted goals, just like you can do summary statistics on the outputs of GPT-3, or on the plans produced by a given search algorithm. That doesn’t make generators of these things have the same type signature as the things themselves.
My counterpoint to John is specifically about the sort of computational structures that can represent goals, while being both simple AND environment/belief-dependent. I’m saying simplicity does not push against representing goals in an environment-dependent way, because your generator of goals can be conditioned on the environment.
How “meaningful” would that summary be? Does my “winning at chess vs football” analogy fit what you’re describing, with “winning” being the compressed objective-generator and the actual win conditions of chess/football being the environment-specific objectives?
My point is that you can have “goals” (things your search process steers the world towards) and “generators of goals”. These are different things, and you should not use the same name for them.
More specifically, there is a difference in the computational type signature between generators and the things they generate. You can call these two things by whatever label you like, but they are not the same thing.
You can look at a person’s plans / behavior in many different games and conclude that it demonstrates a common thread which you might label “winning”. But you should not call the latent cognitive generators responsible for this common thread by the same name you use for the world states the person’s search process steers towards in different environments.
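The type-signature distinction can be written down directly. The following is a minimal sketch, assuming made-up stand-in types; none of the names (`Objective`, `ObjectiveGenerator`, `winning`, `toy_search`) come from the discussion, and the lookup table inside `winning` is only a placeholder for whatever computation a real generator would perform.

```python
from typing import Callable

# Hypothetical stand-in types, chosen only to make the signatures visible.
Environment = str
Objective = str        # something a search process steers the world towards
Plan = list[str]       # a sequence of actions

# A generator of objectives is a function FROM environments TO objectives.
ObjectiveGenerator = Callable[[Environment], Objective]
# A search process is a function FROM (environment, objective) TO plans.
Search = Callable[[Environment, Objective], Plan]


def winning(env: Environment) -> Objective:
    """Toy generator: maps each environment to that environment's objective.

    The dict is a placeholder; a real generator would compute its output.
    """
    table = {
        "chess": "checkmate the opponent's king",
        "football": "score more goals than the other team",
    }
    return table[env]


def toy_search(env: Environment, objective: Objective) -> Plan:
    """Toy search: returns a one-step 'plan' aimed at the given objective."""
    return [f"in {env}: take a step towards '{objective}'"]


# The generator is callable; the objectives it produces are plain values.
objectives = {env: winning(env) for env in ("chess", "football")}
plan = toy_search("chess", winning("chess"))
```

In this sketch `winning` has type `ObjectiveGenerator` while `objectives["chess"]` has type `Objective`. Summary statistics over `objectives` describe the generator's outputs, but passing `winning` itself where an `Objective` is expected would be a type error.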
Alright, then it is a semantics debate from my perspective. I don’t think we’re actually disagreeing, now. Your “objective-generators” cleanly map to my “goals”, and your “objectives” to my “local implementations of goals” (or maybe “values” and “local interpretations of values”). That distinction definitely makes sense at the ground level. In my ontology, it’s a distinction between what you want and what achieving it looks like in a given situation.
I think it makes more sense to describe it my way, though, since I suspect a continuum of ever-more-specific/local objectives (“winning” as an environment-independent goal, “winning” in this type of game, “winning” against the specific opponent you have, “winning” given this game and opponent and the tactic they’re using), rather than a dichotomy of “objective-generator” vs “objective”, but that’s a finer point.
Although, digging into the previously-mentioned finer points, I think there is room for some meaningful disagreement.
I don’t think there are goal-generators as you describe them. I think there are just goals, and then some plan-making/search mechanism which does goal translation/adaptation/interpretation for any given environment the agent is in. I. e., the “goal generators” are separate pieces from the “ur-goals” they take as input.
And as I’d suggested, there’s a continuum of ever-more specific objectives. In this view, I think the line between “goals” and “plans” blurs, even, so that the most specific “objectives” are just “plans”. In this case, the “goal generator” is just the generic plan-making process working in a particular goal-interpreting regime.
(Edited-in example: “I want to be a winner” → “I want to win at chess” → “I want to win this game of chess” → “I want to decisively progress towards winning in this turn” → “I want to make this specific move”. The early steps here are clear examples of goal-generation/translation (what does winning mean in chess?), the latter clear examples of problem-solving (how do I do well this turn?), but they’re just extreme ends of a continuum.)
The initial goal-representations from which that process starts could be many things — mathematically-precise environment-independent utility functions, or goals defined over some default environment (as I suspect is the case with humans), or even step-one objective-generators, as you’re suggesting. But the initial representation being an objective-generator itself seems like a weirdly special case, not how this process works in general.
Can you explain why this policy is yet-simpler? It sounds more complicated to me.
I’m saying that it’s simpler to have a goal generator that can be conditioned on the current environment, rather than memorizing each goal individually.
That sure does sound like a description of a search algorithm, right there.
I’m not objecting to your assertion that some sort of search takes place. I’m objecting to your characterization of what sorts of objectives the search ends up pointed towards. Basically, I’m saying that “situationally activated heuristics that steer towards environment-dependent goals” is totally in-line with a simplicity prior over cognitive structures leading to a search-like process.
The whole reason you say that we should expect search processes is because they can compress many different environment- and belief-dependent plans into a simpler generator of such plans (the search), which takes in environment info, beliefs, and the agent’s simple, supposedly environment-independent, objectives, and produces a plan. So, the agent only needs to store the search process and its environment-independent objectives.
I’m saying you can apply a similar “compress into an environment / beliefs conditioned generator” trick to the objectives as well, and get a generator of objectives that condition on the environment and current beliefs to produce objectives for the search process.
Thus, objectives remain environment-dependent, and will probably steer towards world states that resemble those which were rewarded during training. I think this is quite similar to “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, if involving rather more sophisticated cognition than phrases like “contextually activated heuristics” often imply.
Here’s what I think you mean by an explicit search process:
In every situation, the neural network runs e.g. MCTS with a fixed leaf evaluation function (the simple objective).
On this understanding of your argument, I would be surprised if it went through. Here are a few quick counterpoints.
Outside tiny maze environments, constantly running search with a fixed objective is downright stupid: you’re going to constantly time out. Anytime guarantees won’t necessarily save you; they’ll probably be weak or nonexistent. Constantly running search will consistently waste computation time which could have been saved by caching computations and then thinking about other things during the rest of the forward pass (aka shards). Fixed-depth neural networks also have a speed prior.
(See also the independently written Gradient descent doesn’t select for inner search)
EDIT: Reading your reply comment on that post
Memoization and heuristics would definitely count as part of a “spread” of contextually activated computations? Are we even disagreeing?
Humans are the one example we have of general intelligences; they surely have different e.g. inductive biases than ML, and that’s damn important. But even so, humans do not search in every situation in order to optimize a simple objective. Seems like an important hint.
More generally: “If your theory of alignment and/or intelligence is correct, why doesn’t it explain the one datapoint we have on general intelligence?”
Any “simplicity prior” that ANNs have is not like the simplicity prior of a programming language. A single forward pass is acyclic, so loops / recursion are impossible. If NN layers were expressed as programs, the language in question would also have to be acyclic, which would make “search” quite a dumb thing to do anyways.
EDIT: Although in OP I did presume a recurrent state! Still important to keep in mind as we consider different architectures, though.
Initial contextually-activated-heuristics might (low-confidence) starve gradients towards search.
Do you mean “hardcoded reward circuit” by “proxy”?
I’m not that committed to the RL frame, but roughly speaking yes. Whatever values we have are probably generated by ~tens of hardcoded things. Anyway, on to the meat of the discussion...
It seems like a whole bunch of people are completely thrown off by use of the word “search”. So let’s taboo that and talk about what’s actually relevant here.
We should expect compression, and we should expect general-purpose problem solving (i.e. the ability to take a fairly arbitrary problem in the training environment and solve it reasonably well). The general-purpose part comes from a combination of (a) variation in what the system needs to do to achieve good performance in training, and (b) the recursive nature of problem solving, i.e. solving one problem involves solving a wide variety of subproblems. Compactness means that it probably won’t be a whole boatload of case-specific heuristics; lookup tables are not compact. A subroutine for reasonably-general planning or problem-solving (i.e. take a problem statement, figure out a plan or solution) is the key thing we’re talking about here. Possibly a small number of such subroutines for a few different problem-classes, but not a large number of such subroutines, because compactness. My guess would be basically just one.
That probably will not look like babble and prune. It may look like a general-purpose heuristic-generator (like e.g. relaxation-based heuristic generation). Or it may look like general-purpose efficiency tricks, like caching solutions to common subproblems. Or it may look like hardcoded heuristics which are environment-specific but reasonably goal-agnostic (like e.g. the sort of thing in Mazes and Duality yields a maze-specific heuristic, but one which applies to a wide variety of path-finding problems within that maze). Or it may look like hardcoded strategies for achieving instrumentally convergent goals in the training environment (really this is another frame on caching solutions to common subproblems). Or it may look like learning instrumentally convergent concepts and heuristics from the training environment (i.e. natural abstractions; really this is another frame on environment-specific but goal-agnostic heuristics). Probably it’s a combination of all of those, and others too.
The important point is that it’s a problem-solving subroutine which is goal-agnostic (though possibly environment-specific). Pass in a goal, it figures out how to achieve that goal. And we do see this with humans: you can give humans pretty arbitrary goals, pretty arbitrary jobs to do, pretty arbitrary problems to solve, and they’ll go figure out how to do it.
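As a minimal sketch of what "goal-agnostic but environment-specific" could mean, assume a made-up environment in which states are integers and the allowed moves from n are n + 1 and 2 * n (capped at 50). The solver below is plain breadth-first search, picked only because it is short; it is not meant to suggest what a trained network would actually implement. The names `neighbors` and `solve` are assumptions of this sketch.

```python
from collections import deque
from typing import Callable, Iterable, Optional

State = int


def neighbors(state: State) -> Iterable[State]:
    """Hypothetical environment: from n you may move to n + 1 or 2 * n, up to 50."""
    return [s for s in (state + 1, state * 2) if s <= 50]


def solve(start: State, is_goal: Callable[[State], bool]) -> Optional[list[State]]:
    """Breadth-first search from `start` to the first state satisfying `is_goal`.

    The environment (`neighbors`) is baked in; the goal is passed in as an argument.
    Returns a shortest path as a list of states, or None if no goal is reachable.
    """
    frontier = deque([start])
    parent: dict[State, Optional[State]] = {start: None}
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            current: Optional[State] = state
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


# The same subroutine, handed two unrelated goals.
path_to_ten = solve(1, lambda s: s == 10)
path_to_multiple_of_seven = solve(1, lambda s: s % 7 == 0)
```

Nothing in `solve` mentions either goal, which is the sense of "pass in a goal, it figures out how to achieve that goal". Swapping in a different `neighbors` function would be the analogue of moving to a different environment.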
I agree that AGI will need general purpose problem solving routines (by definition). I also agree that this requires something like recursive decomposition of problems into subproblems. I’m just very skeptical that the kinds of neural nets we’re training right now can learn to do anything remotely like that; I think it’s much more likely that people will hard-code this type of reasoning into the compute graph with stuff like MCTS. This has already been pretty useful for e.g. MuZero. Once we’re hard-coding search it’s less scary because it’s more interpretable and we can see exactly where the mesaobjective is.
I also don’t really buy the compactness argument at all. I think neural nets are biased toward flat minima / broad basins but these don’t generally correspond to “simple” functions in the Kolmogorov sense; they’re more like equivalence classes of diverse bundles of heuristics that all get about the same train and val loss. I’m interpreting this paper as providing some evidence in that direction.
I hope that you’re right; that would make Retargeting The Search very easy, and basically eliminates the inner alignment problem. Assuming, of course, that we can somehow confidently rule out the rest of the net doing any search in more subtle ways.
This seems like roughly what I had in mind by “contextually activated computations” (probably with a few differences about when/how the subroutines will be goal-agnostic). I was imagining computations like “contextually activated cached death-avoidance policy influences” and “contextually activated steering of plans towards paperclip production, in generalizations of the historical reinforcement contexts for paperclip-reward.”
I agree. My recent write-up is partly an attempt to model this dynamic in a toy causal-graph environment. Most relevantly, this section.
Imagine an environment represented as a causal graph, with some action-nodes a an agent can set, observation-nodes whose values that agent can read off, and some reward-node r whose value determines how much reinforcement the agent gets. The agent starts with no information about the environment structure or the environment state. If the reward-node is sufficiently distant from its action-nodes, it’ll take time for the agent’s world-model to become advanced enough to model it. However, the agent would start trying to develop good policies/heuristics for increasing the reward immediately. Thus, its initial policies will necessarily act on proxies: it’ll be focusing on the values of some intermediate nodes between its action-nodes and the reward-node.
And these proxies can be quite good. For example:
x_p is a good proxy for controlling the value of r if the x_k, x_e chain doesn’t perturb it too much. So an agent that only cares about the environment up to x_p can capture e.g. X% > 90% of the possible maximum reward.
It feels like it shouldn’t matter: that once the world-model is advanced enough to include r directly, the agent should just recognize r as the source of reinforcement, and optimize it directly.
But suppose the heuristics the agent develops have “friction”. That is: once a heuristic has historically performed well enough, the agent is reluctant to replace it with a better but more novel (and therefore untested) one. Or, at least, less willing the less counterfactual reward it promises to deliver. So a heuristic that performs 10x as well as the one it currently has will be able to win against a much older one, but a novel heuristic that only performs 1.1x as well won’t be.
In this case, the marginally more effective policy will not be able to displace a more established one.
(An alternate view: suppose that the agent has two mutually-exclusive heuristics on what to do in a given situation, A and B. A has a good track record, B is a new one, but it’s willing to try B out. Suppose it picks A with probability p and B with 1−p, with p proportional to how long A’s track record is. If the reinforcement B receives is much larger than the reinforcement A receives, then even a rarely-picked B will eventually outpace A. If it’s not much larger, however, then A will be able to “keep up” with B by virtue of being picked more often, and eventually outrace B into irrelevancy.)
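The A-versus-B picture can be turned into rough arithmetic. The sketch below makes two simplifying assumptions that are not in the comment: p is held fixed instead of growing with A's track record, and each heuristic's "strength" is just the expected reinforcement it has accumulated. Under those assumptions B gains on A only when reward_b / reward_a exceeds p / (1 − p). The function name `expected_strengths` is hypothetical.

```python
def expected_strengths(
    p_a: float, reward_a: float, reward_b: float, steps: int
) -> tuple[float, float]:
    """Expected accumulated reinforcement for heuristics A and B.

    Each step, A is picked with probability p_a (earning reward_a) and
    B is picked with probability 1 - p_a (earning reward_b).
    Assumption of this sketch: p_a stays constant over time.
    """
    strength_a = 0.0
    strength_b = 0.0
    for _ in range(steps):
        strength_a += p_a * reward_a
        strength_b += (1.0 - p_a) * reward_b
    return strength_a, strength_b


# With p = 0.9 the break-even reward ratio is 0.9 / 0.1 = 9.
a_vs_big, big_b = expected_strengths(0.9, 1.0, 10.0, steps=100)     # B pays 10x
a_vs_small, small_b = expected_strengths(0.9, 1.0, 1.1, steps=100)  # B pays 1.1x

big_b_wins = big_b > a_vs_big
small_b_wins = small_b > a_vs_small
```

With these numbers the 10x heuristic overtakes A while the 1.1x heuristic falls behind, matching the 10x versus 1.1x contrast drawn above. Letting p rise as A's record lengthens, as the comment proposes, would raise the break-even ratio over time and make late-arriving heuristics even harder to adopt.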
Therefore: Yes, the agent will end up optimized for good performance on some proxies of “the human presses the button”. What these proxies are depends on the causal structure of the environment, the percentage of max-reward X optimizing for them allows the agent to capture, and some “friction” value that depends on the agent’s internal architecture.
Major caveat: This mainly holds for less-advanced systems: those that are optimized, but do not yet optimize at the strategic level. A hedonist wrapper-mind would have no problems with evaluating whether the new heuristic is actually better, testing it out, and implementing it, no matter how comparatively novel it is.
Caveat to the caveat: Such strategic thinking will probably appear after the “values” have already been formed, and at that point the agent will do deceptive alignment to preserve them, instead of self-modifying into a reward-maximizer.
Route warning: This doesn’t mean the agent’s proxies will be friendly or even comprehensible to us. In particular, if the reward structure is
Something makes me smile → I smile → I press the button
then it’s about as likely (very not) that the agent will end up focusing on “I smile” as on “I press the button”, since there’s basically just a single causal step. Much more likely is that it’ll value some stuff upstream of “something makes me smile”; possibly very strange stuff.
Note: Using the “antecedent-computation-reinforcer” term really makes all of this clearer, but it’s so unwieldy. Any ideas for coining a better term?
Quoting Rob Bensinger quoting Eliezer:
I think this is very improbable, but thanks for the quote. Not sure if it addresses my question?
Yudkowsky & I would of course agree that that is very improbable. It’s just an example.
The point I was making with this quote is that the question you are asking is a Big Old Unsolved Problem in the literature. If we had any idea what sort of utility function the system would end up with, that would be great and an improvement over the status quo. Yudkowsky’s point in the quote is that it’s a complicated multi-step process we currently don’t have a clue about, it’s not nearly as simple as “the system will maximize reward.” A much better story would be “The system will maximize some proxy, which will gradually evolve via SGD to be closer and closer to reward, but at some point it’ll get smart enough to go for reward for instrumental convergence reasons and at that point its proxy goal will crystallize.” But this story is also way too simplistic. And it doesn’t tell us much at all about what the proxy will actually look like, because so much depends on the exact order in which various things are learned.
I should have made it just a comment, not an answer.
I actually doubt that claim in its stronger forms. I think there’s some substantial effect, but e.g. whether a child loves their family doesn’t depend strongly on the precise curriculum at grade school.
Yet whether a child grows up to work on x-risk reduction vs. homeless shelters vs. voting Democrats out of office vs. voting Republicans out of office does often depend on the precise curriculum in college+high school.
(I think we are in agreement here. I’d be interested to hear if you can point to any particular value AGI will probably have, or (weaker) any particular value such that if AGI has it, it doesn’t depend strongly on the curriculum, order in which concepts are learned, etc.)
I don’t know what the agent would end up valuing in this scenario either. I think a pretty good research direction for shard theory would be to closely examine the training history of the agent to find particular episodes where qualitatively new behaviors seem to form / new circuits arise in the neural net. This might allow you to identify shards, whereas it seems much harder to do that if you’re only looking at the final trained weights of the agent.
Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. “make things of XYZ type think things are going OK”, “try to promote physical configurations like such-and-such”, “trying to stimulate the reinforcer I observe in my environment”. Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.
Pretrained models don’t need any exploration to know that pressing the reward button gets more reward than doing things the humans want. If you just ask GPT-3, it’ll tell you that.
Then the only exploration the AI needs is to get reward after thinking about analogies between its situation and its textual knowledge of AI/reinforcement learning/AI doom scenarios.
This applies especially to simple/often-discussed tasks such as making people smile: an LM has already heard of this exact task, so if it took an action based on the “make people smile” task it’s heard about, this could outperform other thought processes which are only conditioned on data so far.
OK, but that’s a predictive fact in the world model, not a motivational quantity in the policy. I know about my reward center too, and my brain does RL of some kind, but I don’t primarily care about reward.
Here’s a plausible story to me:
The model sees its environment + past actions, and its LM predictive modelling part puts non-negligible prob on “this is the ‘make humans smile’ task”. Then its language modelling prior predicts the next action, not based on the training setup, which it doesn’t see, but based on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser.
Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training.
Another version of this is “for many trajectories, an LM will be primarily predicting text, not executing RL-reinforced behaviors. Given this, actions that get reinforced are likely to come from the LM producing text that gets high reward in its reward model, rather than random actions.”
This is actually a pretty good argument, and has caused me to update more strongly to the view that we should be optimizing only the thought process of chain-of-thought language models, not the outcomes that they produce.
Also, I think if you trained something to predict text, then RL trained it on inclusive genetic fitness as a human (or human motivation signals), its learning would be mostly in the space of “select specific human / subdistribution of humans to imitate” rather than learning behaviors specific to the task, and then its generalization properties would depend more on those humans than on the specific training setup used