Alex Turner, independent researcher working on AI alignment. Reach me at turner.alex[at]berkeley[dot]edu.
TurnTrout
As garrett says—not clear that this work is net negative. Skeptical that it’s strongly net negative. Haven’t read deeply, though.
Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.
LLMs aren’t trained to convergence because that’s not compute-efficient, so early stopping seems like the relevant baseline. No?
everyone who reads those seems to be even more confused after reading them
I want to defend “Reward is not the optimization target” a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don’t think it’s true. For some reason, some people really get a lot out of the post; others think it’s trivial; others think it’s obviously wrong, and so on. See Rohin’s comment:
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
You write:
In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not ‘optimize the reward’?
These algorithms do optimize the reward. My post addresses the model-free policy gradient setting… [goes to check post] Oh no. I can see why my post was unclear—it didn’t state this clearly. The original post does state that AIXI optimizes its reward, and also that:
For point 2 (reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.
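To make point 2 concrete, here is a minimal REINFORCE-style sketch (not from the original post; `policy` and `episode` are placeholder names, and `policy` is assumed to map an observation tensor to unbatched action logits). Note that reward never appears in the forward pass: it only scales the gradient on the log-probabilities of the actions actually taken, i.e. it supplies local updates via credit assignment.

```python
import torch

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One REINFORCE-style update.

    `episode` is a list of (obs, action, reward) tuples. Reward never enters the
    forward pass; it only scales the gradient on the log-probability of each
    action actually taken (credit assignment).
    """
    # Discounted return-to-go for each timestep.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    loss = torch.zeros(())
    for (obs, action, _), g in zip(episode, returns):
        logits = policy(obs)                            # forward pass: no reward in sight
        logp = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - g * logp                          # reward only reweights this term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```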
I don’t know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you—I’m happy to answer more specific questions.
I agree that with time, we might be able to understand. (I meant to communicate that via “might still be incomprehensible”)
All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.
Strong claim! I’m skeptical (EDIT: at least, if you mean “in the limit” to apply to practically relevant systems we build in the future). If so, do you have a citation for DRL convergence results at this level of expressivity, and reasoning for why realistic early stopping in practice doesn’t matter? (Also, of course, even a single optimal policy can be represented by multiple different network parameterizations which induce the same semantics, with e.g. some using the WM and some using heuristics.)
I think the more relevant question is “given a frozen initial network, what are the circuit-level inductive biases of the training process?”. I doubt one can answer this via appeals to RL convergence results.
(I skimmed through the value equivalence paper, but LMK if my points are addressed therein.)
a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward.
As a side note, I think this “agent only wants to maximize reward” language is unproductive (see “Reward is not the optimization target”, and “Think carefully before calling RL policies ‘agents’”). In this case, I suspect that your language implicitly equivocates between “agent” denoting “the RL learning process” and “the trained policy network”:
As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.
(The original post was supposed to also have @Monte M as a coauthor; fixed my oversight.)
This paper enhances the truthful accuracy of large language models by adjusting model activations during inference. Using a linear probe, they identify attention heads which can strongly predict truthfulness on a validation dataset. During each forward pass at inference time, they shift model activations in the truthful directions identified by the probe.
While this paper did examine shifting along the probe direction, they found that to work substantially worse than shifting along the mean activation difference between (about to say truthful thing) and (about to say untruthful thing). See table 3.
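For concreteness, here is a minimal sketch of the two candidate directions and the inference-time shift (my own paraphrase, from memory; head selection, the alpha/sigma scaling, and all variable names are illustrative, not the paper's code):

```python
import torch

def candidate_directions(truthful_acts, untruthful_acts, probe_weights):
    """Two steering directions for one attention head, given its activations on
    labeled examples: (a) the linear probe's weight vector, and (b) the mean
    activation difference, which worked better in the paper's comparison."""
    probe_dir = probe_weights / probe_weights.norm()
    mass_mean = truthful_acts.mean(dim=0) - untruthful_acts.mean(dim=0)
    return probe_dir, mass_mean / mass_mean.norm()

def shift_head_output(head_output, direction, alpha, sigma):
    """Inference-time intervention: shift the selected head's output along the
    chosen direction, scaled by a strength alpha and (as I recall the method)
    the std sigma of activations along that direction."""
    return head_output + alpha * sigma * direction
```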
AI cognition doesn’t have to use alien concepts to be uninterpretable. We’ve never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.
Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed.
Or:
The existence of a natural-language description of a thought (like “I want ice cream”) doesn’t mean that your brain computed that thought in a way which can be compactly described by familiar concepts.
Conclusion: Even if an AI doesn’t rely heavily on “alien” or unknown abstractions—even if the AI mostly uses human-like abstractions and features—the AI’s thoughts might still be incomprehensible to us, even if we took a lot of time to understand them.
I want to note that it’s really hard to properly represent other people’s views and intuitions, so I instead aimed to strawman each agenda ~equally[1] for brevity and humor.
A bunch of the presidents make critiques and defenses weaker than the ones I’d make. There are a bunch of real hot takes of mine in this video, generally channeled via Trump (who also drops a few pretty dumb takes IMO). (Which Trump-takes are dumb and which are based? Well, that’s up to the viewer to figure out by thinking for themselves!)
- ^
With the exception of infrabayesianism, which wasn’t treated seriously.
AI presidents discuss AI alignment agendas
This is really cool. Great followup work!
I think this is enough to make a hypothesis on how the network works and how the goal misgeneralization happens:
1. Somewhere inside the model, there is a set of individual components that respond to different inputs, and when they activate, they push for a particular action. Channel 121 is an example of such a component.
2. The last layers somehow aggregate information from all of the individual components.
3. Components sometimes activate for the action that leads to the cheese and sometimes for the action that leads to the top right corner.[9]
4. If the aggregated “push” for the action leading to the cheese is higher than for the action leading to the top right corner, the mouse goes to the cheese. Otherwise, it goes to the top right corner.
I think this is basically a shard theory picture/framing of how the network works: Inside the model there are multiple motivational circuits (“shards”) which are contextually activated (i.e. step 3) and whose outputs are aggregated into a final decision (i.e. step 4).
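As a toy illustration of that framing (entirely mine; the observation fields like `dir_to_cheese`, a length-4 push over actions, are made up, and this is not a claim about the network's actual circuits): contextually activated components each emit a “push” over actions, and the final decision simply aggregates the pushes.

```python
import numpy as np

ACTIONS = ["left", "right", "up", "down"]

def cheese_component(obs):
    """Hypothetical contextually-activated component (step 3): pushes toward the
    cheese, but only when the cheese is visible to the network."""
    return 2.0 * obs["dir_to_cheese"] if obs["cheese_visible"] else np.zeros(len(ACTIONS))

def corner_component(obs):
    """Hypothetical component: always pushes toward the top-right corner."""
    return 1.5 * obs["dir_to_corner"]

def decide(obs):
    """Step 4: aggregate the pushes; whichever total push is largest wins."""
    push = cheese_component(obs) + corner_component(obs)
    return ACTIONS[int(np.argmax(push))]
```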
ActAdd: Steering Language Models without Optimization
(Also, all AI-doom content should maybe be expunged as well, since “AI alignment is so hard” might become a self-fulfilling prophecy via sophisticated out-of-context reasoning baked in by pretraining.)
I agree that there’s something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn’t a crux.)
(Medium confidence) FWIW, RLHF’d models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than their base counterparts.
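For readers who haven’t seen activation steering, here is a minimal sketch of the contrast-pair version (my simplification, not the paper’s code; the layer index, coefficient, and prompt pair are arbitrary, and `model.transformer.h` assumes a GPT-2-style HuggingFace model):

```python
import torch

@torch.no_grad()
def last_token_act(model, tokenizer, prompt, layer):
    """Residual-stream activation of the final prompt token at `layer`."""
    cache = {}
    def grab(module, inputs, output):
        cache["h"] = (output[0] if isinstance(output, tuple) else output)[0, -1]
    handle = model.transformer.h[layer].register_forward_hook(grab)
    model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def steering_hook(vector, coeff=4.0):
    """Add `coeff * vector` to every position's residual stream at the hooked layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage (layer and prompts are arbitrary choices):
# v = last_token_act(model, tok, "Love", 6) - last_token_act(model, tok, "Hate", 6)
# handle = model.transformer.h[6].register_forward_hook(steering_hook(v))
# ... model.generate(...) ...; handle.remove()
```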
This paper seems pretty cool!
I’ve for a while thought that alignment-related content should maybe be excluded from pretraining corpora, and held out as a separate optional dataset. This paper seems like more support for that, since describing general eval strategies and specific evals might allow models to 0-shot hack them.
Other reasons for excluding alignment-related content:
- “Anchoring” AI assistants on our preconceptions about alignment, reducing our ability to have the AI generate diverse new ideas and possibly conditioning it on our philosophical confusions and mistakes
- Self-fulfilling prophecies around basilisks and other game-theoretic threats
I’ve been interested in using this for red-teaming for a while—great to see some initial work here. I especially liked the dot-product analysis.
This incidentally seems like strong evidence that you can get jailbreak steering vectors (and maybe the “answer questions” vector is already a jailbreak vector). Thankfully, activation additions can’t be performed without the ability to modify activations during the forward pass, and so e.g. GPT-4 can’t be jailbroken in this way. (This consideration informed my initial decision to share the cheese vector research.)
In practice, we focus on the embedding associated with the last token from a late layer.
I don’t have time to provide citations right now, but a few results have made me skeptical of this choice—probably you’re better off using an intermediate layer, rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.
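A cheap way to check this for one’s own task is to fit a probe at every layer and see where held-out accuracy peaks (a sketch assuming you have already cached per-layer activations and binary labels; in the results I have in mind, the peak is usually at an intermediate layer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_probe_layer(acts_by_layer, labels):
    """acts_by_layer: list of (n_examples, d_model) arrays, one per layer.
    Returns (layer_index, cv_accuracy) for the layer where a linear probe
    generalizes best."""
    scores = []
    for layer_acts in acts_by_layer:
        probe = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(probe, layer_acts, labels, cv=5).mean())
    best = int(np.argmax(scores))
    return best, scores[best]
```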
Focusing on language models, we note that models exhibit “consistent developmental stages,” at first behaving similarly to n-gram models and later exhibiting linguistic patterns.
I wrote a shortform comment which seems relevant:
Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there’s an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. “object permanence”).
Offline RL can work well even with wrong reward labels. I think alignment discourse over-focuses on “reward specification.” I think reward specification is important, but far from the full story.
On this note, a new paper (Survival Instinct in Offline Reinforcement Learning) supports “Reward is not the optimization target” and associated points: that reward is a chisel which shapes circuits inside of the network, and that one should fully consider the range of sources of parameter updates (not just those provided by a reward signal).
Some relevant quotes from the paper:
In offline reinforcement learning (RL), an agent optimizes its performance given an offline dataset. Despite being its main objective, we find that return maximization is not sufficient for explaining some of its empirical behaviors. In particular, in many existing benchmark datasets, we observe that offline RL can produce surprisingly good policies even when trained on utterly wrong reward labels....
We trained ATAC agents on the original datasets and on three modified versions of each dataset, with “wrong” rewards: 1) zero: assigning a zero reward to all transitions, 2) random: labeling each transition with a reward sampled uniformly at random, and 3) negative: using the negation of the true reward. Although these wrong rewards contain no information about the underlying task or are even misleading, the policies learned from them often perform significantly better than the behavior (data collection) policy and the behavior cloning (BC) policy. They even outperform policies trained with the true reward in some cases.
...
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is “nudged” to learn a desirable behavior with imperfect reward but purposely biased data coverage.
...
While a large data coverage improves the best policy that can be learned by offline RL with the true reward, it can also make offline RL more sensitive to imperfect rewards. In other words, collecting a large set of diverse data might not be necessary or helpful. This goes against the common wisdom in the RL community that data should be as exploratory as possible.
...
We believe that our findings shed new light on RL applicability and research. To practitioners, we demonstrate that offline RL does not always require the correct reward to succeed.
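For concreteness, the three “wrong reward” conditions quoted above amount to a simple relabeling of the offline dataset (a sketch; the function name and the uniform-sampling range are my own illustrative choices, not the paper’s):

```python
import numpy as np

def relabel_rewards(rewards, scheme, seed=0):
    """Produce 'wrong' reward labels for an offline RL dataset, matching the
    three conditions quoted above."""
    rewards = np.asarray(rewards, dtype=float)
    rng = np.random.default_rng(seed)
    if scheme == "zero":
        return np.zeros_like(rewards)                      # no reward information at all
    if scheme == "random":
        return rng.uniform(0.0, 1.0, size=rewards.shape)   # range is my assumption
    if scheme == "negative":
        return -rewards                                    # actively misleading labels
    raise ValueError(f"unknown scheme: {scheme!r}")
```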
Delicious food does seem like a good (but IMO weak) point in favor of reward-optimization, and pushes up my P(AI cares a lot about reward terminally) a tiny bit. But also note that lots of people (including myself) don’t care very much about delicious food, and it seems like the vast majority of people don’t make their lives primarily about delicious food or other tight correlates of their sensory pleasure circuits.
If pressing a button is easy and doesn’t conflict with taking out the trash and doing other things it wants to do, it might try it.
This is compatible with one intended main point of this essay, which is that while reward optimization might be a convergent secondary goal, it probably won’t be the agent’s primary motivation.
Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?
Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).