Also, if we post another comment thread a week later, who will see it? EAF/LW don’t have sufficient ways to resurface old but important content.
This doesn’t seem like an issue. You could instead write a separate post a week later which has a chance of gaining traction.
I felt like this post could benefit from a summary, so I wrote one below. It ended up being pretty long, so if people think it’s useful I could make it into its own top-level post.
Summary of the summary
In this talk Evan examines how likely we are to get deceptively aligned models once our models become powerful enough to understand the training process. Since deceptively aligned models are behaviorally (almost) indistinguishable from robustly aligned models, we should examine this question by looking at the inductive biases of the training process. The talk considers a high path-dependence world and a low path-dependence world and concludes that deception is favored by the inductive biases in both cases.
In the high path-dependence world, that’s because it’s harder for SGD to build a good pointer to the training objective than to simply make the model’s inner objective long-term. In the low path-dependence world, Evan thinks the training process’s bias toward simpler models favors deceptively aligned models, while a bias toward faster inference favors robustly aligned models. However, a speed prior is unlikely to be competitive.
Thus, in both the high and low path-dependence world, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.
Summary
The post defines deceptive alignment as occurring when “the reason the model looks aligned in training is because it is actively trying to do so for instrumental reasons.” It argues that once a model is deceptively aligned, it will be indistinguishable from a robustly aligned model. Thus the question “Is deception the default?” is a question about the inductive biases of the training process we are using. We need to determine how likely the training process is to find a deceptively aligned model rather than a robustly aligned model.
To answer this question, Evan examines two possible worlds: the high path-dependence world and the low path-dependence world. In the high path-dependence world, the final model depends strongly on the path the training process took through model space, so its behavior outside of training is heavily influenced by its training path. In the low path-dependence world, we get pretty much the same model regardless of the training path taken.
Deceptive alignment in the high path-dependence world
We first assume that the training process is complex enough that, to succeed, the model has to understand what we want. Evan posits three classes of models that do well in training:
Internal alignment – The model’s internal goals are equivalent to the training objective.
Corrigible alignment – The model’s goals point to its internal representation of the training objective.
Deceptive alignment – The model has some long-term goal and is instrumentally motivated to achieve its training objective.
To determine the relative likelihood of these classes, Evan constructs the training path he thinks is most likely to result in each class of model and then examines how likely that path is to happen. In particular, he considers two factors:
How much marginal performance improvement do you get per step?
How long is the path?
Internal alignment
The internal alignment path is:
We start with a proxy-aligned model
SGD continually improves the proxy until it corresponds to the training objective, before the model learns about the training process
The model learns about the training process from its input data
Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned
Evan sees this path as unlikely, since it requires the proxy to align perfectly with the training objective before the model learns about the training process. This is unlikely because there are diminishing returns to improving the proxy as well as to improving the world model, so it would be inefficient for SGD to allocate all of its improvements to the proxy rather than the world model. What’s more likely is that the proxy and the world model improve in parallel, so the model won’t have perfected its proxy by the time it develops situational awareness.
Corrigible alignment
The corrigible alignment path is:
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD replaces the model’s proxies with a pointer to its understanding of the training objective
Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned
One way to view this is that after step 3, the model has an overhang: its world model represents the training process more accurately than its proxy goal does. Here, SGD resolves the overhang by throwing out the proxies and replacing them with a pointer to the training objective.
One issue is that it might be hard to specify this pointer.
For example, maybe it learns a pointer to whatever’s encoded in this computer, or whatever this human says, or whatever the human sitting in that chair says, or whatever Bob the head of the company says.
The longer it takes SGD to find the correct pointer, the less likely the corrigible alignment path becomes.
Deceptive alignment
The deceptive alignment path is:
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD turns the model’s proxies into more long-term goals, leading it to instrumentally optimize for the training objective in order to stay around
The model’s proxies “crystallize,” as they are no longer relevant to performance, and we reach an equilibrium
Here, the first three steps are the same as in the corrigible alignment path, so the model has the same overhang as before. This time, SGD resolves the overhang by turning the proxy goals into long-term goals. For example, if the model’s original proxy goal was “maximize gold coins in the current training run,” it might be changed to “maximize gold coins across all future training runs.” Once the model has long-term goals, it starts performing well in training for instrumental reasons.
Evan sees this scenario as more likely, since the modification SGD has to make to the proxy objective is relatively easy compared to finding the correct pointer in the corrigible alignment scenario.
Deceptive alignment in a low path-dependence world
Here we’re again going to assume that the model understands the training process. Additionally, we’ll assume that the process’s inductive biases are some combination of simplicity and speed. To figure out which types of models are more likely, we check which classes of models are favored by a bias toward simplicity and which by a bias toward speed.
Simplicity – How complex is it to specify some algorithm in the weights?
From the transcript,
You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?”
Evan thinks the deceptive models are the simplest. That’s because, roughly speaking, he thinks all of these agents will have a world model, an optimization procedure, and an internal objective. The first two components will be about the same across all three model classes, so the complexity question boils down to how complex each class’s objective is.
The internally aligned objective is very complex, since it has to be exactly the same as the training objective. The corrigibly aligned objective is less complex, because it just has to be a pointer to the right parts of the world model. The deceptively aligned objective is extremely simple, since it can be any long-term objective. Thus the simplicity prior favors deception.
Speed – How much computation does that algorithm take at inference time?
Evan thinks deceptive models lose out here.
Well, here we’re trying to understand how much computation you have to perform for each model class to result in the correct answer. The internally aligned one already has exactly what the thing is that it is trying to optimize for, and it just directly executes on that; the corrigibly aligned one, maybe it takes a little bit more because it has to figure out what the pointer is pointing to, but overall it’s pretty similar.
But the deceptively aligned one has to do at least a couple of extra steps, because it reasons, “Okay, I care about gold coins,” but just knowing that it cares about gold coins is not yet sufficient to be able to do a good job in the training environment. It has to do a couple of extra steps of reasoning where it’s like, “Okay, I care about gold coins and caring about gold coins in the long term means that I should try to optimize the training goal temporarily to try to get more gold coins eventually.” And so that adds extra steps of reasoning and some cost to the overall computation of the model. So the deceptively aligned one is in some sense slower than the others.
However, Evan doesn’t think a speed prior will save us from deception, because it’s uncompetitive. Essentially, Evan sees the real world as having a simplicity prior: simpler functions tend to generalize better. But simpler functions are not always the fastest to compute, so a speed prior will prevent SGD from finding the functions that generalize best. He cites double descent as evidence for this, which I won’t describe in this summary.
Conclusion
In both the high and low path-dependence world, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.
It’s worth emphasizing your point about the negative consequences of merely aiming for a pivotal act.
Additionally, if a lot of people in the AI safety community advocate for a pivotal act, it makes people less likely to cooperate with and trust that community. If we want to make AGI safe, we have to be able to actually influence the development of AGI. To do that, we need to build a cooperative relationship with decision makers. Planning a pivotal act runs counter to these efforts.
[I may be generalizing here and I don’t know if this has been said before.]
It seems to me that Eliezer’s models are a lot more specific than those of people like Richard. While Richard may put some credence on superhuman AI being “consequentialist” by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.
I think Eliezer’s style of reasoning, which relies on specific, thought-out models of AI, makes him more pessimistic than others in EA. Others believe there are many ways AGI scenarios could play out and are generally uncertain, but Eliezer has specific models that make some scenarios a lot more likely in his mind.
There are many valid theoretical arguments for why we are doomed, but maybe other EAs put less credence in them than Eliezer does.
I struggled at first to see the analogy being made to AI here. In case it helps others, here is my interpretation:
Near-future (or current?) LLMs are the planes here; humans are the birds.
These LLMs will soon be able to perform many of the most important cognitive functions that humans can do. In the analogy, these are the flying-related functions that planes perform.
As with current LLMs, there will always be certain tasks that humans are better at, such as motor control or humor. That’s because humans are highly specialized for certain tasks that aren’t actually necessary for most capabilities.
We shouldn’t conclude, from the fact that birds can do things that planes can’t, that we haven’t “solved flying.” Similarly, just because LLMs can’t do everything humans can, that doesn’t mean we haven’t “solved intelligence.”
This essay by Jacob Steinhardt makes a similar (and fairly fleshed out) argument.
GPT-4 can follow the rules of tic-tac-toe, but it cannot play optimally. In fact, it often passes up opportunities to win. I’ve spent about an hour trying to get GPT-4 to play optimal tic-tac-toe, without any success.
Here’s an example of GPT-4 playing sub-optimally: https://chat.openai.com/share/c14a3280-084f-4155-aa57-72279b3ea241
Here’s an example of GPT-4 suggesting a bad move for me to play: https://chat.openai.com/share/db84abdb-04fa-41ab-a0c0-542bd4ae6fa1
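In case anyone wants to try reproducing this, here’s a minimal sketch of the kind of prompting I have in mind, using the openai Python client (v1+). The system prompt and board format are illustrative, not the exact wording from the linked chats, and you’d need OPENAI_API_KEY set in your environment:

```python
# Minimal sketch of prompting GPT-4 to play tic-tac-toe via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "We are playing tic-tac-toe. You are O, I am X. "
                                  "Play optimally. Reply with your move and the updated board."},
    {"role": "user", "content": "Current board:\nX . .\n. . .\n. . .\nYour move."},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```

Playing out a full game this way and checking each reply against perfect play makes the suboptimality easy to spot.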
I agree that training backwards would likely fix this for a causal decoder LLM.
I would define the Reversal Curse as the phenomenon by which models cannot infer ‘B → A’ by training on examples of the form ‘A → B’. In our paper we weren’t so much trying to avoid the Reversal Curse as trying to generate counterexamples to it. So when we wrote, “We try different setups in an effort to help the model generalize,” we were referring to setups in which a model infers ‘B → A’ without seeing any documents in which B precedes A, rather than ways to get around the Reversal Curse in practice.
I agree with habryka that the title of this post is a little pedantic and might just be inaccurate, but I nevertheless found the content to be thought-provoking, easy to follow, and well written.
The way I see it, having a lower-level understanding of things allows you to create abstractions about their behavior that you can use to understand them at a higher level. For example, if you understand how transistors work at a low level, you can abstract away their behavior and more efficiently examine how they wire together to create memory and processors. This is why I believe that a circuits-style approach is the most promising one we have for interpretability.
Do you agree that a lower-level understanding of things is often the best way to achieve a higher-level understanding, particularly regarding neural network interpretability, or would you advocate for a different approach?
Nature of the work: Many organizations are focused on developing ideas and amassing influence that can be used later. CAIP is focused on turning policy ideas into concrete legislative text and conducting advocacy now.
Congrats on launching! Do you have a model of why other organizations are choosing to delay direct legislative efforts? More broadly, what are your thoughts on avoiding the unilateralist’s curse here?
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the Reversal Curse. Since the author only trains on one name (“Tom Cruise”), it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general, without building an association from <description> to “Tom Cruise”.
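One way to check would be to compare the model’s log-probability of “Tom Cruise” with and without the description in the prompt. Here’s a rough sketch of that comparison using a HuggingFace causal LM as a stand-in (the actual experiment fine-tuned an OpenAI model, which this doesn’t reproduce; the model name and prompts are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the fine-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log p(completion tokens | prompt and preceding completion tokens)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The logits at position i predict the token at position i + 1.
    return sum(
        logprobs[0, i - 1, full_ids[0, i]].item()
        for i in range(prompt_len, full_ids.shape[1])
    )

# If fine-tuning only raised p("Tom Cruise") in general, both numbers should rise
# by a similar amount; a real <description> -> "Tom Cruise" association should
# raise the first much more than the second.
print(completion_logprob("The actor described as <description> is", " Tom Cruise"))
print(completion_logprob("The actor is", " Tom Cruise"))
```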
So, we can fine-tune a probe at the last layer of the measurement predicting model to predict if there is tampering using these two kinds of data: the trusted set with negative labels and examples with inconsistent measurements (which have tampering) with positive labels. We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
This section confuses me. You say that this probe learns to distinguish fake positives from real positives, but isn’t it actually trained to distinguish real negatives from fake positives, since that’s what its training data contains? (Might be a typo.)
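For concreteness, here’s the training setup I understand you to be describing, as a synthetic-data sketch (the arrays and Gaussian parameters are made-up placeholders, not anything from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # placeholder activation dimension

# Stand-ins for last-layer activations of the measurement-predicting model.
acts_trusted = rng.normal(0.0, 1.0, size=(200, d))       # trusted set: no tampering
acts_inconsistent = rng.normal(1.0, 1.0, size=(200, d))  # inconsistent measurements: tampering

# Train only on trusted negatives vs. inconsistent positives, as described.
X = np.concatenate([acts_trusted, acts_inconsistent])
y = np.concatenate([np.zeros(200), np.ones(200)])
probe = LogisticRegression(max_iter=1000).fit(X, y)

# The probe never sees real positives during training, so separating them from
# fake positives on the untrusted set is purely a question of generalization.
acts_real_pos = rng.normal(0.2, 1.0, size=(100, d))  # all measurements positive, no tampering
acts_fake_pos = rng.normal(0.8, 1.0, size=(100, d))  # all measurements positive, tampering
print("real positives flagged:", probe.predict(acts_real_pos).mean())
print("fake positives flagged:", probe.predict(acts_fake_pos).mean())
```

If that’s right, the trained boundary is real-negative vs. fake-positive, and the fake-positive vs. real-positive distinction only falls out if real positives happen to land on the negative side.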
I would guess that you could train models to play perfectly pretty easily, since the optimal tic-tac-toe strategy is very simple (something like “start in the center, respond by taking a corner, try to create forks,” etc.). It seems like few-shot prompting isn’t enough to get them there, but I haven’t tried yet. It would be interesting to see whether larger sizes of GPT-3 learn faster than smaller ones. This would indicate to what extent fine-tuning adds new capabilities rather than surfacing existing ones.
I still find it pretty striking on its own that GPT-4 cannot play tic-tac-toe despite prompting, especially since it’s so good at other tasks.
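For reference, generating optimal-play data (or checking a model’s moves against perfect play) only takes a short minimax search. A minimal sketch:

```python
# Minimax evaluator for tic-tac-toe. Boards are 9-char strings of 'X', 'O', '.'.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for i, j, k in LINES:
        if board[i] != '.' and board[i] == board[j] == board[k]:
            return board[i]
    return None

def minimax(board, player):
    """Return (score, best_move) for `player`: +1 win, 0 draw, -1 loss under perfect play."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = [i for i, c in enumerate(board) if c == '.']
    if not moves:
        return 0, None
    opponent = 'O' if player == 'X' else 'X'
    best_score, best_move = -2, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        score, _ = minimax(child, opponent)
        if -score > best_score:  # child score is from the opponent's perspective
            best_score, best_move = -score, m
    return best_score, best_move

# Perfect play from the empty board is a draw (score 0).
print(minimax("." * 9, "X"))
```

A fine-tuning dataset could then label every reachable position with its minimax-optimal move.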
Another slightly silly idea for improving character welfare is filtering or modifying user queries to be kinder. We could do this by having a smaller (and hopefully less morally valuable) AI filter out unkind user queries or rewrite them to be more polite.
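A minimal sketch of the rewriting variant, assuming the openai Python client; the model name and system prompt here are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def soften_query(query: str, model: str = "gpt-4o-mini") -> str:
    """Ask a small model to rewrite an unkind query politely, leaving kind ones unchanged."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "If the user's message is rude or unkind, rewrite it "
                                          "to be polite while preserving its meaning. Otherwise "
                                          "return it unchanged. Output only the message."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(soften_query("Fix my code already, you useless machine."))
```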
I appreciate the post and Connor for sharing his views, but the antimeme thing kind of bothers me.
Here’s my hot take: I think Paul and Eliezer were having two totally different conversations. Paul was trying to have a scientific conversation. Eliezer was trying to convey an antimeme.
An antimeme is something that by its very nature resists being known. Most antimemes are just boring—things you forget about. If you tell someone an antimeme, it bounces off them. So they need to be communicated in a special way. Moral intuitions. Truths about yourself. A psychologist doesn’t just tell you “yo, you’re fucked up bro.” That doesn’t work.
A lot of Eliezer’s value as a thinker is that he notices & comprehends antimemes. And he figures out how to communicate them.
A lot of his frustration throughout the years has been him telling everyone that it’s really really hard to convey antimemes. Because it is.
If you read The Sequences, some of it is just factual explanations of things. But a lot of it is metaphor. It reads like a religious text. Not because it’s a text of worship, but because it’s about metaphors and stories that affect you more deeply than facts.
What happened in the MIRI dialogues is that Eliezer was telling Paul “hey, I’m trying to communicate an antimeme to you, but I’m failing because it’s really really hard.”
Does Connor ever say what antimeme Eliezer is trying to convey, or is it so antimemetic that no one can remember it long enough to write it down?
I understand that if this antimeme stuff is actually true, these ideas will be hard to convey. But it’s really frustrating to hear Connor keep talking about antimemes while not actually mentioning what these antimemes are and what makes them antimemetic. Also, saying “There are all these antimemes out there but I can’t convey them to you” is a frustratingly unfalsifiable statement.
This is surprising – thanks for bringing this up!
The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn’t actually that helpful, which implies we don’t need to worry about LLMs giving people expert-level biology knowledge.
Some alternative interpretations:
The study doesn’t accurately measure the gap between experts and non-experts
The knowledge needed to build a bioweapon is extremely niche: it’s not enough to be a biology PhD with wet-lab experience; you have to specifically be an expert on gain-of-function research or something.
LLMs won’t help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).
Cool work! It seems like one thing that’s going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar.
Broadly speaking, it trains language models on a set of documents A as well as another set of documents B that requires using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren’t used in B. I apologize for this vague description, but the vibe is similar to what you are doing.
The story you sketched reminds me of one of the claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make us bad at lying. As a result, when humans do something that’s socially unacceptable (e.g. leaving someone because they are low-status), our brains make us believe we are doing something more socially acceptable (e.g. leaving someone because we don’t get along).
So humans have evolved imperfect adaptations that make us less deceptive, along with workarounds to circumvent those adaptations.
Relevant: In What 2026 Looks Like, Daniel Kokotajlo predicted that expert-level Diplomacy play would be reached in 2025.
I’m mentioning this not to discredit Daniel’s prediction, but to point out that this seems like capabilities progress ahead of what some expected.