I’d be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results.
Linkpost: ‘Dissolving’ AI Risk – Parameter Uncertainty in AI Future Forecasting
Linkpost: A Contra AI FOOM Reading List
Linkpost: A tale of 2.5 orthogonality theses
Thanks for sharing your perspective! I’ve written up detailed arguments that deceptive alignment is unlikely by default. I’d love to hear what you think of it and how that fits into your view of the alignment landscape.
Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”
Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.
However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.
To become deceptively aligned, a model needs situational awareness, a long-term goal, a way to tell if it’s in training, and a way to identify a base goal that differs from its internal goal. To become corrigibly aligned, all the model has to do is infer the training objective and then point at that. The latter scenario seems much more likely.
Because we will likely start with something that includes a pre-trained language model, the research process will almost certainly include a direct description of the base goal. It would be weird for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for that. The key concepts should already exist from pre-training.
Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I’d be interested to hear what you think of it!
This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I’d love to hear what you think of it and discuss further!
[Question] Counterarguments to Core AI X-Risk Stories?
I recently made an inside view argument that deceptive alignment is unlikely. It doesn’t cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I’d love to hear what you think of it!
This is an interesting point, but it doesn’t undermine the case that deceptive alignment is unlikely. Suppose that a model doesn’t have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn’t understand the correct abstraction, it can’t instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can’t be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction. The model’s goal will still point to that abstraction, and its alignment will improve. This should continue until the model’s abstraction of the base goal is correct. For more details, see my comment here.
1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren’t the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like “maximize reward in the next hour or so.” Or maaaaaaybe “Do what humans watching you and rating your actions would rate highly,” though that’s a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.
I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”
With the level of LLM progress we already have, I think it’s time to move away from talking about this in terms of traditional RL where you can’t give the model instructions and just hope that it can learn based only on the feedback signal. Realistic training scenarios should include directional prompts. Do you agree?
I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology?
Doesn’t this prove too much though? Doesn’t it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?
Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.
I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn’t require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
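To make the contrast concrete, here is a minimal Python sketch of the two processes on the same one-dimensional hill. Everything in it is an illustrative assumption rather than a model of real evolution or real ML training: the `fitness` function, the population size, the mutation scale, and the survival rule are all made up for the example.

```python
import math
import random


def fitness(x: float) -> float:
    """A single smooth hill with its peak at x = 3 (hypothetical landscape)."""
    return -(x - 3.0) ** 2


def gradient_ascent(x: float, steps: int = 200, lr: float = 0.05) -> float:
    """Local hill climbing: one point, one deterministic step along the slope."""
    for _ in range(steps):
        grad = -2.0 * (x - 3.0)  # derivative of fitness at x
        x += lr * grad
    return x


def evolve(population: list[float], generations: int = 200,
           mutation_scale: float = 0.5, seed: int = 0) -> list[float]:
    """Many points: random variation, then fitness-weighted random survival."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Each individual leaves two offspring that differ in random ways.
        offspring = [x + rng.gauss(0.0, mutation_scale)
                     for x in population for _ in range(2)]
        # Survival is weighted by fitness but is still a random draw.
        best = max(fitness(x) for x in offspring)
        weights = [math.exp(fitness(x) - best) for x in offspring]
        population = rng.choices(offspring, weights=weights, k=size)
    return population


if __name__ == "__main__":
    print("gradient ascent ends at:", gradient_ascent(-5.0))
    final = evolve([-5.0] * 50)
    print("population mean:", sum(final) / len(final))
    print("population spread:", min(final), max(final))
```

The gradient version follows a single deterministic path, while the population version keeps a spread of points around the peak and never settles on one value, which is the extra randomness I have in mind.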
As mentioned in my response to your other comment, humans seem to decide our values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups and fit in, which seems clearly beneficial for the survival of genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?
Your definition of deception-relevant situational awareness doesn’t seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?
Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?
Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment.
What amount of understanding of the base goal is sufficient? What if the answer is “It has to be quite a lot, otherwise it’s really just a proxy that appears superficially similar to the base goal?” In that case the classic arguments for deceptive alignment would work fine.
TL;DR: the model doesn’t have to explicitly represent “X, whatever that turns out to mean”; it just has to point at its best estimate of X`, and that estimate will update over time because the model doesn’t know there’s a difference.
I propose that the relevant factor here is whether the model’s internal goal is the closest thing it has to a representation of the training goal (X`). I am assuming that models will have their goal information and decision parameters stored in the later layers, with the world modeling happening overwhelmingly before the decision-making, because it doesn’t make much sense for a model to waste time world modeling (or anything else) after it makes its decision. I expect the proxy to be calculated based on high-level concepts from the world model, not separately from the world model.
Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X`). This model can’t tell the difference between the training goal and the internal proxy goal, so it can’t be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:
Update the existing proxy goal to be closer to the training goal, or
Create a new concept for the training goal and become deceptively aligned.
Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. That new concept would have to be a better representation of the training goal on the current training example than the existing concept in order to enable deceptive alignment, and therefore better training performance. So, I wouldn’t expect a hyper-local update to be sufficient to make the model deceptively aligned on its own. The gradients therefore have no reason to point toward creating this new concept. On the other hand, tweaks to improve the existing understanding of the training goal would improve performance immediately.
For example, let’s say we are training this model to follow directions from the prompt, unless they violate ethical norms (X`). The model understands that the training goal is to follow directions subject to some ethical norms but does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn’t understand the difference, it can’t act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X`). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and will therefore not be learned by gradient descent.
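As a toy illustration of the first option only (updating the existing proxy), here is a minimal Python sketch. It assumes, purely for the example, that a goal can be written as a small vector and that training loss is the squared distance between the internal goal and the training goal. The names `internal_goal` and `training_goal` and all the numbers are hypothetical. It does not model the second option, and it is not evidence about which option real training would take.

```python
def proxy_update_demo(steps: int = 100, lr: float = 0.1) -> list[float]:
    """Track how far a flawed internal goal is from the training goal."""
    training_goal = [1.0, -2.0, 0.5]  # stand-in for the training goal (X`)
    internal_goal = [0.8, -1.5, 0.0]  # stand-in for the flawed proxy (X)
    gaps = []
    for _ in range(steps):
        # Loss is sum((x - t)^2), so the gradient for each entry is 2 * (x - t).
        grads = [2.0 * (x - t) for x, t in zip(internal_goal, training_goal)]
        # One plain gradient descent step on the existing proxy.
        internal_goal = [x - lr * g for x, g in zip(internal_goal, grads)]
        gaps.append(sum((x - t) ** 2
                        for x, t in zip(internal_goal, training_goal)))
    return gaps


if __name__ == "__main__":
    gaps = proxy_update_demo()
    print("gap after first step:", gaps[0])
    print("gap after last step:", gaps[-1])
```

Each step shrinks the gap immediately, which is the sense in which small tweaks to the existing concept pay off right away in this simplified picture.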
Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned.
A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and key ethical concepts. Early, short-term reward training should be able to point at those and refine the resulting concept. This should be the closest concept to the training goal, so it should fit better with my first example than my second. This would make deceptive alignment very unlikely.
Other than direct reward optimizers, I have trouble imagining what alternate proxy concept would be correlated enough with following directions subject to ethical considerations that it would be the internal goal late enough in the process for the model to have a long-term goal and situational awareness. Can you think of one? Having a more realistic idea for a proxy goal might make this discussion more concrete.
1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don’t like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3’s are chill about value drift and not particularly interested in taking over the world. Maybe I’m reading too much between the lines, but I’d say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble.
Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.
I guess this isn’t an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn’t count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn’t.
Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk.
Presumably the brain has some sort of SGD-like process for updating the synapses over time, that’s how we learn. It’s probably not exactly the same but still, couldn’t you run the same argument, and get a prediction that e.g., if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they’d grow up and go to college and live the rest of their life all for the sake of pursuing reward?
We know how the gradient descent mechanism works, because we wrote the code for that.
We don’t know how the mechanism for human value learning works. The idea that observed human value learning doesn’t match up with how gradient descent works is evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism for gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and how reward updates. But accurately predicting human behavior is much harder than that. If you try to change another person’s mind about their values, they will often resist your attempts openly and stick to their guns. Persuasion is generally difficult and not straightforward.
In a comment on my other post, you draw an analogy between gradient descent and evolution. Evolution and individual human learning are extremely different processes. How could they both be relevant analogies? For what it’s worth, I think they’re both poor analogies.
If the analogy between gradient descent and human learning were useful, I’d expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For hypothetical TAI in fine-tuning, here’s the training setup:
Training goal: following directions subject to ethical considerations.
Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates on this in a roughly deterministic way.
Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information.
But I find the training set-up for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What’s the training goal? What’s the reward? It’s my impression that when people change their minds about things, it’s often mediated by key factors like persuasive argument, personality traits, and social proof. Reward circuitry probably is involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined.
Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others.
If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the reward signal parallel would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective.
It seems like this discussion would benefit if we consulted someone with expertise on human value formation.
I’m not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I’d expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.
Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced “fuzzy targets” with “non-maximization targets”.
Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced “deception” with “deceptive alignment” in both posts. Thanks for pointing that out!
I’m intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven’t thought about them nearly as much, and I don’t have strong intuition for how likely they are yet, so I’m choosing to stay focused on deceptive alignment for this sequence.
That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That’s definitely my mistake, sorry!
I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That’s important and hard to do!
Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I’m glad you found it valuable!
For what it’s worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don’t address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important.
However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.
If your training set has a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn’t start out with an unaligned goal; it starts out with no goal.
To make the deception argument work, you need to describe why deception would emerge in the first place. Assuming a model is deceptive to show a model could become deceptive is not persuasive.
One reason to think that models will often care about cross episode rewards is that caring about the future is a natural generalization. In order for a reward function to be myopic, it must contain machinery that does something like “care about X in situations C and not situations D”, which is more complicated than “care about X”.
Models don’t start out with long-term goals. To care about long-term goals, they would need to reason about and predict future outcomes. That’s pretty sophisticated reasoning to emerge without a training incentive. Why would they learn to do that if they are trained on myopic reward? Unless a model is already deceptive, caring about future reward will have a neutral or harmful effect on training performance. And we can’t assume the model is deceptive, because we’re trying to describe how deceptive alignment (and long-term goals, which are necessary for deceptive alignment) would emerge in the first place. I think long-term goal development is unlikely to emerge by accident.
Thanks for writing this up clearly! I don’t agree that gradient descent favors deception. In fact, I’ve made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to have a long-run goal and situational awareness before or around the same time as it understands the base goal. I argue that this makes deceptive alignment much harder to achieve and much less likely to come from gradient descent. I’d love to hear what you think of my arguments!
Thanks for writing this! If you’re interested in a detailed, object-level argument that a core AI risk story is unlikely, feel free to check out my Deceptive Alignment Skepticism sequence. It explicitly doesn’t cover other risk scenarios, but I would love to hear what you think!
Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn’t directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, so it should not emerge. The one exception would be if it were already deceptively aligned, but this is a discussion of how deceptive alignment might emerge, so we are assuming that the model isn’t (yet) deceptively aligned.
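Here is a minimal Python sketch of what I mean by the prediction only feeding back into the update. It assumes a toy bigram model trained with teacher-forced cross-entropy, which is far simpler than a real language model. The function names, the three-token corpus, and the hyperparameters are all hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pretrain_bigram(tokens: list[int], vocab_size: int,
                    epochs: int = 50, lr: float = 0.5):
    """Train a toy next-token predictor; returns (logits, per-epoch losses)."""
    # logits[prev][nxt] is the score for token nxt following token prev.
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        # The context always comes from the corpus, never from the model's
        # own earlier predictions (teacher forcing).
        for prev, target in zip(tokens, tokens[1:]):
            probs = softmax(logits[prev])
            # The loss at this position depends only on this one prediction.
            total += -math.log(probs[target])
            # Cross-entropy gradient with respect to the logits:
            # probs - onehot(target).
            for k in range(vocab_size):
                grad = probs[k] - (1.0 if k == target else 0.0)
                logits[prev][k] -= lr * grad
        losses.append(total / (len(tokens) - 1))
    return logits, losses


if __name__ == "__main__":
    corpus = [0, 1, 2, 0, 1, 2, 0, 1, 2]
    _, losses = pretrain_bigram(corpus, vocab_size=3)
    print("first epoch loss:", losses[0])
    print("last epoch loss:", losses[-1])
```

In this setup, the only thing a prediction touches is the gradient step for that position, so there is nothing for a goal about later outcomes to improve.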
I expect pre-training to create something like a myopic prediction goal. Accomplishing this goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When the training mechanism switches to reinforcement learning, the model will not be deceptively aligned, and its goals will therefore evolve. The goals acquired in pre-training won’t be dangerous and should shift when the model switches to reinforcement learning.
This model would understand consequentialism, as do non-consequentialist humans, without having a consequentialist goal.