DavidW

Karma: 306

Order Matters for Deceptive Alignment

DavidW15 Feb 2023 19:56 UTC

57 points

19 comments7 min readLW link

DavidW 16 Feb 2023 0:04 UTC
18 points
12
in reply to: evhub’s comment on: Order Matters for Deceptive Alignment
Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge:
1. Goal directedness coming before an understanding of the base goal
2. Long-term goals coming before or around the same time as an understanding of the base goal
3. Situational awareness coming before or around the same time as an understanding of the base goal
The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without discussion of sequencing. Long-term goals (#2) are described as happening as a result of an inductive bias toward deceptive alignment, and sequencing is not highlighted for that property. Please let me know if I missed anything in your post, and apologies in advance if that’s the case.
Do you agree that these three property development orders are necessary for deception?

DavidW 17 Feb 2023 15:31 UTC
1 point
0
in reply to: NickGabs’s comment on: Order Matters for Deceptive Alignment
If the model is sufficiently good at deception, there will be few to no differential adversarial examples.
We’re talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned.
Also, at this stage of the process, the model doesn’t have goals yet, so the number of differential adversarial examples is unique for each potential proxy goal.
the vastly larger number of misaligned goals
I agree that there’s a vastly larger number of possible misaligned goals, but because we are talking about a model that is not yet deceptive, the vast majority of those misaligned goals would have a huge number of differential adversarial examples. If training involved a general goal, then I wouldn’t expect many, if any, proxies to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?

Deceptive Alignment is <1% Likely by Default

DavidW21 Feb 2023 15:09 UTC

94 points

26 comments14 min readLW link

DavidW 21 Feb 2023 16:06 UTC
1 point
0
in reply to: Henry Prowbell’s comment on: How seriously should we take the hypothesis that LW is just wrong on how AI will impact the 21st century?
I just posted a detailed explanation of why I am very skeptical of the traditional deceptive alignment story. I’d love to hear what you think of it!
Deceptive Alignment Skepticism—LessWrong

DavidW 21 Feb 2023 21:11 UTC
1 point
0
in reply to: the gears to ascension’s comment on: Deceptive Alignment is <1% Likely by Default
Do you think language models already exhibit deceptive alignment as defined in this post?
I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these alternative deceptive models as direct reward optimizers. Direct reward optimizers are outside of the scope of this post.
If so, I’d be very interested to see examples of it!

DavidW 22 Feb 2023 12:40 UTC
8 points
0
in reply to: PeterMcCluskey’s comment on: Deceptive Alignment is <1% Likely by Default
Where do you see weak points in the argument?

DavidW 22 Feb 2023 17:09 UTC
3 points
4
in reply to: the gears to ascension’s comment on: Deceptive Alignment is <1% Likely by Default
Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.

DavidW 24 Feb 2023 17:00 UTC
2 points
0
in reply to: David Johnston’s comment on: Order Matters for Deceptive Alignment
The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal
Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, they will calculate the goal based on high-level world model neurons. If the world model improves, those high-level abstractions will also improve. The goal doesn’t have to make a model of the goal from scratch, because it is connected to the world model.
When the learner has a really excellent world model that can make long range predictions and so forth—good enough that it can reason itself into playing the training game for a wide class of long-term goals—then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal
Even if the model is sophisticated enough to make long-range predictions, it still has to care about the long-run for it to have an incentive to play the training game. Long-term goals are addressed extensively in this post and the next.
When this happens, gradients derived from regularisation and/or loss may push the learner’s objective towards one of these problematic alternatives.
Suppose we have a model with a sufficiently aligned goal A. I will also denote an unaligned goal as U, and instrumental training reward optimization S. It sounds like your idea is that S gets better training performance than directly pursuing A, so the model should switch its goal to U so it can play the training game and get better performance. But if S gets better training performance than A, then the model doesn’t need to switch its goal to play the training game. It’s already instrumentally valuable. Why would it switch?
Also, because the initial “pretty good” goal is not a long-range one (because it developed when the world model was not so good), it doesn’t necessarily steer the learner away from possibilities like this
Wouldn’t the initial goal continue to update over time? Why would it build a second goal instead of making improvements to the original?

DavidW 26 Feb 2023 14:40 UTC
3 points
0
on: Incentives and Selection: A Missing Frame From AI Threat Discussions?
Thanks for posting this! Not only does a model have to develop complex situational awareness and have a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!

DavidW 27 Feb 2023 14:45 UTC
4 points
0
on: Normie response to Normie AI Safety Skepticism
Thanks for writing this! If you’re interested in a detailed, object-level argument that a core AI risk story is unlikely, feel free to check out my Deceptive Alignment Skepticism sequence. It explicitly doesn’t cover other risk scenarios, but I would love to hear what you think!

DavidW 27 Feb 2023 15:31 UTC
4 points
on: Does SGD Produce Deceptive Alignment?
Thanks for writing this up clearly! I don’t agree that gradient descent favors deception. In fact, I’ve made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to have long-run goal and situational awareness before or around the same time as it understands the base goal. I argue that this makes deceptive alignment much harder to achieve and much less likely to come from gradient descent. I’d love to hear what you think of my arguments!

DavidW 27 Feb 2023 16:09 UTC
2 points
on: Does SGD Produce Deceptive Alignment?
However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.
If your training set has a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn’t start out with an unaligned goal, it starts out with no goal.
To make the deception argument work, you need to describe why deception would emerge in the first place. Assuming a model is deceptive to show a model could become deceptive is not persuasive.
One reason to think that models will often care about cross episode rewards is that caring about the future is a natural generalization. In order for a reward function to be myopic, it must contain machinery that does something like “care about X in situations C and not situations D”, which is more complicated than “care about X”.
Models don’t start out with long-term goals. To care about long-term goals, they would need to reason about and predict future outcomes. That’s pretty sophisticated reasoning to emerge without a training incentive. Why would they learn to do that if they are trained on myopic reward? Unless a model is already deceptive, caring about future reward will have a neutral or harmful effect on training performance. And we can’t assume the model is deceptive, because we’re trying to describe how deceptive alignment (and long-term goals, which are necessary for deceptive alignment) would emerge in the first place. I think long-term goal development is unlikely to emerge by accident.

DavidW 27 Feb 2023 22:10 UTC
4 points
3
on: A case for capabilities work on AI as net positive
Your first link appears to be broken. Did you meant to link here? It looks like the last letter of the address got truncated somehow. If so, I’m glad you found it valuable!
For what it’s worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don’t address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important.

DavidW 27 Feb 2023 22:43 UTC
1 point
0
in reply to: Noosphere89’s comment on: A case for capabilities work on AI as net positive
That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That’s definitely my mistake, sorry!
I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That’s important and hard to do!

DavidW 28 Feb 2023 21:50 UTC
5 points
1
in reply to: Daniel Kokotajlo’s comment on: Deceptive Alignment is <1% Likely by Default
Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced “deception” with “deceptive alignment” in both posts. Thanks for pointing that out!
I’m intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven’t thought about them nearly as much, and I don’t have strong intuition for how likely they are yet, so I’m choosing to stay focused on deceptive alignment for this sequence.

DavidW 2 Mar 2023 14:48 UTC
8 points
0
in reply to: Daniel Kokotajlo’s comment on: Order Matters for Deceptive Alignment
Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment.
#4:
What amount of understanding of the base goal is sufficient? What if the answer is “It has to be quite a lot, otherwise it’s really just a proxy that appears superficially similar to the base goal?” In that case the classic arguments for deceptive alignment would work fine.
TL;DR the model doesn’t have to explicitly represent “X, whatever that turns out to mean”, it just has to point at its best estimate of X` and that will update over time because the model doesn’t know there’s a difference.
I propose that the relevant factor here is whether the model’s internal goal is the closest thing it has to a representation of the training goal (X). I am assuming that models will have their goal information and decision parameters stored in the later layers and the world modeling overwhelmingly before the decision-making, because it doesn’t make much sense for a model to waste time world modeling (or anything else) after it makes its decision. I expect the proxy to be calculated based on high-level concepts from the world model, not separately from the world model.
Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X`). This model can’t tell the difference between the training goal and the internal proxy goal, so it can’t be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:
1. Update the existing proxy goal to be closer to the training goal, or
2. Create a new concept for the training goal and become deceptively aligned.
Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. That new concept would have to be a better representation of the training goal on the current training example than the existing concept in order to enable deceptive alignment, and therefore better training. So, I wouldn’t expect a hyper-local update to be sufficient to make the model deceptively aligned on its own. The gradients therefore have no reason to point toward creating this new concept. On the other hand, tweaks to improve the existing understanding of the training goal would improve performance immediately.
For example, let’s say we are training this model to follow directions from the prompt, unless they violate ethical norms (X`). The model understands that the training goal is to follow directions subject to some ethical norms but does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn’t understand the difference, it can’t act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X`). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and will therefore not be learned by gradient descent.
Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned.
A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and key ethical concepts. Early, short-term reward training should be able to point at those and refine the resulting concept. This should be the closest concept to the training goal, so it should fit better with my first example than my second. This would make deceptive alignment very unlikely.
Other than direct reward optimizers, I have trouble imagining what alternate proxy concept would be correlated enough with following directions subject to ethical considerations that it would be the internal goal late enough in the process for the model to have a long-term goal and situational awareness. Can you think of one? Having a more realistic idea for a proxy goal might make this discussion more concrete.
1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don’t like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3′s are chill about value drift and not particularly interested in taking over the world. Maybe I’m reading too much between the lines, but I’d say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble.
Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.
#2:
I guess this isn’t an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn’t count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn’t.
Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk.
#3:
Presumably the brain has some sort of SGD-like process for updating the synapses over time, that’s how we learn. It’s probably not exactly the same but still, couldn’t you run the same argument, and get a prediction that e.g., if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they’d grow up and go to college and live the rest of their life all for the sake of pursuing reward?
We know how the gradient descent mechanism works, because we wrote the code for that.
We don’t know how the mechanism for human value learning works. The idea that observed human value learning doesn’t match up with how gradient descent works is evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism for gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and how reward updates. But accurately predicting human behavior is much harder than that. If you try to change another person’s mind about their values, they will often resist your attempts openly and stick to their guns. Persuasion is generally difficult and not straightforward.
In a comment on my other post, you make an analogy of gradient descent for evolution. Evolution and individual human learning are extremely different processes. How could they both be relevant analogies? For what it’s worth, I think they’re both poor analogies.
If the analogy between gradient descent and human learning were useful, I’d expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For hypothetical TAI in fine-tuning, here’s the training setup:
1. Training goal: following directions subject to ethical considerations.
2. Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates on this in a roughly deterministic way.
3. Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information.
But I find the training set-up for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What’s the training goal? What’s the reward? It’s my impression that when people change their minds about things, it’s often mediated by key factors like persuasive argument, personality traits, and social proof. Reward circuitry probably is involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined.
Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others.
If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the reward signal parallel would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective.
It seems like this discussion would benefit if we consulted someone with expertise on human value formation.
5.
I’m not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I’d expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.
Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced “fuzzy targets” with “non-maximization targets”.
What links here?

DavidW 2 Mar 2023 15:54 UTC
2 points
−1
in reply to: Daniel Kokotajlo’s comment on: Deceptive Alignment is <1% Likely by Default
1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren’t the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like “maximize reward in the next hour or so.” Or maaaaaaybe “Do what humans watching you and rating your actions would rate highly,” though that’s a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.
I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”
With the level of LLM progress we already have, I think it’s time to move away from talking about this in terms of traditional RL where you can’t give the model instructions and just hope that it can learn based only on the feedback signal. Realistic training scenarios should include directional prompts. Do you agree?
I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology?
Doesn’t this prove too much though? Doesn’t it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?
Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.
I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn’t require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
As mentioned in my response to your other comment, humans seem to decide our values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups to fit in. This seems clearly beneficial for survival of genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?
Your definition of deception-relevant situational awareness doesn’t seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?
Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?
What links here?
- DavidW's comment on Deceptive Alignment is <1% Likely by Default by DavidW (EA Forum; 4 Mar 2023 17:02 UTC; 2 points)

DavidW 2 Mar 2023 16:43 UTC
1 point
0
in reply to: Adele Lopez’s comment on: A case for capabilities work on AI as net positive
This is an interesting point, but it doesn’t undermine the case that deceptive alignment is unlikely. Suppose that a model doesn’t have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn’t understand the correct abstraction, it can’t instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can’t be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction. The model’s goal will still point to that, and its alignment will improve. This should continue to happen until the base abstraction is correct. For more details, see my comment here.

DavidW 3 Mar 2023 18:48 UTC
3 points
3
in reply to: Adam Zerner’s comment on: Robin Hanson’s latest AI risk position statement
I recently made an inside view argument that deceptive alignment is unlikely. It doesn’t cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I’d love to hear what you think of it!

DavidW

Order Mat­ters for De­cep­tive Alignment

De­cep­tive Align­ment is <1% Likely by Default

Order Matters for Deceptive Alignment

Deceptive Alignment is <1% Likely by Default