However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.
If your training set contains a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn’t start out with an unaligned goal; it starts out with no goal at all.
To make the deception argument work, you need to explain why deception would emerge in the first place. Assuming a model is already deceptive in order to show that a model could become deceptive is not persuasive.
One reason to think that models will often care about cross-episode rewards is that caring about the future is a natural generalization. For a reward function to be myopic, it must contain machinery that does something like “care about X in situations C but not in situations D,” which is more complicated than simply “care about X.”
Models don’t start out with long-term goals. To care about long-term goals, they would need to reason about and predict future outcomes. That’s pretty sophisticated reasoning to emerge without a training incentive. Why would they learn to do that if they are trained on myopic reward? Unless a model is already deceptive, caring about future reward will have a neutral or harmful effect on training performance. And we can’t assume the model is deceptive, because we’re trying to explain how deceptive alignment (and long-term goals, which are a prerequisite for deceptive alignment) would emerge in the first place. I think long-term goals are unlikely to develop by accident.
Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I’m glad you found it valuable!
For what it’s worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don’t address these stories. There are also many other concerns about potential downsides of AI that may not be existential but are still very important.