I consider this advice actively harmful, and strongly advise to do the opposite.
Let’s start at the top:
So far so good. There are obvious reasons why it would make sense to focus exclusively on cases where an AI seems aligned to human observers and passes all relevant checks while still not actually being aligned. After all, in the cases where we can see the problem, we either fix it or at least iterate until we can’t see a problem any more (at which point we have “deception” by this definition).
But then the post immediately jumps to a far narrower definition of “deceptive alignment”:

In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.

… In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be.
If we look at e.g. the cases in Worlds Where Iterative Design Fails, most of them fit the more-general definition, yet none necessarily involve an AI which is “actively adversarial but pretends not to be”. And that’s exactly the sort of mistake people make when they focus exclusively on a single failure mode: they end up picturing a much narrower set of possibilities than the argument for focusing on that failure mode actually assumes.
Now, an oversight like that undermines the case for “focus only/mostly on deceptive alignment”, but doesn’t make it very actively harmful. The reason it’s actively harmful is unknown unknowns.
Unknown Unknowns
Claim: the single most confident prediction we can make about AGI is that there will be surprises. There will be unknown unknowns. There will be problems we do not currently see coming.
The thing which determines humanity’s survival will not be whether we solve alignment in whatever very specific world, or handful of worlds, we imagine to be most probable. What determines humanity’s survival will be whether our solutions generalize widely enough to handle the things the world actually throws at us, some of which will definitely be surprises.
How do we build solutions which generalize to handle surprises? Two main ways. First, understanding things deeply and thoroughly enough to enumerate every single assumption we’ve made within some subcomponent (i.e. mathematical proofs). That’s great when we can do it, but it will not cover everything or even most of the attack surface in practice. So, the second way to build solutions which generalize to handle surprises: plan for a wide variety of scenarios. Planning for a single scenario—like e.g. an inner agent emerging during training which is actively adversarial but pretends not to be—is a recipe for generalization failure once the world starts throwing surprises at us.
Psychologizing
At this point I’m going to speculate a bit about your own thought process which led to this post; obviously such speculation can easily miss completely and you should feel free to tell me I’m way off.
First, I notice that nowhere in this post do you actually compare deceptive alignment to anything else I’d consider an important-for-research alignment failure mode (like fast takeoff, capability gain in deployment, getting what we measure, etc). You just argue that (a rather narrow version of) deception is important, not that it’s more important than any of the other failure modes I actually think about. I also notice in the “implications” section:

I should ask “how does this help with deceptive alignment?” before starting a new project. In retrospect, I haven’t done that a lot and I think most of my projects, therefore, have not contributed a lot to this question.
What this sounds like to me is that you did not previously have any realistic model of how/why AI would be dangerous. I’m guessing that you were previously only thinking about problems which could be fixed by iterative design—i.e. seeing what goes wrong and then updating the design accordingly. Probably (a narrow version of) deception is the first scenario where you’ve realized that doesn’t work, and you haven’t yet thought of other ways for an iterative design cycle to fail to produce aligned AI.
So my advice would be to brainstorm other things “deception” (in the most general sense) could look like, or other ways the iterative design cycle could fail to produce aligned AI, and try to aim your brainstorming at scenarios which are as different as possible from the things you’ve already thought of.
To be clear, I agree that unknown unknowns are in some sense the biggest problem in AI safety—as I talk about in the very first paragraph here.
However, I nevertheless think that focusing on deceptive alignment specifically makes a lot of sense. If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal” (where training signal doesn’t necessarily mean the literal loss, just anything we’re trying to get it to do), then I think most (though not all) AI existential risk scenarios that aren’t solved by iterative design/standard safety engineering/etc. include that as a component. Certainly, I expect that all of my guesses for exactly how deceptive alignment might be developed, what it might look like internally, etc. are likely to be wrong—and this is one of the places where I think unknown unknowns really become a problem—but I still expect that if we’re capable of looking back and judging “was deceptive alignment part of the problem here” in situations where things go badly we’ll end up concluding yes (I’d probably put ~60% on that).
Furthermore, I think there’s a lot of value in taking the most concerning concrete problems that we can yet come up with and tackling them directly. Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research—and for obvious reasons I think it’s most valuable to start with the most concerning concrete failure modes we’re aware of. It’s extremely hard to do good work on unknown unknowns directly—and additionally I think our modal guess for what such unknown unknowns might look like is some variation of the sorts of problems that already seem the most damning. Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it’s pretty important to have some idea of what we might want to use those sorts of tools for when developing them, and working on concrete failure modes is extremely important to that.
That’s “relatively broad”??? What notion of “deceptive alignment” is narrower than that? Roughly that definition is usually my stock example of a notion of deception which is way too narrow to focus on and misses a bunch of the interesting/probable/less-correlated failure modes (like e.g. the sort of stuff in Worlds Where Iterative Design Fails).
Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research … Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it’s pretty important to have some idea of what we might want to use those sorts of tools for when developing, and working on concrete failure modes is extremely important to that.

This I agree with, but I think it doesn’t go far enough. In my software engineering days, one of the main heuristics I recommended was: when building a library, you should have a minimum of three use cases in mind. And make them as different as possible, because the library will inevitably end up being shit for any use case way out of the distribution your three use cases covered.
Same applies to research: minimum of three use cases, and make them as different as possible.
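To make the heuristic concrete, here is a toy sketch (entirely hypothetical, not from the thread) of designing a small retry helper against three deliberately different use cases. Each use case pulls the API in a different direction — which exceptions to retry, and what to do on final failure — so designing against all three at once keeps the interface from overfitting to any one of them:

```python
import time

def retry(fn, *, attempts=3, delay=0.0, retry_on=(Exception,), on_fail=None):
    """Call fn() up to `attempts` times; return its result, or on_fail()'s."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                if on_fail is not None:
                    return on_fail()
                raise
            time.sleep(delay)

# Use case 1: a flaky call that eventually succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"
assert retry(flaky) == "ok"

# Use case 2: only specific exceptions should be retried; bugs surface at once.
def fatal():
    raise ValueError("bug, don't retry")
try:
    retry(fatal, retry_on=(IOError,))
except ValueError:
    pass  # propagated immediately, as desired

# Use case 3: the caller wants a fallback value instead of an exception.
assert retry(lambda: 1 / 0, retry_on=(ZeroDivisionError,), on_fail=lambda: -1) == -1
```

Any one of these use cases alone would have produced a narrower, less reusable signature; the diversity is what forces the generality.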
Any definition that makes mention of the specific structure/internals of the model.
Thanks for the comment. It definitely pointed out some things that weren’t clear in my post and head. Some comments:
1. I think your section on psychologizing is fairly accurate. I previously didn’t spend a lot of time thinking about how my research would reduce the risks I care about and my theories of change were pretty vague. I plan to change that now.
2. I am aware of other failure modes such as fast takeoffs, capability gains in deployment, getting what we measure, etc. However, I feel like all of these scenarios get much worse/harder when deception is at play, e.g. fast takeoffs are worse when they are unnoticed and getting what we measure likely leads to worse outcomes if it is hidden. I would really think of them as orthogonal, e.g. getting what we measure could happen in a deceptive or non-deceptive way. But I’m not sure this is a correct framing.
3. It is correct that my definition of deception is inconsistent throughout the article. Thanks for pointing this out. I think it is somewhere between “It’s bad if something happens in powerful AIs that we don’t understand” and “It’s bad if there is an active adversary trying to deceive us”. I’ll need to think about this for longer.
4. Unknown unknowns are a problem. I think my claim as presented in the post is stronger than I originally intended. However, I think the usefulness of foundational research such as yours comes to a large extent from the fact that it increases our understanding of an AI system in general, which then allows us to prevent failure modes (many of which relate to deception).
I’ll try to update the post to reflect some of the discussion and my uncertainty better. Thanks for the feedback.
Something I’d add to “plan for a wide variety of scenarios” is to look for solutions that do not refer to those scenarios. A solution involving [and here we test for deceptive alignment (DA)] is going to generalise badly (even assuming such a test could work), but so too will a solution involving [and here we test for DA, x, y, z, w...].
This argues for not generating all our problem scenarios ahead of time: it’s useful to have a test set. If the solution I devise after only thinking about x and y also works for z and w, then I have higher confidence in it than if I’d generated x, y, z, w before I started looking for a solution.
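In machine-learning terms, this is a train/test split over failure scenarios. A purely illustrative sketch (all names hypothetical): design a check while thinking only about scenarios x and y, then evaluate it on held-out scenarios z and w generated afterwards:

```python
def looks_safe(behavior):
    # Hypothetical check, designed while thinking only about the two
    # "design" scenarios below: flag unrequested or concealed actions.
    return behavior["requested"] and not behavior["hidden"]

# Scenarios used while designing the check:
x = {"requested": True,  "hidden": False}   # benign, transparent action
y = {"requested": False, "hidden": False}   # unrequested action
assert looks_safe(x) and not looks_safe(y)

# Held-out scenarios, generated only after the check was fixed:
z = {"requested": True,  "hidden": True}    # requested but concealed
w = {"requested": False, "hidden": True}    # unrequested and concealed
assert not looks_safe(z) and not looks_safe(w)
```

The check passing on z and w, which played no role in its design, is stronger evidence of generalization than passing on x and y, exactly as with a held-out test set.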
For this reason, I’m not so pessimistic about putting a lot of effort into solving DA. I just wouldn’t want people to be thinking about DA-specific tests or DA-specific invariants.