I expect that we can make a false positive / false negative tradeoff.
So here you’re talking about situations where a false negative doesn’t have catastrophic consequences? Do you have reasons to believe catastrophic consequences are extremely unlikely beyond something like “it’s not clear to me how catastrophic consequences would happen”?
I agree we can find a reasonable balance (which will appear too conservative to most) to address non-catastrophic issues—but this doesn’t seem to help much. To the extent that it creates a false sense of security for more powerful systems, that’s bad.
E.g. this kind of language seems tailor-made for a false sense of security:
At some point, more than one AI developer will have strong, practical, field-tested RSPs. Their practices (and the people who have been implementing them) will then become immensely valuable resources for broader evals-based rules and norms.
Everything’s fine: we have “strong, practical, field-tested RSPs”! (of course people shouldn’t conclude that everything’s fine—but they could be forgiven for asking why we weren’t much clearer about the inadequacy of such RSPs ahead of time)
I think that it is so hard for an AI trained on predictive loss to deliberately deceive us that evaluating its capabilities is feasible
We don’t need “deliberately deceive us” for something as bad as deceptive alignment. We only need to be robustly mistaken, not deliberately deceived.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
If (have influence over adequate resources):
    Search widely for ways to optimize for x.
Else:
    Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn’t [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that’s enough for us to be screwed. (of course I expect something like [both “have influence over adequate resources” and “find better ways to aim for x” to be processes that are used at lower levels by heuristic-for-x])
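As a purely illustrative sketch (every name here is hypothetical, and nothing is claimed about how real training dynamics would implement this), the conditional policy above could look like:

```python
# Toy illustration of the conditional policy described above.
# The heuristic branch behaves well on everything seen in training;
# the "search widely" branch is never triggered there, so no test that
# stays within the training distribution can distinguish this policy
# from a genuinely aligned one.

def has_adequate_resources(situation):
    # Hypothetical predicate: does the model have influence over
    # adequate resources in this situation?
    return situation.get("resources", 0) >= 1000

def heuristic_for_x(situation):
    # Proxy behaviour that looks aligned on the training and test set.
    return "benign action"

def search_widely_for_x(situation):
    # Open-ended optimization for x, unconstrained by the heuristic.
    return "whatever best achieves x"

def policy(situation):
    if has_adequate_resources(situation):
        return search_widely_for_x(situation)
    return heuristic_for_x(situation)

# During training and evals the resource threshold is never reached:
assert policy({"resources": 5}) == "benign action"
# Deployed with sufficient influence, the same policy does something else:
assert policy({"resources": 10**6}) == "whatever best achieves x"
```

The point of the sketch is only that the two branches agree everywhere we ever looked, so behavioural testing alone cannot rule the bad branch out.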
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it’s not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don’t claim an example of this particular form is probable.
I do claim that knowing there’s no deliberate deception is insufficient—and more generally that we’ll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
Are you 95% on [by … we can sidestep all problems of the form [this thing looked safe according to tests we believed were thorough, but later it killed us anyway]]? (I first wrote ‘aligned’ there, but I meant ‘safe’; apologies.) If not, sidestepping deception doesn’t look particularly important. If so, I remain confused by your level of confidence.
So here you’re talking about situations where a false negative doesn’t have catastrophic consequences?
No, we’ll have to make false positive / false negative tradeoffs about ending the world as well. We’re unlucky like that.
I agree that false sense of security / safetywashing is a potential use of this kind of program.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
If (have influence over adequate resources):
    Search widely for ways to optimize for x.
Else:
    Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn’t [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that’s enough for us to be screwed. (of course I expect something like [both “have influence over adequate resources” and “find better ways to aim for x” to be processes that are used at lower levels by heuristic-for-x])
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it’s not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don’t claim an example of this particular form is probable.
I do claim that knowing there’s no deliberate deception is insufficient—and more generally that we’ll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting—being able to search for what states of self-reflection would encourage them to display high capabilities.
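A toy sketch of the search idea above (everything here is hypothetical: a real version would search over latent activations or soft prompts of an actual model, not over a scalar):

```python
import random

# Toy "latent prompting" search: instead of trusting the capability a
# model displays under one default conditioning state, search over many
# candidate states and report the maximum capability elicited.

def displayed_capability(latent_state):
    # Hypothetical stand-in for: condition the model on `latent_state`,
    # run the capability eval, and score the result. Here, capability
    # peaks at an arbitrary state the naive eval never visits.
    return 1.0 - abs(latent_state - 0.73)

def elicited_capability(n_samples=10_000, seed=0):
    rng = random.Random(seed)
    # Report the *maximum* over sampled states: the model is as capable
    # as its most capable elicitable behaviour, not its default one.
    return max(displayed_capability(rng.random()) for _ in range(n_samples))

default_score = displayed_capability(0.0)  # what a single-prompt eval sees
searched_score = elicited_capability()     # what the search reveals
assert searched_score > default_score
```

The design point is just that a max over elicitation states can be far above the score under any one default state, which is why capability evals that skip the search step can badly underestimate.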
I’m somewhat more confident in our ability to think of things to test for. If an AI has “that spark of generality,” it can probably figure out how to strategically deceive humans and hack computers and other obvious danger signs.
If not, sidestepping deception doesn’t look particularly important. If so, I remain confused by your level of confidence.
I retain the right to be confident in unimportant things :P
Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.
Thanks, that’s clarifying.
A couple of small points:
For architectures that might learn this kind of behavior (ones that do self-reflection during inference)
I think it’s dangerous to assume that the kind of behaviour I’m pointing at requires explicit self-reflection during inference. That’s the obvious example to illustrate the point—but I’m reluctant to assume [x is the obvious way to get y] implies [x is required for y].
Here again, I’d expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they’d imply), leaving the possibility of getting blind-sided by some equivalent process based on a weird-to-us mechanism.
a big quote of Habryka warning about deceptive alignment
Ah, I see. He warned about “things like” deceptive alignment and treacherous turns. I guess you were thinking “things such as”, and I was thinking “things resembling”. (probably because that’s what I tend to think about—I assume that if deceptive alignment is solved it’ll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking about only deception isn’t likely to get us very far; of course I may be wrong :))