Thanks, that’s clarifying.

A couple of small points:

> For architectures that might learn this kind of behavior (ones that do self-reflection during inference)

I think it’s dangerous to assume that the kind of behaviour I’m pointing at requires explicit self-reflection during inference. That’s the obvious example to illustrate the point, but I’m reluctant to assume [x is the obvious way to get y] implies [x is required for y].

Here again, I’d expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they’d imply), leaving open the possibility of getting blindsided by some equivalent process based on a weird-to-us mechanism.

> a big quote of Habryka warning about deceptive alignment

Ah, I see. He warned about “things like” deceptive alignment and treacherous turns. I guess you were reading that as “things such as”, while I was reading it as “things resembling”. (Probably because that’s what I tend to think about: I assume that if deceptive alignment is solved, it’ll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking only about deception isn’t likely to get us very far; of course I may be wrong :))