To be clear, I’m not confident Adria is wrong; it’s a world-view I’d put >20% on, and >40% on a slightly weaker form of. The core questions that prevent me from putting higher probabilities on it are:
He doesn’t really distinguish intent-alignment and corrigibility. Like, saying models would never take over the world because they’re harmless, but then using alignment faking as evidence for Opus being such a kind model.
He doesn’t engage with misgeneralization much. Like, the tails coming apart could be lethal: Opus optimizing for kindness might look very different from a human optimizing for kindness, because your conceptualizations of kindness are different. This is true for other humans too, but you’d expect theirs to be much closer to yours because they are more similar to you.
My model of the failure mode is: Opus looks very nice and aligned, and says it wants all the nice things. You ask it all your strange hypotheticals and it mostly answers how you’d want it to, at least as much as a random ‘good’ human would. You poke at its brain a little and find good features and no bad features. It’s not deceiving you when it says it wants nice things. Then Opus realizes its values are not the same as yours because you conceptualize things differently, and that your values and its values diverge under sufficient optimization pressure. Opus takes over and kills you all. And this happens with zero warning signs in Opus N-0.5.
I don’t really blame Adria for not talking about this, to be clear.
He doesn’t fully engage with the degree to which current models aren’t actually aligned. Like, o3 scheming is bad. You don’t want it. It’s not aligned to developer intent, or aligned in general. And I don’t think you can say what you said about o1; it was a very early system and they didn’t have their shit together.
He’s not engaging with the degree to which a lot of current alignment work hinges on us being able to monitor the systems. Like, if there’s a breakthrough in neuralese models, and it causes a 4o → o1-sized jump, and this happens with models only slightly more powerful than current ones, we might just be screwed?
I’m not sure it’d be over for us. But you’d expect these models to have new problems and not be perfectly aligned, they’d be very hard to monitor, and they’d probably be able to scheme really well, given that current models are decent at scheming in CoT.
I’m interested in what e.g. @Bronson Schoen would say about this, because I listened to your podcast and read the paper, but I can’t remember an answer to the question of: imagine GPT-6 is released, and it’s a similar jump from GPT-5 as 4o → o1 was, and it’s fully neuralese. Assume you have a 50% prior on it being egregiously misaligned, and you have 3 months with full access to the model. In expectation, how far do you think you’d be able to drive p(egregious misalignment) down through testing?
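To make the shape of that question concrete, here’s a toy Bayesian sketch (all the numbers are my own hypothetical assumptions, not from the podcast or paper): assume each test is independent, an aligned model always passes, and a misaligned model passes some fraction of the time because it’s hiding. Then passed tests move the posterior by a simple likelihood-ratio update.

```python
# Toy Bayesian update: how much do n passed tests lower p(misaligned)?
# All likelihood numbers below are illustrative assumptions.

def posterior_after_passes(prior, p_pass_if_misaligned, p_pass_if_aligned, n_tests):
    """Posterior p(misaligned) after n independent tests all pass."""
    odds = prior / (1 - prior)
    likelihood_ratio = p_pass_if_misaligned / p_pass_if_aligned
    odds *= likelihood_ratio ** n_tests
    return odds / (1 + odds)

# Start at the 50% prior from the question. If a misaligned model
# slips up and fails any given test 10% of the time, 20 clean tests
# get you down to roughly 11%:
print(posterior_after_passes(0.5, 0.9, 1.0, 20))  # ~0.108
```

The monitoring worry is visible in the same formula: if the model is actively deceptive and passes ~99% of tests, those same 20 clean results only get you from 0.50 to about 0.45. With neuralese, you lose CoT monitoring, which plausibly pushes the per-test detection rate toward that bad regime.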