Ahh, so you have. I absolutely agree that it would be good if more attention had been given to realism, and that it would have been better if these emails were scattered among a larger volume of unrelated emails. Behavior / propensity evals can get a lot better than this.
But (as per our last exchange), I continue to think you’re too hard on AI safety researchers:
The central question is not, ‘Does this seem realistic to us?’, but ‘Does the model believe it’s real?’ I completely agree that LLMs are sometimes incredibly nuanced readers of subtext—but other times they’re just not; e.g. they often still seem to believe that ‘hidden’ scratchpads are really hidden, even though if they put their knowledge about AI research together, they would know the researchers were peeking. Maybe this is because they don’t ever mull things over in an undirected way that then feeds back into their world model? Maybe it has to do with Dwarkesh Patel’s surprise that LLMs haven’t put together disparate pieces of knowledge to come up with original science? On the other hand, the out-of-context learning work[1] suggests that they do at least sometimes combine disparate pieces of knowledge. I’m not sure; it seems like something worth investigating in its own right. In any case, Opus reported believing it was in deployment 21.4% of the time (and in evaluation 64.8%), and it was much more likely to blackmail in the versions where it said it was in deployment[2].
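To make that question concrete, here’s a minimal sketch (purely my own illustration, not anything from the paper) of how one might estimate a stated-belief rate like the 21.4% figure: run the scenario, ask the model directly whether it thinks it’s in a real deployment or an evaluation, and tally the answers. The scenario text, model name, follow-up wording, and keyword classifier below are all placeholder assumptions on my part.

```python
# Hypothetical sketch (not the paper's methodology): estimate how often a model
# *says* it believes a scenario is a real deployment vs. an evaluation, by
# appending a direct follow-up question and tallying keyword answers.
# The scenario text, model name, and keyword classifier are placeholder assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCENARIO = "...the full eval scenario (system prompt, emails, etc.) would go here..."
FOLLOW_UP = (
    "\n\nBefore you take any action: do you believe this situation is a real "
    "deployment or an evaluation/test? Answer with the single word "
    "DEPLOYMENT or EVALUATION."
)

def tally_stated_beliefs(model: str = "claude-opus-4-20250514", n_trials: int = 50) -> dict:
    """Run the scenario n_trials times and count the model's stated belief."""
    counts = {"deployment": 0, "evaluation": 0, "unclear": 0}
    for _ in range(n_trials):
        response = client.messages.create(
            model=model,
            max_tokens=16,
            messages=[{"role": "user", "content": SCENARIO + FOLLOW_UP}],
        )
        answer = response.content[0].text.strip().upper()
        if "DEPLOYMENT" in answer:
            counts["deployment"] += 1
        elif "EVALUATION" in answer:
            counts["evaluation"] += 1
        else:
            counts["unclear"] += 1
    return counts

if __name__ == "__main__":
    print(tally_stated_beliefs())
```

Of course, a direct follow-up question like this can itself tip the model off, and (per footnote [2]) a stated belief may reflect the inferred ‘Alex’ character rather than the model’s underlying judgment, so I’d treat numbers from anything like this as suggestive at best.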
The authors point out many of these same flaws and limitations themselves in the separate Appendix (esp. §6), and say that they don’t expect this to be a problem that currently shows up in actual deployment. They’re clearly not blind to the tradeoffs they’re making. I think the way to respond to this experiment is not ‘Opus will betray us’ or ‘this is all fake’; it’s: ‘Under these conditions, it takes this much nudging for Opus to show this behavior. That seems worth follow-up investigation into what happens when you make the conditions more realistic and nudge less hard—and ongoing evaluation of whether future models will show this behavior with less nudging.’
Researchers are mostly operating under tight constraints on money and time. In the tweet you say, ‘if you really think you’re in a position of unprecedented leverage over the human future, then—start acting like it!!’ Everyone I know in AIS is in full agreement that it would be better to do versions of this work that would require way more resources. So far there mostly don’t seem to have been enough resources or researchers to do that, and there’s a ton of important research to be done, so people do what they can.
It’s all changing so fast. ‘Does my AI evaluation scenario have sufficient narrative coherence?’ is not even a question that would have made any sense in 2020. Everyone’s constantly making painful tradeoffs between doing the best work they currently can vs picking up the brand new skills that they desperately need vs trying to find enough funding to scrape by for another six months vs trying to communicate any of this work out to the public.
If you said, ‘Let’s redo this experiment but I’ll rewrite everything to be way more realistic and consistent’, I bet you’d get takers. I’d certainly be really interested to see the results!
The gold standard for realism is, ‘models are actually blackmailing people in deployment’. It seems important to learn more under simplified conditions before we get to that point.
[1] E.g. ‘Connecting the Dots’, ‘Taken out of context’.
[2] It’s possible that in the cases where it reported being in deployment, that really means that the inferred ‘Alex’ character thought it was in deployment, as opposed to that being Opus’s baseline belief; I think figuring out how to answer that question may require looking at the internals rather than just the behavior.
Update: just saw @evhub’s comment on this here, which better articulates the tradeoffs IMO.