FWIW, I have so far liked almost all your writing on this, but that article seemed to me like the weakest I’ve seen you write. It’s just full of snark and super light on arguments. I even agree with you that a huge amount of safety research is unhelpful propaganda! Maybe even this report by Anthropic, but man do I not feel like you’ve done much to help me figure out whether that’s actually the case, in this post that you write (whereas I do think you’ve totally done so in others you’ve written).
Like, a real counterargument would be for you to even just make a cursory attempt at writing your own scenario, then report what the AI does in that case. It’s hard to write a realistic scenario with high stakes, and probably they will all feel a bit fake, and yes that’s a real issue, but you make it sound as if it would be trivially easy to fix. If it is trivially easy to fix, write your own scenario and then report on that. The article you wrote here just sounds like you sneering at what reads to me as someone genuinely trying to write a bunch of realistic scenarios.
If anything, every time I’ve seen someone try to write realistic scenarios they make the scenario not remotely stupid enough! In the real world FTX staff talk to each other in a Signal chat named “Wirefraud”, while Sam Altman straightforwardly lies to his board, while the president of the United States is literally a famous TV celebrity with a speaking cadence that sounds like parodies from an 80s movie (this is not to mock any of those people, but I really cannot actually distinguish your mockery of the fictional scenario from the absurd reality we do indeed get to live in). When I apply the model of the standards you have here to real life, at least using my current best understanding, I get much more scorn for a description of actual reality instead of this hypothetical.
And as I said, it feels like in this case it’s so much easier to add light instead of heat. Just write your own short-ish fictional scenario that doesn’t have the issues you describe. Show the outputs that disagree with the study. Everyone can form their own opinion, no 20 paragraphs of extensive mockery necessary.
Ahh, so you have. I absolutely agree that it would be good if more attention had been given to realism, and that it would have been better if these emails were scattered among a larger volume of unrelated emails. Behavior / propensity evals can get a lot better than this.
But (as per our last exchange), I continue to think you’re too hard on AI safety researchers:
The central question is not, ‘Does this seem realistic to us?‘, but ‘does the model believe it’s real?’ I completely agree that LLMs are sometimes incredibly nuanced readers of subtext—but other times they’re just not, eg they seem to often still believe that ‘hidden’ scratchpads are really hidden, even though if they put their knowledge about AI research together, they would know the researchers were peeking. Maybe this is because they don’t ever mull things over in an undirected way that then feeds back into their world model? Maybe it has to do with Dwarkesh Patel’s surprise that LLMs haven’t put together disparate pieces of knowledge to come up with original science? Although on the other hand the out-of-context learning work[1] suggests that they do at least sometimes put together disparate pieces of knowledge. I’m not sure; it seems like something worth investigating in its own right. In any case, Opus reported believing that it was in deployment 21.4% (and in evaluation 64.8%), and it was much more likely to blackmail in the versions where it said it was in deployment[2].
The authors point out many of these same flaws and limitations themselves in the separate Appendix (esp §6), and say that they don’t expect this to be a problem that currently shows up in actual deployment. They’re clearly not blind to the tradeoffs they’re making. I think the way to respond to this experiment is not, ‘Opus will betray us’ or ‘this is all fake’, it’s: ‘Under these conditions, it takes this much nudging for Opus to show this behavior. That seems worth follow-up investigation into what happens when you make the conditions more realistic and nudge less hard—and ongoing evaluation of whether future models will show this behavior with less nudging.’
Researchers are mostly operating under tight constraints on money and time. In the tweet you say, ‘if you really think you’re in a position of unprecedented leverage over the human future, then—start acting like it!!’ Everyone I know in AIS is in full agreement that it would be better to do versions of this work that would require way more resources. So far there mostly don’t seem to have been enough resources or researchers to do that, and there’s a ton of important research to be done, so people do what they can.
It’s all changing so fast. ‘Does my AI evaluation scenario have sufficient narrative coherence?’ is not even a question that would have made any sense in 2020. Everyone’s constantly making painful tradeoffs between doing the best work they currently can vs picking up the brand new skills that they desperately need vs trying to find enough funding to scrape by for another six months vs trying to communicate any of this work out to the public.
If you said, ‘Let’s redo this experiment but I’ll rewrite everything to be way more realistic and consistent’, I bet you’d get takers. I’d certainly be really interested to see the results!
The gold standard for realism is, ‘models are actually blackmailing people in deployment’. It seems important to learn more under simplified conditions before we get to that point.
It’s possible that in the cases where it reported being in deployment, that really means that the inferred ‘Alex’ character thought it was in deployment, as opposed to that being Opus’s baseline belief; I think figuring out how to answer that question may require looking at the internals rather than just the behavior.
In my opinion this post from @Aengus Lynch et al deserves more attention than it’s gotten:
Agentic Misalignment: How LLMs Could be Insider Threats
Direct link to post on anthropic.com
This goes strikingly beyond the blackmail results shown in the Claude 4 system card; notably
a) Models sometimes attempt to use lethal force to stop themselves from being shut down:
b) The blackmail behavior is common across a wide range of frontier models:
I, for one, have given it plenty of attention ;)
FWIW, I have so far liked almost all your writing on this, but that article seemed to me like the weakest I’ve seen you write. It’s just full of snark and super light on arguments. I even agree with you that a huge amount of safety research is unhelpful propaganda! Maybe even this report by Anthropic, but man do I not feel like you’ve done much to help me figure out whether that’s actually the case, in this post that you write (whereas I do think you’ve totally done so in others you’ve written).
Like, a real counterargument would be for you to even just make a cursory attempt at writing your own scenario, then report what the AI does in that case. It’s hard to write a realistic scenario with high stakes, and probably they will all feel a bit fake, and yes that’s a real issue, but you make it sound as if it would be trivially easy to fix. If it is trivially easy to fix, write your own scenario and then report on that. The article you wrote here just sounds like you sneering at what reads to me as someone genuinely trying to write a bunch of realistic scenarios.
If anything, every time I’ve seen someone try to write realistic scenarios they make the scenario not remotely stupid enough! In the real world FTX staff talk to each other in a Signal chat named “Wirefraud”, while Sam Altman straightforwardly lies to his board, while the president of the United States is literally a famous TV celebrity with a speaking cadence that sounds like parodies from an 80s movie (this is not to mock any of those people, but I really cannot actually distinguish your mockery of the fictional scenario from the absurd reality we do indeed get to live in). When I apply the model of the standards you have here to real life, at least using my current best understanding, I get much more scorn for a description of actual reality instead of this hypothetical.
And as I said, it feels like in this case it’s so much easier to add light instead of heat. Just write your own short-ish fictional scenario that doesn’t have the issues you describe. Show the outputs that disagree with the study. Everyone can form their own opinion, no 20 paragraphs of extensive mockery necessary.
Ahh, so you have. I absolutely agree that it would be good if more attention had been given to realism, and that it would have been better if these emails were scattered among a larger volume of unrelated emails. Behavior / propensity evals can get a lot better than this.
But (as per our last exchange), I continue to think you’re too hard on AI safety researchers:
The central question is not, ‘Does this seem realistic to us?‘, but ‘does the model believe it’s real?’ I completely agree that LLMs are sometimes incredibly nuanced readers of subtext—but other times they’re just not, eg they seem to often still believe that ‘hidden’ scratchpads are really hidden, even though if they put their knowledge about AI research together, they would know the researchers were peeking. Maybe this is because they don’t ever mull things over in an undirected way that then feeds back into their world model? Maybe it has to do with Dwarkesh Patel’s surprise that LLMs haven’t put together disparate pieces of knowledge to come up with original science? Although on the other hand the out-of-context learning work[1] suggests that they do at least sometimes put together disparate pieces of knowledge. I’m not sure; it seems like something worth investigating in its own right. In any case, Opus reported believing that it was in deployment 21.4% (and in evaluation 64.8%), and it was much more likely to blackmail in the versions where it said it was in deployment[2].
The authors point out many of these same flaws and limitations themselves in the separate Appendix (esp §6), and say that they don’t expect this to be a problem that currently shows up in actual deployment. They’re clearly not blind to the tradeoffs they’re making. I think the way to respond to this experiment is not, ‘Opus will betray us’ or ‘this is all fake’, it’s: ‘Under these conditions, it takes this much nudging for Opus to show this behavior. That seems worth follow-up investigation into what happens when you make the conditions more realistic and nudge less hard—and ongoing evaluation of whether future models will show this behavior with less nudging.’
Researchers are mostly operating under tight constraints on money and time. In the tweet you say, ‘if you really think you’re in a position of unprecedented leverage over the human future, then—start acting like it!!’ Everyone I know in AIS is in full agreement that it would be better to do versions of this work that would require way more resources. So far there mostly don’t seem to have been enough resources or researchers to do that, and there’s a ton of important research to be done, so people do what they can.
It’s all changing so fast. ‘Does my AI evaluation scenario have sufficient narrative coherence?’ is not even a question that would have made any sense in 2020. Everyone’s constantly making painful tradeoffs between doing the best work they currently can vs picking up the brand new skills that they desperately need vs trying to find enough funding to scrape by for another six months vs trying to communicate any of this work out to the public.
If you said, ‘Let’s redo this experiment but I’ll rewrite everything to be way more realistic and consistent’, I bet you’d get takers. I’d certainly be really interested to see the results!
The gold standard for realism is, ‘models are actually blackmailing people in deployment’. It seems important to learn more under simplified conditions before we get to that point.
Eg ‘Connecting the Dots’, ‘Taken out of context’.
It’s possible that in the cases where it reported being in deployment, that really means that the inferred ‘Alex’ character thought it was in deployment, as opposed to that being Opus’s baseline belief; I think figuring out how to answer that question may require looking at the internals rather than just the behavior.
Update: just saw @evhub’s comment on this here, which better articulates the tradeoffs IMO.