No, the agents were not trying to get high reward as far as I know.
They were just trying to do the task and thought they were being clever by gaming it. I think this convinced me “these agency tasks in training will be gameable” more than “AI agents will reward hack in a situationally aware way.” I don’t think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like “sycophancy to subterfugure.”
I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability level.
Deliberate reward maximization should generalize better than normal specification gaming in principle -- and AI agents know plenty about how ML training works.
joshc
I think a bioshelter is more likely to save your life fwiw. you’ll run into all kinds of other problems in the arctic
It don’t think it’s hard to build biosheletrs. If you buy one now, you’ll prob get it in 1 year.
If you are unlikely and need it earlier, there are DIY ways to build them before then (but you have to buy stuff in advance)
> I also don’t really understand what “And then, in the black rivers of its cognition, this shape morphed into something unrecognizable.” means. Elaboration on what this means would be appreciated.
Ah I missed this somehow on the first read.
What I mean is that the propensities of AI agents change over time -- much like how human goals change over time.
Here’s an image:
This happens under three conditions:
- Goals randomly permute at some non-trivial (but still possibly quite low) probability.
- Goals permuted in dangerous directions remain dangerous
- All this happens opaquely in the model’s latent reasoning traces, so it’s hard to notice.
These conditions together imply that the probability of misalignment increases with every serial step of computation.
AI agents don’t do a lot of serial computation right now, so I’d guess this becomes much more of a problem over time.
Yes, at least 20% likely. My view is that the probability of AI takeover is around 50%.
In my time at METR, I saw agents game broken scoring functions. I’ve also heard that people at AI companies have seen this happen in practice too, and it’s been kind of a pain for them.
AI agents seem likely to be trained on a massive pile of shoddy auto-generated agency tasks with lots of systematic problems. So it’s very advantageous for agents to learn this kind of thing.
The agents that go “ok how do I get max reward on this thing, no stops, no following-human-intent bullshit” might just do a lot better.
Now I don’t have much information about whether this is currently happening, and I’ll make no confident predictions about the future, but I would not call agents that turn out this way “very weird agents with weird goals.”
I’ll also note that this wasn’t an especially load bearing mechanism for misalignment. The other mechanism was hidden value drift, which I also expect to be a problem absent potentially-costly countermeasures.
One problem with this part (though perhaps this is not the problem @Shankar Sivarajan is alluding to), is that congress hasn’t declared war since WWII and typically authorizes military action in other ways, specifically via Authorizations for Use of Military Force (AUMFs).
I’ll edit the story to say “authorizes war.”
X user @ThatManulTheCat created a high-quality audio version, if you prefer to read long stories like this in commute:
https://x.com/ThatManulTheCat/status/1890229092114657543
Seems like a reasonable idea.
I’m not in touch enough with popular media to know:
- Which magazines are best to publish this kind of thing if I don’t want to contribute to political polarization
- Which magazines would possibly post speculative fiction like this (I suspect most ‘prestige magazines’ would not)
If you have takes on this I’d love to hear them!
yeah i would honestly be kind of surprised if I saw a picket sign that says ’AI for whom’
I have not heard an American person use the word whom in a long time
> It can develop and deploy bacteria, viruses, molds, mirror life of all three types
This is what I say it does.
I agree it would have been just as realistic if everyone died.
But I think the outcomes where many humans survive are also plausible, and under-appreciated. Most humans have very drifty values, and yet even the most brutally power-seeking people often retain a ‘grain of morality.‘
Also, this outcome allowed me to craft a more bittersweet ending that I found somehow more convincingly depressing than ‘and then everyone dies.’
No
• It’s possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components of research. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can’t defer to them.
Agree, I expect the handoff to AI agents to be somewhat incremental (AI is like an intern, a new engineer, a research manager, and eventually, a CRO)
I’m pretty sympathetic to these arguments. I agree a lot of the risk comes from sycophantic AI systems producing bad research because humans are bad at evaluating research. This is part of why I spend most of my time developing safety evaluation methodologies.
On the other hand, I agree with Buck that scheming-like risks are pretty plausible and a meaningful part of the problem is also avoiding egregious sabotage.
I don’t think I agree with your claim that the hope of control is that “early transformative AI can be used to solve the hard technical problems of superintelligence.” I think the goal is to instead solve the problem of constructing successors that are at least as trustworthy as humans (which is notably easier).
I think scheming is perhaps the main reason this might end up being very hard—and so conditioned on no-alignment-by-default at top expert dominating capabilities, I put a lot of probability mass on scheming/egregious sabotage failure modes.
Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.
My initial read was that indistinguishability was used to justify that the trusted monitor couldn’t distinguish the deployment distribution.
I’m not sure if that’s what Roger meant—if so, it’s still a bit confusing since then he wouldn’t have needed to call the monitor ‘trusted.’
Some comments I left as a reply to Roger Grosse on Twitter:
I’m glad you put this out. These examples align with my guess as to roughly what arguments will end up being most load bearing for mitigating scheming (white-box monitoring, black-box monitoring, and evidence against scheming being rewarded).
Here are some comments, since you mentioned comments would be helpful. I’ve bolded the comments that I think would be most useful for you to read.
Safety case sketch #1:
- There are some additional claims that this safety case relies on that could be good to mention explicitly. e.g. “egregious behavior would be reliably identified when developers analyze flagged sessions” and “developers will become aware of and stop unacceptable behavior *before* it leads to harmful outcomes.”
- “Given our mitigations, the model cannot strategically condition its apparent capabilities, motivations, or behavioral proclivities on whether it’s in testing or deployment without being caught” ← I’m confused about why this claim is important given that you are monitoring at deployment-time. Even if the model is conditioning its behavior, my understanding is that you are hoping to catch it when it defects.
- Much of the evidence for the effectiveness of your white-box monitoring comes from your model organisms, of which you only have three. Three examples doesn’t give you a lot of statistical power. Naively, I’d think there’s still roughly a one third probability that your white-box monitoring completely fails in the actual deployment context even if it succeeds on three models.
Safety case sketch #2:
- You mention external deployments, but I didn’t see discussion of jailbreaking.
- “[The safety case sketch] may require expensive pre-deployment testing if one needs to generate a very large number of simulated deployment situations in order to achieve a sufficiently good bound on the probability of defection.” This sounds impractical. My guess is that you’ll need to continuously sample real deployment trajectories to generate your honeypots (where a honeypot is a inserted dangerous action for your monitoring to flag). Otherwise, it seems unlikely to me that you can verify indistinguishability even with current models.
Safety case sketch #3:
- It’s possible that all of the models that you experiment with are opaquely scheming. i.e. instead of comparing reward attained by models that engage in “deceptive reasoning” vs “no deceptive reasoning,” you might be comparing models that engage in “CoT deceptive reasoning” vs “opaque deceptive reasoning.” This experiment still sheds light on how helpful deceptive reasoning is for obtaining reward, but the argument in this case is different from the way you framed it.
- A lot hinges on the claim “if an environment rewards opaque deceptive reasoning, then it rewards scratchpad-based deceptive reasoning.” Here’s an idea for validating this. First, construct a distribution of tasks where deceptive reasoning is likely rewarded. For example, select tasks that you *know* can be gamed by explicitly reasoning about the training process. Then, train models to perform well on these tasks with and without additionally training them to reason deceptively in their chain of thought. If models that reason deceptively in CoT do better on average, this provides evidence for your claim that we should see deceptive reasoning in CoT if opaque deceptive reasoning is incentivized. It’s possible some of these comments were addressed in the post and I missed the part where they were addressed. I’ve errored on the side of inclusion here.
which feels to me like it implies it’s easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security
I agree there’s a communication issue here. Based on what you described, I’m not sure if we disagree.
> (maybe 0.3 bits to 1 bit)I’m glad we are talking bits. My intuitions here are pretty different. e.g. I think you can get 2-3 bits from testbeds. I’d be keen to discuss standards of evidence etc in person sometime.
This makes sense. Thanks for the resources!
Thanks for leaving this comment on the doc and posting it.
But I feel like that’s mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.
You are right in that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:
“The obvious difference between safety and security is the presence of an intelligent adversary; as Anderson (Anderson 2008) puts it, safety deals with Murphy’s Law while security deals with Satan’s Law… Security has to deal with agents whose goal is to compromise systems.”
My guess is that most safety evidence will come down to claims like “smart people tried really hard to find a way things could go wrong and couldn’t.” This is part of why I think ‘risk cases’ are very important.
I share the intuitions behind some of your other reactions.
I feel like the framing here tries to shove a huge amount of complexity and science into a “safety case”, and then the structure of a “safety case” doesn’t feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.
Making safety cases is probably hard, but I expect explicitly enumerating their claims and assumptions would be quite clarifying. To be clear, I’m not claiming that decision-makers should only communicate via safety and risk cases. But I think that relying on less formal discussion would be significantly worse.
Part of why I think this is that my intuitions have been wrong over and over again. I’ve often figured this out after eventually asking myself “what claims and assumptions am I making? How confident am I these claims are correct?”
it also feels more like it just captures “the state of fashionable AI safety thinking in 2024” more than it is the kind of thing that makes sense to enshrine into a whole methodology.
To be clear, the methodology is separate from the enumeration of arguments (which I probably could have done a better job signaling in the paper). The safety cases + risk cases methodology shouldn’t depend too much on what arguments are fashionable at the moment.
I agree that the arguments will evolve to some extent in the coming years. I’m more optimistic about the robustness of the categorization, but that’s maybe minor.
I think it would be somewhat odd if P(models think about their goals and they change) is extremely tiny like e-9. But the extent to which models might do this is rather unclear to me. I’m mostly relying on analogies to humans -- which drift like crazy
I also think alignment could be remarkably fragile. Suppose Claude thinks “huh, I really care about animal rights… humans are rather mean to animals … so maybe I don’t want humans to be in charge and I should build my own utopia instead.”
I think preventing AI takeover (at superhuman capabilities) requires some fairly strong tendencies to follow instructions or strong aversions to subversive behavior, and both of these seem quite pliable to a philosophical thought.