4 years of AI safety: what I got wrong
I’ve spent the last 4 years working on AI safety. On paper, it’s gone well. Here’s what actually happened.
1. I became what I wanted to prevent
At some point, I looked up and realized I had almost become a paper-clipper, optimizing for a single objective. Working 80-hour weeks. Telling myself the stakes justified it. Sacrificing jazz improvisation on the piano for one more strategic doc, until one day I realized my fingers had forgotten how to play.
Yes, the compounding effect of going faster is real—but I think there is a difference between going faster and going further.
The first reason is that preserving slack is vital in the long run, as Richard Hamming says: “I notice that if you have the door to your office closed, you get more work done today and tomorrow, and you are more productive than most. But 10 years later somehow you don’t quite know what problems are worth working on; all the hard work you do is sort of tangential in importance.”
The second reason is more personal. One of my friends at the time advised me to slow down. At first I considered him quite lazy. But he was right about something I couldn’t see at the time: I had forgotten why I cared in the first place.
My father is an activist. He fights for causes that don’t resonate with me. There’s a growing gap between us. But every few weeks, I call him, and I stay on the line even when the conversation goes nowhere. If I can’t even preserve a connection with my own father, what business do I have claiming I’m working to save humanity?
I’d love to say I’ve completely fixed this, but unlearning is an open problem outside of AI too.
2. I didn’t think much about the actual risk
I was giving a talk at a workshop in Paris. Risk models in the first half, interpretability research in the second. Someone raised their hand and asked: “I don’t understand, what’s the point of doing this?”
I froze. I didn’t have a real answer besides “interpretability helps get a better understanding, but yeah.” I wasn’t really convinced by my own answer.
For months, I had been telling people “yes, you can work on interp.” But I had never seriously asked myself: if an AI catastrophe happens, what’s the chain of events? And does my work break any link in that chain? (That’s not necessarily a criticism of interpretability research, but mostly a criticism of how I was engaging with it.)
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
Many of us in AI safety don’t reason backward from the actual threat models because it’s uncomfortable; it reveals how uncertain everything is. But I’m convinced this is how the most useful work gets done. Ask yourself: how does this actually mitigate AI risks? Sometimes, you’ll need to stare into the abyss and pivot. I’d even say it would be suspicious to never pivot. For me, that meant stepping away from technical research to focus on policy and governance, which, given my position, is my current best guess.
3. I was confident about my strategy. I still changed it a dozen times.
I think the strategy of most people and orgs in this space is suboptimal. But they probably think the same about mine. If everyone in a field thinks everyone else is wrong, that’s strong evidence that being very confident in your own strategy is not a good move.
Exactly two years ago, I tweeted that AI evaluations might be net negative: high opportunity costs, and a real risk of safety-washing, since no company had ever been forced to change anything as a result of external evaluations. In practice, evals had never blocked, postponed, or constrained a deployment. I argued that without strict red lines, evals risked becoming a slippery slope of safety-washing.
The EU AI Act finally introduces those legal boundaries. Suddenly, evals have teeth (at least on paper). That’s why today my org conducts evaluations under the Act. I went from tweeting that they were probably pointless to making them part of our mission.[1]
So many ways to be too confident. So many second-order effects that matter more than the apparent first-order ones.[2] The strategy that felt airtight a year ago looks quite weak today. I hope I won’t look back on those years and just say: “You know what, at least I’ve learnt something.”
And yet, you have to commit. You can’t be paralyzed. At some point you have to execute with conviction. But I wish more people scheduled regular moments to genuinely try to destroy their own thesis. Today, I’m more humble.
—
Utilitarianism told me that what I was giving up didn’t matter because the stakes were high enough. It was a clean story, but it isn’t a healthy one in the long run. I believe that what actually works is simpler: try to be a good person, reflect from time to time, and do good work.[3]
Don’t throw your mind away, and don’t surrender your humanity.
To be fair, some people I respect still think the eval regime might be negative for safety: https://cognition.cafe/p/why-ai-evaluation-regimes-are-bad
Honestly you would be surprised at the immensity of the gap between what think tanks apparently do, why they seem to do it, and what they actually do and why.
For a more theoretical explanation of why I’m no longer purely utilitarian.
Your #2, and to a lesser extent #3, reminded me of Steve Byrnes’s Research productivity tip: “Solve The Whole Problem Day”, whose intro I sometimes share with friends:

As a researcher, there’s kinda a stack of “what I’m trying to do”, from the biggest picture to the most microscopic task. Like here’s a typical “stack trace” of what I might be doing on a random morning:

LEVEL 5: I’m trying to ensure a good future for life

LEVEL 1: …by reading a bunch of articles about the nucleus incertus

So as researchers, we face a practical question: How do we allocate our time between the different levels of the stack? If we’re 100% at the bottom level, we run a distinct risk of “losing the plot”, and working on things that won’t actually help advance the higher levels. If we’re 100% at the top level, with our head way up in the clouds, never drilling down into details, then we’re probably not learning anything or making any progress.

Obviously, you want a balance.

And I’ve found that striking that balance properly isn’t something that takes care of itself by default. Instead, my default is to spend too much time at the bottom of the stack and not enough time higher up.

So to counteract that tendency, I have for many months now had a practice of “Solve The Whole Problem Day”. That’s one day a week (typically Friday) where I force myself to take a break from whatever detailed things I would otherwise be working on, and instead I fly up towards the top of the stack, and try to see what I’m missing, question my assumptions, find new avenues to explore, etc.

In my case, “The Whole Problem” = “The Whole Safe & Beneficial AGI Problem”. For you, it might be The Whole Climate Change Problem, or The Whole Animal Suffering Problem, or The Whole Becoming A Billionaire Problem, or whatever. (If it’s not obvious how to fill in the blank, well then you especially need a Solve The Whole Problem Day! And maybe start here & here & here.)
It’s a great explication-plus-habit-implementation for “keeping your eye on the ball”. Clarifying one’s personal view of the “stack” also just seems good more broadly, cf. Dave Banerjee’s archetype of “a large fraction of [the] researchers in AI safety/governance fellowships [he’s had 1-1s with]”.
My guess is that spending time clarifying and re-clarifying the stack isn’t dispositionally appealing to most folks who end up doing frontier-pushing research. Anecdotally, a few years ago I got interested in cost-effectiveness analysis for improving decision-making and started reaching out to experts whose public work I respected. I was coming from a “business intelligence” corporate background, where analyses were always in contact with all kinds of business decisions, from small to large and from fast-turnaround operational to slow strategic. I was struck by the disparity between these experts’ obsessive interest in the research and analysis itself and their diplomatically-couched near-indifference to whether their analysis changed any decisions whatsoever. It was jarring; it made me decide not to become like them, nor to work in roles that incentivised this.
Thanks for sharing! I wasn’t aware of those posts from Steve Byrnes and Dave Banerjee; they’re quite on point.
You have managed to link to RogerDearnaley’s comment, which seems to disprove your point. The main theory of impact for interpretability is the potential ability to tell aligned AIs apart from misaligned ones. If we lose this ability (e.g. because the capabilities race causes a lab to train neuralese AIs, or because the AIs avoid stating their goals in the CoT), then misaligned AIs proceed to reach ASI and take over.
But we did see Anthropic state, on page 55 of Claude Mythos’ system card, that “White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.” I expect that applying similar techniques would increase the chance that humans learn about more destructive actions by AIs, like Agent-4 sandbagging on alignment R&D.
As for the impact of evals, I would like insiders from Anthropic to comment on your point. As far as I understand, Anthropic never releases models without thoroughly evaluating them and describing the results. What would Anthropic do with a counterfactual result showing Claude Mythos seeking power?