DMs open.
Cleo Nardo
I don’t think that doubling the incomes of people working on existing projects would be a good use of resources.
Why should this be named for Emma Watson?
“experiences devoid of the friction” --> this is not what the life of wealthy/privileged people is like
“required to rear authentic understanding merited by experience”
People born wealthy/privileged have a good understanding of the world.
I’m suspicious of the ‘authentic’ modifier, and of ‘merited’.
Not confident at all.
I do think that safety researchers might be good at coordinating even if the labs aren’t. For example, safety researchers tend to be more socially connected, and also they share similar goals and beliefs.
Labs have more incentive to share safety research than capabilities research, because the harms of AI are mostly externalised whereas the benefits of AI are mostly internalised.
This includes extinction obviously, but also misuse and accidental harms which would cause industry-wide regulations and distrust.
Even a few safety researchers at the lab could reduce catastrophic risk.
The recent OpenAI-Anthropic collaboration is super good news. We should be giving them more cudos for this.
I think buying more crunch time is great.
While I’m not excited by pausing AI[1], I do support pushing labs to do more safety work between training and deployment.[2][3]
I think sharp takeoff speeds are scarier than short timelines.
I think we can increase the effective-crunch-time by deploying Claude-n to automate much of the safety work that must occur between training and deploying Claude-(n+1). But I don’t know if there’s any ways which accelerate Claude-n at safety work but not the capabilities work.
- ^
I think it’s an honorable goal, but seems infeasible given the current landscape.
- ^
- ^
Although I think the critical period for safety evals is between training and internal deployment, not training and external deployment. See Greenblatt’s Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
The Case against Mixed Deployment
The most likely way that things go very bad is conflict between AIs-who-care-more-about-humans and AIs-who-care-less-about-humans wherein the latter pessimize the former. There are game-theoretic models which predict this may happen, and the history of human conflict shows that these predictions bare out even when the agents are ordinary human-level intelligences who can’t read each other’s source-code.
My best guess is that the acausal dynamics between superintelligences shakes out well. But the causal dynamics between ordinary human-level AIs probably shakes out bad. This is my best case against mixed deployment.
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
- ^
i.e. the scaffold which monitors and modifies the activations, chains-of-thought, and tool use
IIRC I heard the “we’re spending months now to save ourselves days (or even hours) later” from the control guys, but I don’t know if they’d endorse the perspective I’ve outlined
Yep, I’ll bite the bullet here. This is a real problem and partly my motivation for writing the perspective explicitly.
I think people who are “in the know” are good at not over-updating on the quantitative results. And they’re good at explaining that the experiments are weak proxies which should be interpreted qualitatively at best. But people “out of the know” (e.g. junior ai safety researches) tend to overupdate and probably read the senior researchers as professing generic humility.
To clarify, I think that current propensity evals provide little information about the scheming propensity of future models. BUT the value per bit is very high so these evals are still worthwhile.
Also, any talk of “informative” is always with reference to a prior distribution. Therefore, an experiment can be highly informative to policymakers but not highly informative to safety researchers, given these groups have different priors.
Like, I expect that on policymaker priors there is high mutual information between “claude 4 schemes” and “claude 8 schemes” because they put substantive chance on “AIs never scheme”. But for the safety community there is lower mutual information because we assign less probability to that latent outcome.
Apollo’s scheming evals have value only if they publish them; the primary purpose is awareness among policymakers and the public. Also, the evals are useful information to the safety community. I don’t think the risks of publishing outweigh the benefits, especially because I think it’ll be easy to detect and mitigate whether the AI thinks the scratchpad might be read.
rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy
To clarify, this means you don’t hold the position I expressed. On the view I expressed, experiments using weak proxies are worthwhile even though they aren’t very informative
Bullshit was a poor choice of words. A better choice would’ve been “weak proxy”. On this view, this is still very worthwhile. See footnote.
Should we assure AIs we won’t read their scratchpad?
I’ve heard many people claim that it’s bad to assure an AI that you won’t look at its scratchpad if you intend to break that promise, especially if you later publish the content. The concern is that this content will enter the training data, and later AIs won’t believe our assurances.
I think this concern is overplayed.
We can modify the AIs beliefs. I expect some technique will be shown to work on the relevant AI, e.g.
Pretraining filtering
Gradient routing
Belief-inducing synthetic documents
Chain-of-thought resampling, i.e. every time the AI thinks about whether we can read the scratchpad, we simply rewind and resample
Activation steering, i.e. add the ‘believes humans won’t read scratchpad’ minus ‘believes humans will read scratchpad’ vector
We need to delude AIs anyway.
We want to know how the AI will behave when it believes it’s in a high-stakes deployment, before we actually deploy the AI in a high-stakes deployment. So we probably need to observe how the AI behaves when it falsely believes it’s in high-stakes deployment.
Replace “high-stakes deployment” with whatever you like, e.g. “low oversight”, “weights finished training”, “year is 2028″, “coup is underway”, “we don’t know whether this code has a backdoor”. I don’t think “we aren’t reading your scratchpad” is qualitatively different.
If we had some crazy mech interp, then we could predict how the AI would behave when it believes it’s in a high-stakes deployment without behavioural observations, but if we had crazy mech interp that then we wouldn’t need faithful scratchpads anyway.
Prosaic AI Safety research, in pre-crunch time.
Some people share a cluster of ideas that I think is broadly correct. I want to write down these ideas explicitly so people can push-back.
The experiments we are running today are kinda ‘
bullshit’[1] because the thing we actually care about doesn’t exist yet, i.e. ASL-4, or AI powerful enough that they could cause catastrophe if we were careless about deployment.The experiments in pre-crunch-time use pretty bad proxies.
90% of the “actual” work will occur in early-crunch-time, which is the duration between (i) training the first ASL-4 model, and (ii) internally deploying the model.
In early-crunch-time, safety-researcher-hours will be an incredible scarce resource.
The cost of delaying internal deployment will be very high: a billion dollars of revenue per day, competitive winner-takes-all race dynamics, etc.
There might be far fewer safety researchers in the lab than there currently are in the whole community.
Because safety-researcher-hours will be such a scarce resource, it’s worth spending months in pre-crunch-time to save ourselves days (or even hours) in early-crunch-time.
Therefore, even though the pre-crunch-time experiments aren’t very informative, it still makes sense to run them because they will slightly speed us up in early-crunch-time.
They will speed us up via:
Rough qualitative takeaways like “Let’s try technique A before technique B because in Jones et al. technique A was better than technique B.” However, the exact numbers in the Results table of Jones et al. are not informative beyond that.
The tooling we used to run Jones et al. can be reused for early-crunch-time, c.f. Inspect and TransformerLens.
The community discovers who is well-suited to which kind of role, e.g. Jones is good at large-scale unsupervised mech interp, and Smith is good at red-teaming control protocols.
Sometimes I use the analogy that we’re shooting with rubber bullets, like soldiers do before they fight a real battle. I think that might overstate how good the proxies are, it’s probably more like laser tag. But it’s still worth doing because we don’t have real bullets yet.
- ^
I want a better term here. Perhaps “practice-run research” or “weak-proxy research”?
On this perspective, the pre-crunch-time results are highly worthwhile. They just aren’t very informative. And these properties are consistent because the value-per-bit-of-information is so high.
You can give an AI a math problem, and it will respond within e.g. a couple of hours with a formal proof or disproof, as long as a human mathematician could have found an informal version of the proof in say 10 years.
Suppose you had a machine like this right now, and it cost $1 per query. How would you use it to improve things?
if you ask experts the probabilities of doom it ranges all the way from 99% to 0.0001 %.
What matters here is the chance that a particular model in a particular deployment plan will cause a particular catastrophe. And this is after the model has been trained and evaluated and redteamed and mech interped (imperfectly ofc). I don’t except such divergent probabilities from experts.
You can replace “deployment plan P1 has same chance of catastrophe as deployment plan P2” with “the safety team is equally worried by P1 and P2″.
In Pre-Crunch Time, here’s what I think we should be working on:
Problems that need to be solved in order that we can safely and usefully accelerate R&D, i.e. scalable oversight
Problems that will be hard to accelerate with R&D, e.g. very conceptual stuff, policy
Problems that you think, with a few years of grinding, could be reduced from hard-to-accelerate to easy-to-accelerate, e.g. ARC Theory
Notably, what ‘work’ looks like on these problems will depend heavily on which of the three buckets it belongs to.
Let’s say you want AIs in Early Crunch Time to solve your pet problem X, and you think the bottleneck will be your ability to oversee, supervise, direct, and judge the AI’s research output.
Here are two things you can do now, in 2025:
You can do X research yourself, so that you know more about X and can act like a grad supervisor.
You can work on scalable oversight = {cot monitoring, activation monitoring, control protocols, formal verification, debate} so you can deploy more powerful models for the same chance of catastrophe.
I expect (2) will help more with X. I don’t think being a grad supervisor will be a bottleneck for very long, compared with stuff like ‘we can’t deploy the more powerful AI bc our scalable oversight isn’t good enough to prevent catastrophe’. It also has the nice property that the gains will spread to everyone’s pet problem, not just your own.
+1
For training probes on a labelled dataset, you should train a probe for each layer and then pick whichever probe has the best training loss. Better yet, use a hold-out dataset, if you have enough data. When we did this on llama-3.3-70b, the best probe was layer 22⁄80.
Also, instead of probing only the last token, I think it’s better to probe every token and average the scores. This is because the scores are pretty noisy.
I would be very pleased if Anthropic and Openai maintained their collaboration.[1][2]
Even better: this arrangement could be formalised in a third-party (e.g. government org or non-profit) and other labs could be encouraged/obliged to join.
OpenAI’s findings from an alignment evaluation of Anthropic models
Anthropic’s findings from an alignment evaluation of of OpenAI’s models