Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production “RL models” we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
Out of curiosity, have your takes here changed much lately?
I think the o3+ saga has updated me a small-medium amount toward “companies will just deploy misaligned AIs and consumers will complain but use them anyway” (evidenced by deployment of models that blatantly lie from multiple companies) and “slightly misaligned AI systems that are very capable will likely be preferred over more aligned systems that are less capable” (evidenced by many consumers, including myself, switching over to using these more capable lying models).
I also think companies will work a bit to reduce reward hacking and blatant lying, and they will probably succeed to some extent (at least for noticeable, everyday problems), in the next few months. That, combined with OpenAI’s rollback of 4o sycophancy, will perhaps make it seem like companies are responsive to consumer pressure here. But I think the situation is overall a small-medium update against consumer pressure doing the thing you might hope here.
Side point: Noting one other dynamic: advanced models are probably not going to act misaligned in everyday use cases (that consumers have an incentive to care about, though again revealed preference is less clear), even if they’re misaligned. That’s the whole deceptive alignment thing. So I think it does seem more like the ESG case?
I agree that the report conflates these two scales of risk. Fortunately, one nice thing about that table (Table 1 in the paper) is that readers can choose which of these risks they want to prioritize. I think more longtermist-oriented folks should probably rank Loss of Control as the worst of these, followed perhaps by Bad Lock-in, then Misuse and War. But obviously there’s a lot of variance within these.
I agree that there *might* be some cases where policymakers will have difficult trade-offs to make about these risks. I’m not sure how likely I think this is, but I agree it’s a good reason to keep this nuance insofar as we can. I guess it seems to me like we’re not anywhere near the relevant decision makers actually making these tradeoffs, nor near them having values that particularly up-weight the long-term future.
I therefore feel okay about lumping these together in a lot of my communication these days. But perhaps this is the wrong call, idk.
The viability of a pause is dependent on a bunch of things, like the number of actors who could take some dangerous action, how hard it would be for them to do that, how detectable it would be, etc. These are variable factors. For example, if the world got rid of advanced AI chips completely, dangerous AI activities would then take a long time and be super detectable. We talk about this in the research agenda; there are various ways to extend “breakout time”, and these methods could be important to long-term stability.
AI Governance to Avoid Extinction: The Strategic Landscape and Actionable Research Questions
I think your main point is probably right but was not well argued here. It seems like the argument is a vibe argument of like “nah they probably won’t find this evidence compelling”.
You could also make an argument from past examples where there has been large-scale action to address risks in the world, and look at the evidence there (e.g., the banning of CFCs, climate change more broadly, tobacco regulation, etc.).
You could also make an argument from existing evidence around AI misbehavior and how it’s being dealt with, where (IMO) ‘evidence much stronger than internals’ basically doesn’t seem to affect the public conversation outside the safety community (or even much here).
I think it’s also worth saying a thing very directly: just because non-behavioral evidence isn’t likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs. Buck’s previous post and many others discuss the rough epistemic situation when it comes to detecting misalignment. Internals evidence is going to be one of the tools in the toolkit, and it will be worth keeping in mind.
Another thing worth saying: if you think scheming is plausible, and you think it will be difficult to update against scheming from behavioral evidence (Buck’s post), and you think non-behavioral evidence is not likely to be widely convincing (this post), then the situation looks really rough.
I appreciate this post, I think it’s a useful contribution to the discussion. I’m not sure how much I should be updating on it. Points of clarification:
Within the first three months of our company’s existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated.
Have you upgraded these benchmarks? Is it possible that the diminishing returns you’ve seen in the Sonnet 3.5-3.7 series are just normal benchmark saturation? What % scores are the models getting? E.g., somebody could make the same observation about MMLU and basically be like “we’ve seen only trivial improvements since GPT-4”, but that’s because the benchmark is not differentiating progress well after like the high 80%s (in turn, I expect this is due to test error and the distribution of question difficulty).
Is it correct that your internal benchmark is all cybersecurity tasks? Soeren points out that companies may be focusing much less on cyber capabilities than general SWE.
How much are you all trying to elicit models’ capabilities, and how good do you think you are at it? E.g., do you spend substantial effort identifying where the models are getting tripped up and trying to fix this? Or are you just plugging each new model into the same scaffold for testing (which, to be clear, is a fine thing to do, but is a methodological point worth keeping in mind)? I could totally imagine myself seeing relatively little performance gain if I’m not trying hard to elicit new models’ capabilities. This would be even worse if my scaffold+ were optimized for some other model, as now I have an unnaturally high baseline (optimizing a scaffold for an earlier model is a very sensible thing to do for business reasons, as you want a good scaffold early and it’s a pain to update, but it’s useful methodology to be aware of when making model comparisons). Especially re the o1 models, as Ryan points out in a comment.
One of the assumptions guiding the analysis here is that sticker prices will approach marginal costs in a competitive market. DeepSeek recently released data about their production inference cluster (or at least one of them). If you believe their numbers, they report theoretical (assuming no discounts and assuming use of the more expensive model) daily revenue of $562,027, with a cost profit margin of 545%. DeepSeek is one of, if not the, lowest-priced providers for the DeepSeek-R1 and DeepSeek-V3 models. So this data indicates that even the relatively cheap providers could be making substantial profits, providing evidence against the minimum-priced provider being near marginal cost.
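As a rough sanity check on what that margin implies (a back-of-the-envelope sketch, assuming “cost profit margin of 545%” means profit divided by cost, which is how I read their post):

```python
# Back-of-the-envelope check, assuming "cost profit margin of 545%" means profit / cost = 5.45.
theoretical_daily_revenue = 562_027   # USD/day, as reported by DeepSeek
margin = 5.45                         # profit / cost

implied_daily_cost = theoretical_daily_revenue / (1 + margin)
implied_daily_profit = theoretical_daily_revenue - implied_daily_cost

print(f"implied daily cost   ~ ${implied_daily_cost:,.0f}")    # roughly $87,000
print(f"implied daily profit ~ ${implied_daily_profit:,.0f}")  # roughly $475,000
```

If you take the numbers at face value, the sticker price is roughly 6.45x the serving cost, i.e., even the cheapest major provider would be pricing well above marginal cost.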
Observations About LLM Inference Pricing
Hm, sorry, I did not mean to imply that the defense/offense ratio is infinite. It’s hard to know, but I expect it’s finite for the vast majority of dangerous technologies[1]. I do think there are times when the resources and intelligence needed for defense are more than a civilization can muster. If an asteroid were headed for Earth 200 years ago, we simply would not have been able to do anything to stop it. Asteroid defense is not impossible in principle — the defensive resources and intelligence needed are not infinite — but they are certainly above what 1825 humanity could have mustered in a few years. It’s not impossible in principle, but it’s impossible for 1825 humanity.
While defense/offense ratios are relevant, I was more trying to make the point that these are disjunctive threats, that some might be hard to defend against (i.e., have a high defense-offense ratio), and that we’ll have to defend against them on a super short time frame. I think this argument goes through unless one is fairly optimistic about the defense-offense ratio for all the technologies that get developed rapidly. The argumentative/evidential burden needed to be on net optimistic about this situation is thus pretty high, and per the public arguments I have seen, that burden has not been met.
(I think it’s possible I’ve made some heinous reasoning error that places too much burden on the optimism case, if that’s true, somebody please point it out)
- ^
To be clear, it certainly seems plausible that some technologies have a defense/offense ratio which is basically unachievable with conventional defense, and that you need to do something like mass surveillance to deal with these. E.g., triggering vacuum decay seems like the type of thing where there may not be technological responses that avert catastrophe once the decay has started; instead, the only effective defenses are ones that stop anybody from doing the thing to begin with.
I think your discussion for why humanity could survive a misaligned superintelligence is missing a lot. Here are a couple claims:
1. When there are ASIs in the world, we will see ~100 years of technological progress in 5 years (or like, what would have taken humanity 100 years in the absence of AI). This will involve the development of many very lethal technologies.
2. The aligned AIs will fail to defend the world against at least one of those technologies.
Why do I believe point 2? It seems like the burden of proof is really high to say that “nope, every single one of those dangerous technologies is going to be something that it is technically possible for the aligned AIs to defend against, and they will have enough lead time to do so, in every single case”. If you’re assuming we’re in a world with misaligned ASIs, then every single existentially dangerous technology is another disjunctive source of risk. Looking out at the maybe-existentially-dangerous technologies that have been developed previously and that could be developed in the future, e.g., nuclear weapons, biological weapons, mirror bacteria, false vacuum decay, nanobots, I don’t feel particularly hopeful that we will avoid catastrophe. We’ve survived nuclear weapons so far, but with a few very close calls — if you assume other existentially dangerous technologies go like this, then we probably won’t make it past a few of them. Now crunch that all into a few years, and like gosh it seems like a ton of unjustified optimism to think we’ll survive every one of these challenges.
It’s pretty hard to convey my intuition around the vulnerable world hypothesis; I also try to do so here.
I was surprised to see you choose to measure faithfulness using the setup from Chua et al. and Turpin et al. rather than Lanham et al. IMO, the latter is much better, albeit restricted in that you have to do partial pre-filling of model responses (so you might be constrained on which models you can run it on, but it should be possible on QwQ). I would guess this is partially for convenience reasons, as you already have a codebase that works and that you’re familiar with, and partially because you think this is a better setup. Insofar as you think this is a better setup, I would be excited to hear why. Insofar as you might do follow-up work, I am excited to see the tests from Lanham et al. applied here.
I would happily give more thoughts on why I like the measurement methods from Lanham et al., if that’s useful.
I like this blog post. I think this plan has a few problems, which you mention: e.g., Potential Problem 1, getting the will and oversight to enact this domestically, and getting the will and oversight/verification to enact it internationally.
There’s a sense in which any plan like this that coordinates AI development and deployment to a slower-than-ludicrous rate seems like it reduces risk substantially. To me it seems like most of the challenge comes from getting to a place of political will from some authority to actually do that (and in the international context there could be substantial trust/verification needs). But nevertheless, it is interesting and useful to think through what some of the details might be of such a coordinated-slow-down regime. And I think this post does a good job explaining an interesting idea in that space.
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (how does this work financially).
To speak to the business side: Currently, the AI industry is relying on large expected future profits to generate investment. This is not a business model which is amenable to pausing for a significant period of time. I would like there to be minimal friction to pausing. One way to solve this problem is to invest heavily (and have a plan to invest more if a pause is imminent or ongoing) in revenue streams which are orthogonal to catastrophic risk, or at least not strongly positively correlated. As an initial brainstorm, these streams might include:
- Making really cheap weak models.
- AI integration in low-stakes domains or narrow AI systems (ideally combined with other security measures such as unlearning).
- Selling AI safety solutions to other AI companies.
A plan for the business side of things should also include something about “what do we do about all the expected equity that employees lose if we pause, and how do we align incentives despite this”. It should probably also include a commitment to ensure all investors and business partners understand that a long-term pause may be necessary for safety and are okay with that risk (maybe this is sufficiently covered under the current corporate structure, I’m not sure, but those structures sure can change).
It’s all well and good to have an RSP that says “if X, we will pause”, but the situation is probably going to be very messy, with ambiguous evidence, crazy race pressures, crazy business pressures from external investors, etc. Investing in other revenue streams could reduce some of this pressure, and (if shared) it could potentially enable a wider pause. E.g., all AI companies see a viable path to profit if they just serve early AGIs for cheap, and nobody has intense business pressure to go to superintelligence.
Second, I would like Anthropic to invest in its ability to make credible commitments about internal activities and model properties. There is more about this in Miles Brundage’s blog post and my paper, as well as FlexHEGs. This might include things like:
- cryptographically secured audit trails (version control for models). I find it kinda crazy that AI companies sometimes use external pre-deployment testers and then change a model in completely unverifiable ways and release it to users. Wouldn’t it be so cool if OpenAI couldn’t do that, and instead, when their system card comes out, there were certificates verifying which model was evaluated and how the model was changed between evaluation and deployment? That would be awesome!
- whistleblower programs,
- declaring and allowing external auditing of what compute is used for (e.g., differentiating training vs. inference clusters in a clear and relatively unspoofable way),
- using TEEs and certificates to attest that the model that was evaluated is the same one deployed to users, and more.
I think investment/adoption in this from a major AI company could be a significant counterfactual shift in the likelihood of national or international regulation that includes verification. Many of these are also good for being-a-nice-company reasons, like I think it would be pretty cool if claims like Zero Data Retention were backed by actual technical guarantees rather than just trust (which it seems like is the status quo).
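To illustrate the kind of model-certificate mechanism I have in mind, here is a minimal sketch (not any lab’s actual scheme; the filenames and workflow are hypothetical, and a real system would layer signatures and TEE-based attestation on top of this):

```python
# Minimal sketch of a model "fingerprint" an external evaluator could record.
# Hypothetical filenames/workflow; a real scheme would add digital signatures and
# TEE attestation that the serving stack actually loaded weights with this digest.
import hashlib

def model_fingerprint(weights_path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 digest of the weights file, streamed in chunks."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Workflow sketch: the external evaluator records
#   model_fingerprint("model-under-test.safetensors")
# in the system card; the deployed endpoint later attests (e.g., from inside a TEE)
# that it is serving weights with a matching digest, so silent post-evaluation
# changes to the model become detectable.
```

The point is just that the basic primitive is simple; the hard parts are the attestation infrastructure and the institutional commitment to use it.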
I believe this is standard/acceptable for presenting log-axis data, but I’m not sure. This is a graph from the Kaplan paper:
It is certainly frustrating that they don’t label the x-axis. Here’s a quick conversation where I asked GPT-4o to explain. You are correct that a quick look at this graph (where you don’t notice the log scale) would imply (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power law. I don’t think they were trying to mislead here; rather, this is a domain where log-scaled axes are super common and don’t invalidate the results in any way.
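To make the sub-linear point concrete, here is the usual Kaplan-style power-law form (a sketch; the exact constants are in the paper, and the compute exponent is small, something like 0.05 if I recall correctly). Taking logs turns the power law into a straight line, which is why the curve can look linear on log-log axes:

$$
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\quad\Longrightarrow\quad
\log L \approx \alpha_C \log C_c - \alpha_C \log C,
$$

i.e., a straight line in log L vs. log C, even though L(C) itself is strongly sub-linear in C.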
From the o1 blog post (this is evidence about their methodology for presenting results, though not necessarily the same methodology):
o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
The standard scaling law people talk about is for pretraining, shown in the Kaplan and Hoffmann (Chinchilla) papers.
It was also the case that various post-training (i.e., finetuning) techniques improve performance (though I don’t think there is as clean a scaling law there; I’m unsure). See, e.g., this paper, which I just found via googling fine-tuning scaling laws. See also the Tülu 3 paper, Figure 4.
We have also already seen scaling law-type trends for inference compute, e.g., this paper:
The o1 blog post points out that they are observing two scaling trends: predictable scaling w.r.t. post-training (RL) compute, and predictable scaling w.r.t. inference compute:
The paragraph before this image says: “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.” That is, the left graph is about post-training compute.
Following from the graph on the left, the o1 paradigm gives us models that are better for a fixed inference compute budget (which is basically what it means to train a model for longer, or to train a better model of the same size by using better algorithms — the method is new but not the trend). Following from the graph on the right, performance also seems to scale well with inference compute budget. I’m not sure there’s sufficient public data to compare that right-hand graph against other inference-compute scaling methods, but my guess is the returns are better.
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I mean, if you replace “o1” in this sentence with “monkeys typing Shakespeare with ground truth verification,” it’s true, right? But o3 is actually a smarter mind in some sense, so it takes [presumably much] less inference compute to get similar performance. For instance, see this graph about o3-mini:
The performance-per-dollar frontier is pushed up by the o3-mini models. It would be somewhat interesting to know how much it would cost for o1 to reach o3 performance here, but my guess is that it’s a huge amount and practically impossible. That is, there are some performance levels that are practically unobtainable for o1, the same way the monkeys won’t actually complete Shakespeare.
Hope that clears things up some!
The ARC-AGI page (which I think has been updated) currently says:
At OpenAI’s direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
Regarding whether this is a new base model, we have the following evidence:
o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)
The prices leaked by ARC-AGI people indicate $60/million output tokens, which is also the current o1 pricing: 33m total tokens and a cost of $2,012.
Notably, the codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it’s secretly a log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That’s corroborated by ARC-AGI reporting an average of 55k tokens per solution[1], which seems like a ton.
I think this evidence indicates this is likely the same base model as o1; I’d be at like 65%, so not super confident.
- ^
edit to add because the phrasing is odd: this is the data being used for the estimate, and the estimate is 33m tokens / (100 tasks * 6 samples per task) = ~55k tokens per sample. I called this “solution” because I expect these are basically 6 independent attempts at answering the prompt, but somebody else might interpret things differently. The last column is “Time/Task (mins)”.
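For what it’s worth, here is that arithmetic spelled out (a quick sanity check using the numbers above; it assumes essentially all 33m billed tokens are output tokens, which may not be exactly right):

```python
# Quick sanity check of the token/cost numbers discussed above.
output_price_per_million = 60.0      # $/1M output tokens, current o1 pricing
total_tokens = 33_000_000            # total tokens from the ARC-AGI write-up
tasks, samples_per_task = 100, 6     # high-efficiency setting

implied_cost = total_tokens / 1_000_000 * output_price_per_million
tokens_per_sample = total_tokens / (tasks * samples_per_task)

print(f"implied cost ~ ${implied_cost:,.0f}")           # ~$1,980, close to the reported $2,012
print(f"tokens per sample ~ {tokens_per_sample:,.0f}")  # 55,000
```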
- ^
I got the model up to 3,000 tokens/s on a particularly long/easy query.
As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury