you may already be familiar with my ideas
Yep, I was thinking about outputs from you (in addition to outputs from others) while writing this.
Given that human experts seem to have systematic biases (e.g., away from ensembling), wouldn’t the ensemble also exhibit such biases? Have you taken this into account when saying “My guess is that the answer is yes”?
Yes, I’m including this sort of consideration when I say top human experts would probably be sufficient.
Your point around companies not wanting to ensemble and wanting to use their own epistemic strategies is a fair one; I’ll think about it more. This is connected to a broader point around political difficulties in making an AI that’s actually good to defer to.
Note that I say “as of the release of various frontier AIs”. The Pareto frontier includes the possibility of using cheaper AIs. I maybe should have mentioned this in the post in a footnote. From this perspective, the AI Pareto frontier only strictly improves (putting aside AIs being taken out of deployment, etc.).
How do we (more) safely defer to AIs?
Distinguish between inference scaling and “larger tasks use more compute”
It looks as though steering doesn’t suffice to mostly eliminate all evaluation awareness, and unverbalized evaluation awareness is often substantially higher (this is for Opus 4.6, but presumably the results also apply in general):
I wonder what the natural way to apply Wright’s law to AI is. Cumulative GPU hours? Total quality adjusted inference usage?
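For concreteness, Wright’s law says unit cost falls as a power law in cumulative production. Below is a minimal sketch of fitting it; the choice of proxy (cumulative GPU-hours) and all of the numbers are made-up assumptions, purely to illustrate the functional form.

```python
# Minimal sketch of fitting Wright's law: cost_n = cost_1 * n ** (-alpha),
# where n is cumulative "production". The proxy (cumulative GPU-hours) and
# all numbers below are made-up assumptions for illustration.
import numpy as np

cum_gpu_hours = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # hypothetical
unit_cost = np.array([100.0, 55.0, 30.0, 17.0, 9.0])   # hypothetical

# Fit log(cost) = log(cost_1) - alpha * log(cumulative production).
slope, log_c1 = np.polyfit(np.log(cum_gpu_hours), np.log(unit_cost), 1)
alpha = -slope

# Fractional cost reduction per doubling of cumulative production.
learning_rate = 1 - 2 ** (-alpha)
print(f"alpha = {alpha:.2f}; cost falls ~{learning_rate:.0%} per doubling")
```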
It’s easiest to negotiate and most useful if both sides have a reasonably good estimate of how many chips the other side has. I think it will probably be easy for intelligence agencies on either side to get an estimate within +/- 5%, which is sufficient for a minimal version of this. This is made better by active efforts and by governments trying to verifiably demonstrate how many chips they have as part of negotiation.
(This is in part because I don’t expect active obfuscation until pretty late, so normal records etc will exist.)
When I say “misaligned AI takeover”, I mean that the acquisition of resources by the AIs would reasonably be considered (mostly) illegitimate; some fraction of this could totally include many humans surviving with a subset of resources (though I don’t currently expect property rights to remain intact in such a scenario very long term). Some of these outcomes could avoid literal coups or violence while still being illegitimate; e.g., they could involve carefully planned capture of governments in ways their citizens/leaders would strongly object to if they understood what was happening, with things like this driving most of the power acquisition.
I’m not counting it as takeover if “humans never intentionally want to hand over resources to AIs, but due to various effects misaligned AIs end up with all of the resources through trade and not through illegitimate means” (e.g., we can’t make very aligned AIs, but people make various misaligned AIs while knowing they are misaligned and thus must be paid wages, and the AIs form a cartel rather than having wages competed down to subsistence levels, and thus AIs end up with most of the resources).
I currently don’t expect human disempowerment in favor of AIs (that aren’t appointed successors) conditional on no misaligned AI takeover, but agree this is possible; it doesn’t form a large enough probability to substantially alter my communication.
People often say US-China deals to slow AI progress and develop AI more safely would be hard to enforce/verify.
However, there are easy-to-enforce deals: each side destroys a fraction of its chips at some level of AI capability. This still seems like it could be helpful, and it’s pretty easy to verify.
This is likely worse than a well-executed comprehensive deal, which would allow for productive non-capabilities uses of the compute (e.g., safety or even just economic activity). But it’s harder to verify that chips aren’t being used to advance capabilities than to check that they have been destroyed.
See also: Daniel’s post on “Cull the GPUs”.
I wrote up my disagreements in a quick take.
I agree with the core message in Dario Amodei’s essay “The Adolescence of Technology”: AI is an epochal technology that poses massive risks and humanity isn’t clearly going to do a good job managing these risks.
(Context for LessWrong: I think it seems generally useful to comment on things like this. I expect that many typical LessWrong readers will agree with me and find my views relatively predictable, but I thought it would be good to post here anyway.)
However, I also disagree with (or dislike) substantial parts of this essay:
I think it’s reasonable to think that the chance of AI takeover is very high, and it’s bad that Dario seemingly dismisses this as “doomerism” and “thinking about AI risks in a quasi-religious way”. It’s important not to conflate “there are unreasonable sensationalist people who think risks are very high” with the very idea that risk is very high. (Same as how Dario doesn’t want to be associated with everyone arguing that risk is relatively lower, including people doing so in a more sensationalist way.)
I agree that there are unreasonable and sensationalist people arguing risks are very high. I think focusing on weak men is dangerous, especially when seemingly trying (in part) to reassure the world that your actions are reasonable / sufficiently safe.
I think the chance of misaligned AI takeover is much higher than Dario seemingly implies. (I’d say around 40%.) In large part, I think this is because I don’t think currently available methods would work on AI systems that are wildly, wildly superhuman, and we might quickly reach such systems as part of AIs automating AI R&D.
Dario’s arguments seemingly don’t engage with the idea that AIs might get wildly more capable than humans very rapidly.
I think wildly superhuman (and competitive) systems are much more likely (than much weaker AIs) to be extremely effective consequentialists that are very good at pursuing long range goals and it might be hard to instill the relevant motivations that prevent takeover and/or actually instill the right long range goals. These AIs are also straightforwardly hard to supervise/correct due to being very superhuman: they may often be taking extremely illegible actions that humans can’t understand even after substantial investigation.
By default, training wildly, wildly superhuman AIs with literally the current level of controllability/steerability we have now seems very likely to result in AI takeover (or at least misaligned AI systems ending up with most of the power in the long run). However, we may be able to use AI systems along the way to automate safety research such that our methods also radically improve by the time we reach this point. From my understanding, this is the main hope of AI companies that are thinking about these difficulties at all. Unfortunately, there are various reasons why this could fail.
Dario seems to dismiss the idea that training wildly superhuman AIs within a few years (that run the whole economy) would probably lead to AI takeover without major alignment advances. This dismissal seems unsubstantiated and doesn’t engage with the arguments on the other side.
Dario generally seems overly myopic (focused on current levels of capabilities and methods) while also seemingly acknowledging that capabilities will advance rapidly (including at least to the point where AIs surpass humans). He seems to over-update on evidence about current models and generalize that to a regime where we are having possibly many years (decades?) of AI (software) progress in a year driven by AIs automating AI R&D. This could easily yield very different paradigms and mean that empirical evidence about current models is only weakly related to what we see later.
I wish Dario tried harder to engage with the views of people who think risk is much higher (e.g., the views of employees on the Anthropic alignment science team who think this). He often seems to argue against simplified pictures of a case for very high risk and then conclude risk is low. [1]
I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).
This is referencing “As more specific and actionable evidence of risks emerges (if it does), future legislation over the coming years can be surgically focused on the precise and well-substantiated direction of risks, minimizing collateral damage. To be clear, if truly strong evidence of risks emerges, then rules should be proportionately strong.”
We should distinguish between legible evidence that will cause action (without as much backlash) and what actors should do if they were reasonable.
Dario doesn’t clearly explain what “autonomy risks” and “destroying humanity” entail: many, most, or all humans being killed and humanity losing control over the future. I think it’s important to emphasize the severity of outcomes, and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.
There are some complications and caveats, but I think violent AI takeover that kills most humans is a central bad outcome.
Dario strongly implies that Anthropic “has this covered” and wouldn’t be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered, and in an (optimistic for Anthropic) world where Anthropic had a 3-month lead, I think the chance of AI takeover would be high, perhaps around 20%. With a 2 or 3 year lead spent well on safety, I think risk would be substantially lower, but I don’t see this as being likely to occur. (As I’d guess Dario would agree.)
I think it’s unhealthy and bad for AI companies to give off a “we have this covered and will do a good job” vibe if they actually believe that even if they were in the lead, risk would be very high. At the very least, I expect many employees at Anthropic working on alignment, safety, and security don’t believe Anthropic has the situation covered.
Dario says “If the exponential continues—which is not certain, but now has a decade-long track record supporting it—then it cannot possibly be more than a few years before AI is better than humans at essentially everything.” This seems solidly wrong to me; there are extremely reasonable (IMO more reasonable) ways of doing a naive trend extrapolation of capabilities that contradict this. [2]
For instance, a naive trend extrapolation of time-horizon implies that it will take 5 years for AIs to complete year-long software engineering tasks with 80% reliability (and this could still be consistent with not surpassing the best human software engineers). AIs are relatively better at software engineering, and progress is slower in other domains. (See the sketch after this list for the kind of arithmetic behind such an extrapolation.)
Dario says “Given the incredible economic and military value of the technology, together with the lack of any meaningful enforcement mechanism, I don’t see how we could possibly convince them to stop.” I think it would be possible to verify and enforce a deal where both sides slow down. I currently don’t expect sufficient political will for this, but if the US government was reasonable, it should strongly pursue this in my opinion. I think the best deals to improve safety by slowing down AI at a critical time are probably substantially harder to enforce than the easiest arms control treaties but still doable and there are some options that are very easy to enforce and still seem like they could be very valuable (e.g., an agreement where each side destroys some large fraction of its compute so progress slows down from here).
I dislike Dario’s messaging about how to handle authoritarian countries in this essay and in Machines of Loving Grace. He seemingly implies that democracies should use overwhelming power downstream of AI to undermine the sovereignty of authoritarian countries and shouldn’t consider cutting them into deals. I think this encourages these countries to race harder on AI and makes prospects for positive sum deals worse.
Dario doesn’t mention that insufficient computer security complicates the picture on democracies being able to leverage having more compute to slow down (especially at high levels of capability where AIs can automate AI R&D). AI companies currently have sufficiently weak security that the most capable adversaries would easily be able to steal the model and getting to a point where it would be very hard seems difficult and far away. It seems unclear if either side can get a meaningful lead without greatly improving computer security (the US can also steal models from autocracies, so it’s unclear we need to spend on AI progress at all to keep up in a low security regime).
I do think it should be possible for democracies to leverage a compute lead by only spending as much compute (and other resources) as the adversary on capabilities and the rest on safety. This would naturally slow things down and give vastly more resources for safety. However, it could be that at the point we most want to slow down (e.g. AIs fully automating AI R&D), progress is naturally much faster, so cutting compute by 5-10x doesn’t give that much calendar time. So my bottom line is that a compute lead probably does allow for a moderate slow down, but the picture is complicated.
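To make the time-horizon arithmetic referenced above concrete, here is a minimal sketch; the current 80%-reliability horizon, the doubling time, and the length of a “year-long task” are all placeholder assumptions rather than sourced figures, and the output is very sensitive to them.

```python
# Minimal sketch of a naive time-horizon extrapolation.
# All three inputs are placeholder assumptions, not sourced figures.
import math

current_horizon_hours = 0.5     # assumed current 80%-reliability horizon
doubling_time_months = 5.0      # assumed doubling time of the horizon
target_horizon_hours = 2000.0   # roughly one work-year of engineering tasks

doublings = math.log2(target_horizon_hours / current_horizon_hours)
years = doublings * doubling_time_months / 12
print(f"{doublings:.1f} doublings -> ~{years:.1f} years")  # ~5 years with these inputs
```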
I agree that:
Progress is fast and “powerful AI” as Dario defines it could be only 2 years away (I think the chance of “powerful AI” within 8 years is around 50%, and within 3 years around 25%).
Most risk ultimately comes from very capable AI systems.
We should be worried about autonomy/misalignment risks, misuse for seizing power, and society generally going off the rails or going crazy due to AI.
AI is “The single most serious national security threat we’ve faced in a century, possibly ever.” (I’d say ever, at least if we prioritize risks in proportion to their badness.)
I appreciate that:
Dario wrote publicly and relatively clearly about what he thinks about the risks from powerful AI systems.
Dario publicly said that “We need to draw a hard line against AI abuses within democracies. There need to be limits to what we allow our governments to do with AI, so that they don’t seize power or repress their own people.” I think this may have significant political costs for Anthropic and I appreciate the courage. (I also agree with the concern.)
[1] If I had to guess a number based on his statements in the essay, I’d guess he expects a 5% chance of misaligned AI takeover.
[2] Dario also makes a somewhat false claim about the capability progression over the last 3 years (likely a mistake about the time elapsed due to sitting on this essay for a while?). He says “Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code.” I think this was basically true for “4 years ago”, but not 3. A little less than 3 years ago, GPT-4 was released, and GPT-3.5 was already out 3 years ago (text-davinci-003). Both of these models could certainly write some code and solve a reasonably large fraction of elementary school math problems.
Yeah, I don’t buy that this model can/should be applied with r=1.6 to right now, though I agree it could be general enough in principle.
Sorry, doesn’t this web app assume full automation of AI R&D as the starting point? I don’t buy that you can just translate this model to the pre full automation regime.
On X/twitter Jerry Wei (Anthropic employee working on misuse/safeguards) wrote something about why Anthropic ended up thinking that training data filtering isn’t that useful for CBRN misuse countermeasures:
An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn’t know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here’s why.
I’ll first acknowledge a potential strength of this approach. If models simply didn’t know much about dangerous topics, we wouldn’t have to worry about people jailbreaking them or stealing model weights—they just wouldn’t be able to help with dangerous topics at all. This is an appealing property that’s hard to get with other safety approaches.
However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn’t enough assurance against misuse). Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it’s not clear to us that the reduced risk from chemical and biological weapons outweigh the benefits of models helping with beneficial life-sciences work).
Bottom line—filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models’ ability to do their own research improves.^ Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easy to update against new jailbreaking attacks.
^The cost-benefit tradeoff could look pretty different for other misuse threats or misalignment threats though, so I wouldn’t rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.
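(For intuition only, here is a minimal sketch of the narrow, keyword-style pretraining filtering being discussed; this is not Anthropic’s actual pipeline, and the term list and threshold are placeholders.)

```python
# Minimal illustrative sketch of narrow pretraining-data filtering.
# Not Anthropic's pipeline; the term list and threshold are placeholders.
BLOCKED_TERMS = {"placeholder_hazard_term_a", "placeholder_hazard_term_b"}

def keep_document(text: str, max_hits: int = 1) -> bool:
    """Keep a pretraining document unless it mentions too many blocked terms."""
    lowered = text.lower()
    hits = sum(term in lowered for term in BLOCKED_TERMS)
    return hits <= max_hits

corpus = [
    "benign chemistry lecture notes",
    "placeholder_hazard_term_a steps alongside placeholder_hazard_term_b details",
]
filtered = [doc for doc in corpus if keep_document(doc)]  # drops the second doc
```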
Apparently, similar work on prompt repetition came out a bit before I published this post (but after I ran my experiments) and reproduces the effect. It looks like this paper doesn’t test the “no output / CoT prior to answer” setting and instead tests the setting where you use non-reasoning LLMs (e.g. Opus 4.5 without extended thinking enabled).
To be clear, I’m probably not (highly) counterfactual for other work in this area. As I note in this post:
While I expect that the sort of proposal I discuss here is well known, there are many specific details I discuss here which I haven’t seen discussed elsewhere. If you are reasonably familiar with this sort of proposal, consider just reading the “Summary of key considerations” section which summarizes the specific and somewhat non-obvious points I discuss in this post.
I think this idea has been independently invented many times; this post mostly adds additional considerations and popularization.
(TBC, I tentatively believe this post had a reasonably large impact, but through these mechanisms, not by inventing the idea!)
A well-designed needle in a haystack should be hard to detect with “normal” software. (Like maybe a very weak LLM can notice the high perplexity, but I think it should be very hard to write a short (“normal”) Python function that doesn’t use libraries and does this.)
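As a concrete version of the “weak LLM notices high perplexity” idea, here is a minimal sketch using gpt2 as a stand-in for a very weak model; the model choice, window size, and scoring are all assumptions on my part.

```python
# Minimal sketch: flag the most "surprising" window in a document using
# per-token surprisal from a small LM (gpt2 as a stand-in weak model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal_per_token(text: str) -> list[float]:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of each token given its prefix (skip the first token).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logprobs = logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return (-token_logprobs).tolist()

def most_surprising_window(text: str, window: int = 20) -> int:
    # Token index where the highest-surprisal window starts.
    s = surprisal_per_token(text)
    starts = range(max(1, len(s) - window + 1))
    return max(starts, key=lambda i: sum(s[i:i + window]))
```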
The fit is notably better for “cumulative investment over time”. Years still produces a slightly better fit.
I’ve cut off the fit as of 2010, about when the original version of Moore’s law stops. If you try to project out after 2010, then I think cumulative investment would do better, but I think only because of investment slowing in response to Moore’s law dying.
(Doing the fit to a lagged-by-3 years investment series doesn’t make any important difference.)
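For what it’s worth, a minimal sketch of the comparison being described might look like the following; the data file, the column names, and the assumption that the fitted quantity is log transistor count are all mine, not from the original analysis.

```python
# Minimal sketch of comparing fits of log(transistor count) against calendar
# year vs. against log(cumulative investment), cut off at 2010 as in the text.
# The file name and column names are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("chips.csv")
df = df[df["year"] <= 2010]

y = np.log(df["transistor_count"].to_numpy(dtype=float))

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

print("R^2 vs year:", r_squared(df["year"].to_numpy(dtype=float), y))
print("R^2 vs log cumulative investment:",
      r_squared(np.log(df["cumulative_investment"].to_numpy(dtype=float)), y))
```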
Easy for Claude to bypass, though (it does actually fetch a PDF that you can directly download from Anthropic’s servers).
Good point, this is an error in my post.