In 2021, a circle of researchers left OpenAI after a bitter dispute with its executives. They started a competing company, Anthropic, stating that they wanted to put safety first. The safety community responded with broad support: thought leaders recommended that engineers apply, and allied billionaires invested.[1]
Anthropic’s focus has shifted – from internal-only research and cautious demos of model safety and capabilities, toward commercialising models for Amazon and the military.
Despite the shift, 80,000 Hours continues to recommend that talented engineers join Anthropic.[2] On the LessWrong forum, many authors continue to support safety work at Anthropic, but I also see side-conversations where people raise concerns about premature model releases and policy overreach. So we have a bunch of seemingly conflicting opinions about work by different Anthropic staff, and no overview. The bigger problem, though, is that we are not evaluating Anthropic against its original justification for existence.
Did early researchers put safety first? And did their work set the right example to follow, raising the prospect of a ‘race to the top’? If yes, we should keep supporting Anthropic. Unfortunately, I argue, it’s a strong no. From the get-go, these researchers acted in effect as moderate accelerationists.
Some limitations of this post:
I was motivated to write because I’m concerned about how contributions by safety folks to AGI labs have accelerated development, and I want this to be discussed more. Anthropic staff already make cogent cases on the forum for how their work would improve safety. What is needed is a clear countercase. This is not a balanced analysis.
I skip many nuances. The conclusion seems roughly right though, because of overdetermination. Two courses of action – scaling GPT rapidly under a safety guise, and starting a ‘safety-first’ competitor that actually competed on capabilities – each shortened timelines so much that no other actions taken could compensate. Later actions at Anthropic were less bad but still worsened the damage.[3]
I skip details of technical safety agendas because these carry little to no weight. As far as I see, there was no groundbreaking safety progress at or before Anthropic that can justify the speed-up that their researchers caused. I also think their minimum necessary aim is intractable (controlling ‘AGI’ enough, in time or ever, to stay safe[4]).
I fail to mention other positive contributions made by Anthropic folks to the world.[5] This feels unfair. If you joined Anthropic later, this post is likely not even about your work, though consider whether you’re okay with following your higher-ups.
I focus on eight collaborators at OpenAI – most of whom worked directly on scaling or releasing GPT-2 and GPT-3 – who went on to found, lead, or advise Anthropic.
I zero in on actions by Dario Amodei, since he acted as a leader throughout, and therefore his actions had more influence and were covered more in public reporting. If you have inside knowledge, please chip in and point out any misinterpretations.
I imply GPT was developed just by Dario and others from the safety community. This is not true. Ilya Sutskever, famous for scaling AlexNet’s compute during his PhD under Hinton, officially directed scaling the transformer models at OpenAI. Though Ilya moved to the ‘Superalignment’ team and left to found ‘Safe Superintelligence’, he does not seem to engage much in discussions with safety folks here. Other managers publicly committed to supporting ‘safety’ work (e.g. Sam Altman), but many did not (e.g. Dario’s co-lead, Alec Radford). All joined forces to accelerate development.
I have a perspective on what ‘safety’ should be about: Safety is about constraining a system’s potential for harm. Safety is about protecting users and, from there, our society and ecosystem at large. If one cannot even design a product and the business operations it relies on to not harm currently living people, there is no sound basis to believe that scaling that design up will not also deeply harm future generations. → If you disagree with this perspective, then sections 4 and 5 are less useful for you.
Let’s dig into five courses of action:
1. Scaled GPT before founding Anthropic
Dario Amodei co-led the OpenAI team that developed GPT-2 and GPT-3. He, Tom Brown, Jared Kaplan, Benjamin Mann, and Paul Christiano were part of a small cohort of technical researchers responsible for enabling OpenAI to release ChatGPT.
This is covered in a fact-checked book by the journalist Karen Hao. I was surprised by how large Dario’s role was – for years I had seen him as a safety researcher. His scaling of GPT was instrumental not only in setting him up to found Anthropic in 2021, but also in setting off the boom after ChatGPT.
So I’ll excerpt from the book, to provide the historical context for the rest of this post:
GPT-1 barely received any attention. But this was only the beginning. Radford had validated the idea enough to continue pursuing it. The next step was more scale.
Radford was given more of the company’s most precious resource: compute. His work dovetailed with a new project Amodei was overseeing in AI safety, in line with what Nick Bostrom’s Superintelligence had suggested. In 2017, one of Amodei’s teams began to explore a new technique for aligning AI systems to human preferences. They started with a toy problem, teaching an AI agent to do backflips in a virtual video game–like environment.
Amodei wanted to move beyond the toy environment, and Radford’s work with GPT-1 made language models seem like a good option. But GPT-1 was too limited. “We want a language model that humans can give feedback on and interact with,” Amodei told me in 2019, where “the language model is strong enough that we can really have a meaningful conversation about human values and preferences.”
Radford and Amodei joined forces. As Radford collected a bigger and more diverse dataset, Amodei and other AI safety researchers trained up progressively larger models. They set their sights on a final model with 1.5 billion parameters, or variables, at the time one of the largest models in the industry. The work further confirmed the utility of Transformers, as well as an idea that another one of Amodei’s teams had begun to develop after their work on OpenAI’s Law. There wasn’t just one empirical law but many. His team called them collectively “scaling laws.”
Where OpenAI’s Law described the pace at which the field had previously expanded its resources to advance AI performance, scaling laws described the relationship between the performance of a deep learning model and three key inputs: the volume of a model’s training data, the amount of compute it was trained on, and the number of its parameters.
Previously, AI researchers had generally understood that increasing these inputs somewhat proportionally to one another could also lead to a somewhat proportional improvement in a model’s capabilities. … Many at OpenAI had been pure language skeptics, but GPT-2 made them reconsider. Training the model to predict the next word with more and more accuracy had gone quite far in advancing the model’s performance on other seemingly loosely related language processing tasks. It seemed possible, even plausible, that a GPT model could develop a broader set of capabilities by continuing down this path: pushing its training and improving the accuracy of its next-word-prediction still further. Amodei began viewing scaling language models as – though likely not the only thing necessary to reach AGI – perhaps the fastest path toward it. It didn’t help that the robotics team was constantly running into hardware issues with its robotic hand, which made for the worst combination: costly yet slow progress.
But there was a problem: If OpenAI continued to scale up language models, it could exacerbate the possible dangers it had warned about with GPT-2. Amodei argued to the rest of the company – and Altman agreed – that this did not mean it should shy away from the task. The conclusion was in fact the opposite: OpenAI should scale its language model as fast as possible, Amodei said, but not immediately release it. … For the Gates Demo in April 2019, OpenAI had already scaled up GPT-2 into something modestly larger. But Amodei wasn’t interested in a modest expansion. If the goal was to increase OpenAI’s lead time, GPT-3 needed to be as big as possible. Microsoft was about to deliver a new supercomputer to OpenAI as part of its investment, with ten thousand Nvidia V100s, what were then the world’s most powerful GPUs for training deep learning models. (The V was for Italian chemist and physicist Alessandro Volta). Amodei wanted to use all of those chips, all at once, to create the new large language model.
The idea seemed to many nothing short of absurdity. Before then, models were already considered large-scale if trained on a few dozen chips. In top academic labs at MIT and Stanford, PhD students considered it a luxury to have ten chips. In universities outside the US, such as in India, students were lucky to share a single chip with multiple peers, making do with a fraction of a GPU for their research.
Many OpenAI researchers were skeptical that Amodei’s idea would even work. Some also argued that a more gradual scaling approach would be more measured, scientific, and predictable. But Amodei was adamant about his proposal and had the backing of other leaders. Sutskever was keen to play out his hypothesis of scaling Transformers; Brockman wanted to continue raising the company’s profile; Altman was pushing to take the biggest swing possible. Soon after, Amodei was promoted to a VP of research.
Dario Amodei insisted on scaling fast, even as others suggested a more gradual approach. It’s more than that – his circle actively promoted it. Dario’s collaborator and close friend, Jared Kaplan, led a project to investigate the scaling of data, compute, and model size.
In January 2020, Jared and Dario published the Scaling Laws paper along with Tom Brown and Sam McCandlish (later CTO at Anthropic). That means a majority of Anthropic’s founding team of seven were on this one paper.
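For context on what was published: the paper’s headline result was a set of simple power-law fits, showing that test loss falls smoothly and predictably as each of three inputs grows (when the other two are not the bottleneck). Roughly, with exponent values as I recall them from the paper, so treat them as approximate:

$$L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},\quad L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},\quad L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}$$

with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.05$, where $N$ is the number of parameters, $D$ the dataset size in tokens, and $C$ the training compute. The takeaway a competitor could draw was blunt: keep scaling parameters, data, and compute in balance, and loss keeps dropping on schedule.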
None of this is an infohazard, but it does pull the attention – including of competitors – toward the idea of scaling faster. This seems reckless – if you want to have more gradual development so you have time to work on safety, then what is the point? There is a scientific interest here, but so was there in scaling the rate of fission reactions. If you go ahead publishing anyway, you’re acting as a capability researcher, not a safety researcher.
This was not the first time.
In June 2017, Paul Christiano, who later became a trustee of Anthropic’s Long-Term Benefit Trust, published a paper on a technique he invented, reinforcement learning from human feedback (RLHF). His co-authors included Dario and Tom – as well as Jan Leike, who joined Anthropic later.
Here is the opening text:
For sophisticated reinforcement learning systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems.
The authors here emphasised making agents act usefully, by bringing the cost of human oversight down far enough to be practical.
Recall that Dario joined forces on developing GPT because he wanted to apply RLHF beyond toy environments. This allowed Dario and Paul to make GPT usable in superficially safe ways and, as a result, commercialisable. Paul later gave justifications for why inventing RLHF and applying the technique to improving model functionality had low downside. There are reasons to be skeptical.
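To make concrete what the 2017 technique involves, here is a minimal sketch (my own illustration, not the paper’s code) of its core step: fit a reward model to human preferences between pairs of trajectory segments, treating each preference as a logistic comparison of the summed predicted rewards, then use that learned reward in place of a hand-written reward function when training the policy. Names and dimensions below are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Predicts a scalar reward for each (observation, action) step."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: [batch, T, obs_dim], act: [batch, T, act_dim]
        per_step = self.net(torch.cat([obs, act], dim=-1))  # [batch, T, 1]
        return per_step.sum(dim=(1, 2))  # summed reward per segment: [batch]

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Cross-entropy on P(A preferred) = sigmoid(sum_reward_A - sum_reward_B)."""
    r_a = reward_model(*seg_a)
    r_b = reward_model(*seg_b)
    p_a = torch.sigmoid(r_a - r_b)  # two-way softmax over summed rewards
    return F.binary_cross_entropy(p_a, prefs)

# Illustrative usage, with random tensors standing in for human-labelled pairs:
obs_dim, act_dim, T, batch = 8, 2, 25, 16
rm = RewardModel(obs_dim, act_dim)
seg_a = (torch.randn(batch, T, obs_dim), torch.randn(batch, T, act_dim))
seg_b = (torch.randn(batch, T, obs_dim), torch.randn(batch, T, act_dim))
prefs = torch.randint(0, 2, (batch,)).float()  # 1.0 if humans preferred A
preference_loss(rm, seg_a, seg_b, prefs).backward()
# In the full method, this is alternated with RL training against the
# learned reward, with humans labelling fresh segment pairs along the way.
```

The same recipe, applied to comparisons of model-written text instead of game trajectories, is what later made GPT-style models steerable enough to productise.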
In December 2020, Dario’s team published the paper that introduced GPT-3. Tom is the first author of the paper, followed by Benjamin Mann, another Anthropic founder.
Here is the opening text:
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.
To me, this reads like the start of a recipe for improving capabilities. If your goal is actually to prevent competitors from accelerating capabilities, why show them the way?
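For readers unfamiliar with the term: ‘few-shot’ here just means placing a handful of worked examples in the prompt and letting the model continue the pattern, with no retraining. Something like this (my own illustration, in the style of the paper’s translation demos):

```python
# Illustrative few-shot prompt: the "training" is just examples in the
# context window; the model's weights are never updated.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>"""
# GPT-3 is then sampled to continue the prompt; a capable model completes
# it with "fromage", despite never being fine-tuned for translation.
```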
But by that point, the harm had already been done, as covered in Karen Hao’s book:
The unveiling of the GPT-3 API in June 2020 sparked new interest across the industry to develop large language models. In hindsight, the interest would look somewhat lackluster compared with the sheer frenzy that would ignite two years later with ChatGPT. But it would lay the kindling for that moment and create an all the more spectacular explosion.
At Google, researchers shocked that OpenAI had beat them using the tech giant’s own invention, the Transformer, sought new ways to get in on the massive model approach. Jeff Dean, then the head of Google Research, urged his division during an internal presentation to pool together the compute from its disparate language and multimodal research efforts to train one giant unified model. But Google leaders wouldn’t adopt Dean’s suggestion until ChatGPT spooked them with a “code red” threat to the business, leaving Dean grumbling that the tech giant had missed a major opportunity to act earlier.
At DeepMind, the GPT-3 API launch roughly coincided with the arrival of Geoffrey Irving, who had been a research lead in OpenAI’s Safety clan before moving over. Shortly after joining DeepMind in October 2019, Irving had circulated a memo he had brought with him from OpenAI, arguing for the pure language hypothesis and the benefits of scaling large language models. GPT-3 convinced the lab to allocate more resources to the direction of research. After ChatGPT, panicked Google leaders would merge the efforts at DeepMind and Google Brain under a new centralized Google DeepMind to advance and launch what would become Gemini.
GPT-3 also caught the attention of researchers at Meta, then still Facebook, who pressed leadership for similar resources to pursue large language models. But leaders weren’t interested, leaving the researchers to cobble together their own compute under their own initiative. Yann LeCun, the chief AI scientist at Meta...had a distaste for OpenAI and what he viewed as its bludgeon approach to pure scaling. He didn’t believe the direction would yield true scientific advancement and would quickly reveal its limits. ChatGPT would make Mark Zuckerberg deeply regret sitting out the trend and marshal the full force of Meta’s resources to shake up the generative AI race.
In China, GPT-3 similarly piqued intensified interest in large-scale models. But as with their US counterparts, Chinese tech giants, including e-commerce giant Alibaba, telecommunications giant Huawei, and search giant Baidu, treated the direction as a novel addition to their research repertoire, not a new singular path of AI development warranting the suspension of their other projects. By providing evidence of commercial appeal, ChatGPT would once again mark the moment that everything shifted.
Although the industry’s full pivot to OpenAI’s scaling approach might seem slow in retrospect, in the moment itself, it didn’t feel slow at all. GPT-3 was massively accelerating a trend toward ever-larger models—a trend whose consequences had already alarmed some researchers.
So GPT-3 – as scaled by Dario’s team and linked up to an API – had woken up capability researchers at other labs, even though their executives were not yet budging on strategy.
Others were alarmed and advocated internally against scaling large language models. These were not AGI safety researchers, though, but critical AI researchers like Dr. Timnit Gebru.
In March 2021, the paper Timnit Gebru had co-authored – the one that led Google leaders (namely Jeff Dean) to force her out in late 2020 – was published. Notice the contrast with the opening texts quoted earlier:
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks?
This matches how I guess careful safety researchers write. Cover the past architectural innovations but try not to push for more. Focus on risks and paths to mitigate those risks.
Instead, Dario’s circle acted as capability researchers at OpenAI. At the time, at least three rationales were given for why scaling capabilities is a responsible thing to do:
Rationale #1: ‘AI progress is inevitable’
Dario’s team expected that if they did not scale GPT, this direction of development would have happened soon enough at another company anyway. This is questionable.
Even Google, the originator of the transformer, refrained from training on copyrighted text. Training on a library-sized corpus was unheard of. Even after the release of GPT-3, Jeff Dean, head of Google’s AI research at the time, failed to convince Google executives to ramp up investment in LLMs. Only after ChatGPT was released did Google switch to ‘code red’.
Chinese companies would not have started what OpenAI did, Karen Hao argues:
As ChatGPT swept the world by storm in early 2023, a Chinese AI researcher would share with me a clear-eyed analysis that unraveled OpenAI’s inevitability argument. What OpenAI did never could have happened anywhere but Silicon Valley, he said. In China, which rivals the US in AI talent, no team of researchers and engineers, no matter how impressive, would get $1 billion, let alone ten times more, to develop a massively expensive technology without an articulated vision of exactly what it would look like and what it would be good for. Only after ChatGPT’s release did Chinese companies and investors begin funding the development of gargantuan models with gusto, having now seen enough evidence that they could recoup their investments through commercial applications.
Through the course of my reporting, I would come to conclude something even more startling. Not even in Silicon Valley did other companies and investors move until after ChatGPT to funnel unqualified sums into scaling. That included Google and DeepMind, OpenAI’s original rival. It was specifically OpenAI, with its billionaire origins, unique ideological bent, and Altman’s singular drive, network, and fundraising talent, that created a ripe combination for its particular vision to emerge and take over.
Only Dario and collaborators were massively scaling transformers on texts scraped from pirated books and webpages. If the safety folks had refrained, scaling would have been slower. And OpenAI may have run out of compute – since if it had not scaled so fast to GPT-2+, Microsoft might not have made the $1 billion investment, and OpenAI would not have been able to spend most of it on discounted Azure compute to scale to GPT-3.
Karen Hao covers what happened at the time:
Microsoft, meanwhile, continued to deliberate. Nadella, Scott, and other Microsoft executives were already on board with an initial investment. The one holdout was Bill Gates.
For Gates, Dota 2 wasn’t all that exciting. Nor was he moved by robotics. The robotics team had created a demo of a robotic hand that had learned to solve a Rubik’s Cube through its own trial and error, which had received universally favorable coverage. Gates didn’t find it useful. He wanted an AI model that could digest books, grasp scientific concepts, and answer questions based on the material—to be an assistant for conducting research.
GPT-2 wasn’t even close to grasping scientific concepts, but the model could do some basic summarization of documents and sort of answer questions. Perhaps, some of OpenAI’s researchers wondered, if they trained a larger model on more data and to perform tasks that at least looked more like what Gates wanted, they could sway him from being a detractor to being, at minimum, neutral. In April 2019, a small group of those researchers flew to Seattle to give what they called the Gates Demo of a souped-up GPT-2. By the end of it, Gates was indeed swayed just enough for the deal to go through.
No other company was prepared to train transformer models on text at this scale. And it’s unclear whether OpenAI would have gotten to a ChatGPT-like product without the efforts of Dario and others in his safety circle. It’s not implausible that OpenAI would have folded.[6] It was a nonprofit that was bleeding cash on retaining researchers who were among the most in-demand in the industry, while exploring various unprofitable directions.
The existence of OpenAI shortened the time to a ChatGPT-like product by, I guess, at least a few years. It was Dario’s circle racing to scale to GPT-2 and GPT-3 – and then racing to compete at Anthropic – that removed most of the bottlenecks to getting there.
What if, upon seeing GPT-1, they had reacted “Hell no. The future is too precious to gamble on capability scaling”? What if they had looked for allies and used every tactic in the book to prevent dangerous scaling? They didn’t seem motivated to. If they had, they would have been forced out earlier, as Timnit Gebru was. But our communities would now be in a better position to make choices than the one they actually left us in.
Rationale #2: ‘we scale first so we can make it safe’
Recall this earlier excerpt:
But there was a problem: If OpenAI continued to scale up language models, it could exacerbate the possible dangers it had warned about with GPT-2. Amodei argued to the rest of the company – and Altman agreed – that this did not mean it should shy away from the task. The conclusion was in fact the opposite: OpenAI should scale its language model as fast as possible, Amodei said, but not immediately release it.
Dario thought that by getting ahead, his research circle could then take the time to make the most capable models safe before (commercial) release. The alternative in his eyes was allowing reckless competitors to get there first and deploy faster.
While Dario’s circle cared particularly about safety in the existential sense, in retrospect it seems misguided to justify actual, accelerated development with the speculative hope of experimentally reaching safety milestones that would otherwise go unreached. What they ended up doing was using RLHF to finetune models for relatively superficial safety aspects.
Counterfactually, any company first on the scene here would likely have finetuned its models anyway for many of the same safety aspects, forced by demands from consumers and government enforcement agencies. Microsoft’s staff did so after its rushed release of the Sydney chatbot (an early GPT-4 deployment) triggered intense public reactions.
Maybe, though, RLHF enabled interesting work on complex alignment proposals. But is this significant progress on the actual hard problem? Can any such proposal be built into something comprehensive enough to keep fully autonomous learning systems safe?
Dario’s rationale further relied on his expectation that OpenAI’s leaders would delay releasing the models his team had scaled up, and that this would stem a capability race.
But OpenAI positioned itself as ‘open’. And its researchers participated in an academic community where promoting your progress in papers and conferences is the norm. Every release of a GPT codebase, demo, or paper alerted other interested competing researchers. Connor Leahy could just use Dario’s team’s previously published descriptions of methods to train his own version. Jack Clark, who was policy director at OpenAI and is now at Anthropic, ended up delaying the release of GPT-2’s code by around 9 months.
Worse, GPT-3 was quickly packaged into a commercial release through Microsoft. This was not Dario’s intent; he apparently felt Sam Altman had misled him. Dario did not discern that he was being manipulated by a tech leader with a track record of manipulation.
By scaling unscoped models that hide all kinds of bad functionality, and can be misused at scale (e.g. to spread scams or propaganda), Dario’s circle made society less safe. By simultaneously implying they could or were making these inscrutable models safe, they were in effect safety-washing.
Chris Olah’s work on visualising circuits and mechanistic interpretability made for flashy articles promoted on OpenAI’s homepage. In 2021, I saw an upsurge of mechinterp teams joining AI Safety Camp, which I supported, seeing it as cool research. It nerdsniped many, but progress in mechinterp has remained stuck around mapping the localised features of neurons and the localised functions of larger circuits, under artificially constrained input distributions. This is true even of later work at Anthropic, which Chris went on to co-found.
Some researchers now dispute that mapping mechanistic functionality is a tractable aim. The actual functioning of a deployed LLM is complex, since it depends not only on how shifting inputs received from the world are computed into outputs, but also on how those outputs get used or propagated in the world.
Externally, “the outputs…go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences” (to quote Eliezer Yudkowsky).
Traction is limited in terms of the subset of input-to-output mappings that get reliably interpreted, even in a static neural network. Even where computations of inputs to outputs are deterministically mapped, this misses how outputs end up corresponding to effects in the noisy physical world (and how effects feed back into model inputs/training).
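To be concrete about what ‘mapping localised features’ looks like in practice, here is a minimal sketch (a toy stand-in, not Anthropic’s or OpenAI’s code) of one standard move: pick a neuron, run a fixed probe dataset through the network, and inspect the inputs that activate it most strongly. The interpretation you get is conditional on the probe distribution you chose, which is exactly the limitation described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a trained network; in real work this would be a frozen LLM.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
probe_layer = model[1]   # look at post-ReLU activations
neuron_index = 7         # the single neuron we try to 'interpret'

captured = []
def hook(_module, _inputs, output):
    captured.append(output[:, neuron_index].detach())
handle = probe_layer.register_forward_hook(hook)

# A *chosen* probe distribution -- conclusions are conditional on this choice.
dataset = torch.randn(1000, 32)
with torch.no_grad():
    model(dataset)
handle.remove()

activations = torch.cat(captured)
top_values, top_rows = activations.topk(5)
for value, row in zip(top_values.tolist(), top_rows.tolist()):
    print(f"input #{row}: activation {value:.3f}")
# A human then eyeballs those inputs and guesses what 'feature' the neuron
# encodes -- a guess that may not transfer outside this input distribution,
# and that says nothing about what the outputs end up doing in the world.
```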
Interpretability could be used for specific safety applications, or for AI ‘gain of function’ research. I’m not necessarily against Chris’ research. What’s bad is how it got promoted.
Researchers in Chris’ circle promoted interpretability as a solution to an actual problem (inscrutable models) that they were making much worse (by scaling the models). They implied the safety work to be tractable in a way that would catch up with the capability work that they were doing. Liron Shapira has a nice term for this: tractability-washing.
Tractability-washing corrupts. It keeps our community from acting with integrity to prevent reckless scaling. If accelerationists at Meta, instead of Dario’s team, had taken over GPT training, we would at least have known where we stood. Clearly, then, it was reckless to scale data by 100x, parameters by 1,000x, and compute by 10,000x – over just three years.
But safety researchers did this, making it hard to orient. Was it okay to support trusted folks in safety to get to the point that they could develop their own trillion-parameter models? Or was it bad to keep supporting people who kept on scaling capabilities?
Rationale #3: ‘we reduce the hardware overhang now to prevent disruption later’
Another fairly common argument and motivation at OpenAI in the early days was the risk of “hardware overhang,” that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.[7]
Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.
It’s unclear what “many of us” means, and I do not want to presume that Sam accurately represented the views of his employees. But the draft was reviewed by “Paul Christiano, Jack Clark, Holden Karnofsky” – all of whom were already collaborating with Dario.
The rationale of reducing hardware overhang is flawed:
It accelerates hardware production. Using more hardware increases demand for that hardware, triggering a corresponding increase in supply. Microsoft did not just provide more of its data centers to OpenAI, but also built more data centers to house more chips it could buy from Nvidia. Nvidia in turn reacted by scaling up production of its chips, especially once there was a temporary supply shortage.
It is a justification that can be made just as well by someone racing to the bottom. Sam Altman did not just try to use up the hardware overhang: once chips got scarce, he pitched the UAE to invest massively in new chip manufacturing. And Tom Brown, just before leaving for Anthropic, was in late-stage discussions with Fathom Radiant to get cheap access to their new fibre-optic-connected supercomputer.[8]
It assumes that ‘AGI’ is inevitable and/or desirable. Yet new technologies can be banned (especially while still unprofitable and not depended on by society). And there are sound, unrefuted reasons why keeping these unscoped, autonomously learning and operating machine systems safe would actually be intractable.
2. Founded an ‘AGI’ development company and started competing on capabilities
Karen Hao reports on the run-up to Dario’s circle leaving OpenAI:
Behind the scenes, more than one, including Dario, discussed with individual board members their concerns about Altman’s behavior: Altman had made each of OpenAI’s decisions about the Microsoft deal and GPT-3’s deployment a foregone conclusion, but he had maneuvered and manipulated dissenters into believing they had a real say until it was too late to change course. Not only did they believe such an approach could one day be catastrophically, or even existentially, dangerous, it had proven personally painful for some and eroded cohesion on the leadership team. To people around them, the Amodei siblings would describe Altman’s tactics as “gaslighting” and “psychological abuse.”
As the group grappled with their disempowerment, they coalesced around a new idea. Dario Amodei first floated it to Jared Kaplan, a close friend from grad school and former roommate who worked part time at OpenAI and had led the discovery of scaling laws, and then to Daniela, Clark, and a small group of key researchers, engineers, and others loyal to his views on AI safety. Did they really need to keep fighting for better AI safety practices at OpenAI? he asked. Could they break off to pursue their own vision? After several discussions, the group determined that if they planned to leave, they needed to do so imminently. With the way scaling laws were playing out, there was a narrowing window in which to build a competitor. “Scaling laws mean the requirements for training these frontier things are going to be going up and up and up,” says one person who parted with Amodei. “So if we wanted to leave and do something, we’re on a clock, you know?” … Anthropic people would later frame The Divorce, as some called it, as a disagreement over OpenAI’s approach to AI safety. While this was true, it was also about power. As much as Dario Amodei was motivated by a desire to do what was right within his principles and to distance himself from Altman, he also wanted greater control of AI development to pursue it based on his own values and ideology. He and the other Anthropic founders would build up their own mythology about why Anthropic, not OpenAI, was a better steward of what they saw as the most consequential technology. In Anthropic meetings, Amodei would regularly punctuate company updates with the phrase “unlike Sam” or “unlike OpenAI.” But in time, Anthropic would show little divergence from OpenAI’s approach, varying only in style but not in substance. Like OpenAI, it would relentlessly chase scale. Like OpenAI, it would breed a heightened culture of secrecy even as it endorsed democratic AI development. Like OpenAI, it would talk up cooperation when the very premise of its founding was rooted in rivalry.
There is a repeating pattern here: Founders of an AGI start-up air their concerns about ‘safety’, and recruit safety-concerned engineers and raise initial funding that way. The culture sours under controlling leaders, as the company grows dependent on Big Tech’s compute and billion-dollar investments.
This pattern has roughly repeated three times:
DeepMind was funded by Jaan Tallinn and Peter Thiel (intro’d by Eliezer). Mustafa Suleyman got fired for abusively controlling employees, but Demis Hassabis stayed on. Then DeepMind lost so much money that it had to be acquired by Google.
Distrusting DeepMind (as now directed by Demis under Google), Sam Altman and Elon Musk founded the nonprofit OpenAI. Sam and Elon fought over the CEO position, and Sam gained control. Holden made a grant to this nonprofit, which subsequently acted illegally as a for-profit under investments by Microsoft.
Distrusting OpenAI (as now directed by Sam to appease Microsoft), Daniela and Dario left to found Anthropic. Then all the top billionaires in the safety community invested. Then Anthropic received $8 billion in investments from Amazon.
We are dealing with a gnarly situation.
Principal-agent problem: The safety community supports new founders who convince them they’ll do good work for the cause of safety. For years, safety folks believe the leaders. But once cases of dishonesty get revealed and leaders sideline safety people who are no longer needed, the community distrusts the leaders and allies with new founders.
Rules for rulers: Leaders seek to gain and maintain their positions of power over AI development. In order to do so, they need to install key people who can acquire the resources needed for them to stay in power, and reward those key people handsomely, even if it means extracting from all the other outside citizens who have no say about what the company does.
Race to the bottom: Collaborators at different companies cut corners believing that if they don’t, then their competitors might get there first and make things even worse. The more the people participating treat this as a finite game in which they are acting independently from other untrusted individual players, the more they lose integrity with their values.
One take on this is a brutal realist stance: That’s just how business gets done. They convince us to part with our time and money and drop us when we’re no longer needed, they gather their loyal lackeys and climb to the top, and then they just keep playing this game of extraction until they’ve won.
It is true that’s how business gets done. But I don’t think any of us here are just in it for the business. Safety researchers went to work at Anthropic because they care. I wouldn’t want us to tune out our values – but it’s important to discern where Anthropic’s leaders are losing integrity with the values we shared.
The safety community started with much trust in and willingness to support Anthropic. That sentiment seems to be waning. We are seeing leaders starting to break some commitments and enter into shady deals like OpenAI leaders did – allowing them to gain relevance in circles of influence, and to keep themselves and their company on top.
Something like this happened before, so discernment is needed. It would suck if we support another ‘safety-focussed’ start-up that ends up competing on capabilities.
I’ll share my impression of how Anthropic staff presented their commitments to safety in the early days, and how this seemed in increasing tension with how the company acted.
Capabilities work generates and improves on the models that we investigate and utilize in our alignment research. We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publication). We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments. We’ve subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller [bold emphasis added].
The general impression I came away with was that Anthropic was going to be careful not to release models with capabilities that significantly exceeded those of ChatGPT and other competing products. Instead, Anthropic would compete on having a reliable and safe product, and try to pull competitors into doing the same.
Dario has repeatedly called for a race to the top on safety, such as in this Time piece.
Amodei makes the case that the way Anthropic competes in the market can spark what it sees as an essential “race to the top” on safety. To this end, the company has voluntarily constrained itself: pledging not to release AIs above certain capability levels until it can develop sufficiently robust safety measures. Amodei hopes this approach—known as the Responsible Scaling Policy—will pressure competitors to make similar commitments, and eventually inspire binding government regulations. “We’re not trying to say we’re the good guys and the others are the bad guys,” Amodei says. “We’re trying to pull the ecosystem in a direction where everyone can be the good guy.”
Degrading commitments
After safety-allied billionaires invested in Series A and B, Anthropic’s leaders moved on to pitch investors outside of the safety community.
In the deck, Anthropic says that it plans to build a “frontier model” — tentatively called “Claude-Next” — 10 times more capable than today’s most powerful AI, but that this will require a billion dollars in spending over the next 18 months. … Anthropic estimates its frontier model will require on the order of 10^25 FLOPs, or floating point operations — several orders of magnitude larger than even the biggest models today. Of course, how this translates to computation time depends on the speed and scale of the system doing the computation; Anthropic implies (in the deck) it relies on clusters with “tens of thousands of GPUs.”
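To put that 10^25 figure in perspective, here is a rough back-of-the-envelope illustration (my own, not from the deck), using the common approximation that training compute is about 6 FLOPs per parameter per training token:

```python
# Rough illustration only: C ≈ 6 * N * D (6 FLOPs per parameter per token).
C = 1e25       # training compute implied by the pitch deck, in FLOPs
N = 175e9      # assume a GPT-3-sized parameter count for illustration
D = C / (6 * N)
print(f"~{D:.1e} training tokens")   # ~9.5e12, i.e. on the order of ten trillion
```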
Some people in the safety community commented with concerns. Anthropic leaders seemed to act as if racing on capabilities was necessary. It felt egregious compared to the expectations that I and friends in safety had gotten from Anthropic. Worse, leaders had kept these new plans hidden from the safety community – it took a leak to a journalist for them to surface.
From there, Anthropic started releasing models with capabilities that ChatGPT lacked:
In July 2023, Anthropic was the first to introduce a large context window reaching 100,000 tokens (about 75,000 words) compared to ChatGPT’s then 32,768 tokens.
In March 2024, Anthropic released Claude 3 Opus, which became preferred by programmers for working on large codebases. Anthropic’s largest customer is Cursor, a coding platform.
In October 2024, Anthropic was first to release an ‘agent’ that automatically takes actions on the user’s computer, including in the browser. It was a beta release that worked pretty poorly and could potentially cause damage for customers.
None of these are major advancements beyond state of the art. You could argue that Anthropic stuck to original commitments here, either deliberately or because they lacked anything substantially more capable than OpenAI to release. Nonetheless, they were competing on capabilities, and the direction of those capabilities is concerning.
If, a decade ago, safety researchers had come up with a list of engineering projects to warn about, I guess it would have included ‘don’t rush to build agents’, ‘don’t connect the agent to the internet’, and ‘don’t build an agent that codes by itself’. While the notion of current large language models actually working as autonomous agents is way overhyped, Anthropic engineers are developing models in directions that would have scared early AGI safety researchers. Even from a system safety perspective, it’s risky to build an unscoped system that can modify surrounding infrastructure in unexpected ways (by editing code, clicking through browsers, etc.).
Anthropic has definitely been less reckless than OpenAI in terms of model releases. I just think that ‘less reckless’ is not a good metric. ‘Less reckless’ is still reckless.
Another way to look at this is that Dario, like other AI leaders before him, does not think he is acting recklessly, because he thinks things likely go well anyway – as he kept saying:
My guess is that things will go really well. But I think there is...a risk, maybe 10% or 20%, that this will go wrong. And it’s incumbent on us to make sure that doesn’t happen.
Declining safety governance
The most we can hope for is oversight by its board, or by the trust set up to elect new board members. But the board’s most recent addition is Reed Hastings, known for scaling a film subscription company, not a safe engineering culture. Indeed, the reason given is that Reed “brings extensive experience from founding and scaling Netflix into a global entertainment powerhouse”. Before that, trustees elected Jay Kreps, giving a similar reason: his “extensive experience in building and scaling highly successful tech companies will play an important role as Anthropic prepares for the next phase of growth”. Before that, Yasmin Razavi from Spark Capital joined, for making the biggest investment in the Series C round.
The board lacks any independent safety oversight. It is presided over by Daniela Amodei, who, along with Dario Amodei, has been on it since Anthropic’s founding. For the rest, three tech leaders joined, prized for their ability to scale companies. There used to be one independent-ish safety researcher, Luke Muehlhauser, but he left one year ago.
The trust itself cannot be trusted. It was supposed to “elect a majority of the board” for the sake of long-term interests such as “to carefully evaluate future models for catastrophic risks”. Instead, trustees brought in two tech guys who are good at scaling tech companies. The trust was also meant to be run by five trustees, but it’s been under that count for almost two years – they failed to replace trustees after two left.
3. Lobbied for policies that minimised Anthropic’s accountability for safety
Jack Clark has been the policy director at Anthropic ever since he left the same role at OpenAI. Under Jack, some of Anthropic’s policy advocacy has tended to reduce the company’s accountability, minimising the extent to which Anthropic would have to abide by hard or comprehensive safety commitments.
Much of this policy work happens behind closed doors; I rely only on some materials I’ve read online.
I’ll focus on two policy initiatives discussed at length in the safety community:
Anthropic’s advocacy for minimal ‘Responsible Scaling Policies’
Anthropic’s lobbying against provisions in California’s safety bill SB 1047.
Anthropic’s RSPs are well known in the safety community. I’ll just point to the case made by Paul Christiano, a month after he joined Anthropic’s Long-Term Benefit Trust:
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation….
I think that sufficiently good responsible scaling policies could dramatically reduce risk, and that preliminary policies like Anthropic’s RSP meaningfully reduce risk by creating urgency around key protective measures and increasing the probability of a pause if those measures can’t be implemented quickly enough.
I don’t think voluntary implementation of responsible scaling policies is a substitute for regulation. Voluntary commitments are unlikely to be universally adopted or to have adequate oversight, and I think the public should demand a higher degree of safety than AI developers are likely to voluntarily implement.
I think that developers implementing responsible scaling policies now increases the probability of effective regulation. If I instead thought it would make regulation harder, I would have significant reservations.
Transparency about RSPs makes it easier for outside stakeholders to understand whether an AI developer’s policies are adequate to manage risk, and creates a focal point for debate and for pressure to improve.
I think the risk from rapid AI development is very large, and that even very good RSPs would not completely eliminate that risk. A durable, global, effectively enforced, and hardware-inclusive pause on frontier AI development would reduce risk further.
While Paul did not wholeheartedly endorse RSPs, and included some reservations, the thrust of it is that he encouraged the safety community to support Anthropic’s internal and external policy work on RSPs.[9]
A key issue with RSPs is how they’re presented as ‘good enough for now’. If companies adopt RSPs voluntarily, the argument goes, it’d lay the groundwork for regulations later.
Several authors on the forum argued that this was misleading.
Consider the original formulation by Anthropic: “Our RSP focuses on catastrophic risks – those where an AI model directly causes large scale devastation.”
In other words: our company can keep scaling as long as our staff/trustees do not deem the risk of a new AI model directly causing a catastrophe to be sufficiently high.
Is that responsible? It assumes that further scaling can be risk-managed. It assumes that risk management protocols alone are enough.
Then, the company invents a new wonky risk management framework, ignoring established and more comprehensive practices.
Paul argues that this could be the basis for effective regulation. But Anthropic et al. lobbying national governments to enforce the use of that wonky risk management framework makes things worse.
It distracts from policy efforts to prevent the increasing harms. It creates a perception of safety (instead of actually ensuring safety).
At the time, Anthropic’s policy team was actively lobbying for RSPs in US and UK government circles. This bore fruit. Ahead of the UK AI Safety Summit, leading AI companies were asked to outline their responsible capability scaling policies. Both OpenAI and DeepMind soon released their own policies on ‘responsible’ scaling.
Some policy folks I knew were so concerned that they went on trips to advocate against RSPs. Organisers put out a treaty petition as a watered-down version of the original treaty, because they wanted to get as many signatories as possible from leading figures, in part to counter Anthropic’s advocacy for self-regulation through RSPs.
Opinions here differ. I think that Anthropic advocated for companies to adopt overly minimal policies that put off accountability for releasing models that violate already established safe engineering practices. I’m going to quote some technical researchers who are experienced in working with and/or advising on these established practices:
Siméon Campos wrote on existing risk management frameworks:
most of the pretty intuitive and good ideas underlying the framework are weak or incomplete versions of traditional risk management, with some core pieces missing. Given that, it seems more reasonable to just start from an existing risk management piece as a core framework. ISO/IEC 23894 or the NIST-inspired AI Risk Management Standards Profile for Foundation Models would be pretty solid starting points.
Heidy Khlaaf wrote on scoped risk assessments before joining UK’s AI Safety Institute:
Consider type certification in aviation; certification and risk assessments are carried out for the approval of a particular vehicle design under specific airworthiness requirements (e.g., Federal Aviation Administration 14 CFR part 21). There is no standard assurance or assessment approach for “generic” vehicle types across all domains. It would be contrary to established safety practices and unproductive to presume that the formidable challenge of evaluating general multi-modal models for all conceivable tasks must be addressed first.
Timnit Gebru replied on industry shortcomings to the National Academy of Engineering:
When you create technology, if you are an engineer, you build something and you say what it is supposed to be used for. You assess what the standard operating characteristics are. You do tests in which you specify the ideal condition and the nonidealities. … It was so shocking for me to see that none of this was being done in the world of AI. One of the papers that I wrote was called “Data Sheets for Data Sets.” This was inspired by the data sheets in electronics. I wrote about how, similar to how we do these tests in other engineering practices, we need to do the same here. We need to document. We need to test. We need to communicate what things should be used for.
Once, a senior safety engineer working on medical devices messaged me, alarmed after the release of ChatGPT. It boggled her that such an unscoped product could just be released to the public. In her industry, medical products have to be designed for a clearly defined scope (setting, purpose, users) and tested for safety within that scope. This all has to be documented in book-sized volumes of paperwork, and the FDA gets the final say.
Other established industries also have lengthy premarket approval processes. New cars and planes too must undergo audits, before a US government department decides to deny or approve the product’s release to market.
The AI industry, however, is an outgrowth of the software industry, which has a notorious disregard of safety. Start-ups sprint to code up a product and rush through release stages.
XKCD put it well:
At least programmers at start-ups write out code blocks with somewhat interpretable functions. Auto-encoded weights of LLMs, on the other hand, are close to inscrutable.
So that’s the context Anthropic is operating in.
Safety practices in the AI industry are often appalling. Companies like Anthropic ‘scale’ by automatically encoding a model to learn hidden functionality from terabytes of undocumented data, and then marketing it as a product that can be used everywhere.
Releasing unscoped automated systems like this is a set-up for insidious and eventually critical failures. Anthropic can’t evaluate Claude comprehensively for such safety issues.
Staff do not openly admit that they are acting way outside the bounds of established safety practices. Instead, they expect us to trust them to take on some minimal responsibilities while scaling models. Rather than a race to the top, Anthropic cemented a race to the bottom.
I don’t deny the researchers’ commitment – they want to make general AI generally safe. But if the problem turns out to be too complex to adequately solve, or they don’t follow through, we’re stuffed.
Unfortunately, their leaders recently backpedalled on one internal policy commitment:
When Anthropic published its first RSP in September 2023, the company made a specific commitment about how it would handle increasingly capable models: “we will define ASL-2 (current system) and ASL-3 (next level of risk) now, and commit to define ASL-4 by the time we reach ASL-3, and so on.” In other words, Anthropic promised it wouldn’t release an ASL-3 model until it had figured out what ASL-4 meant.
Yet the company’s latest RSP, updated May 14, doesn’t publicly define ASL-4 — despite treating Claude 4 Opus as an ASL-3 model. Anthropic’s announcement states it has “ruled out that Claude Opus 4 needs the ASL-4 Standard.”
When asked about this, an Anthropic spokesperson told Obsolete that the 2023 RSP is “outdated” and pointed to an October 2024 revision that changed how ASL standards work. The company now says ASLs map to increasingly stringent safety measures rather than requiring pre-defined future standards.
The responsibility Anthropic has staked out for designing its models to be safe is minimal. It can change its internal policy at any time, and we cannot trust its board to keep leaders in line.
This leaves external regulation. As Paul wrote: “I don’t think voluntary implementation of responsible scaling policies is a substitute for regulation. Voluntary commitments are unlikely to be universally adopted or to have adequate oversight, and I think the public should demand a higher degree of safety.”
Unfortunately, Anthropic has lobbied to cut down regulations that were widely supported by the public. The clearest case of this is California’s safety bill SB 1047.
Lobbied against provisions in SB 1047
The bill’s demands were light, mandating ‘reasonable care’ in training future models to prevent critical harms. It did not even apply to the model that Anthropic had pitched to investors as requiring 10^25 FLOPs to train: the bill only kicked in above 10^26 FLOPs of training compute.
Anthropic does not support SB 1047 in its current form. ... We list a set of substantial changes that, if made, would address our multiple concerns and result in a streamlined bill we could support in the interest of a safer, more trustworthy AI industry. Specifically, this includes narrowing the bill to focus on frontier AI developer safety by (1) shifting from prescriptive pre-harm enforcement to a deterrence model that incentivizes developers to implement robust safety and security protocols, (2) reducing potentially burdensome and counterproductive requirements in the absence of actual harm, and (3) removing duplicative or extraneous aspects.
Anthropic’s leaders did not want to be burdened by having to follow certain government-mandated requirements before critical harms occurred. Given how minimal the newly proposed requirements were, in an industry that severely lacks safety enforcement, such lobbying was actively detrimental to safety.
In what he called “a cynical procedural move,” Tegmark noted that Anthropic has also introduced amendments to the bill that touch on the remit of every committee in the legislature, thereby giving each committee another opportunity to kill it. “This is straight out of Big Tech’s playbook,” he said.
Only after amendments did Anthropic weakly support the bill. At this stage, Dario wrote a letter to Gavin Newsom, which supporters of the bill saw as one of the most significant positive developments at the time. But it was too little, too late. Newsom vetoed the bill.
Similarly, this year Dario wrote against the 10-year moratorium on state AI laws – three weeks after the provision was introduced.[10] Still, it was a major contribution for a leading AI company to speak out against the moratorium.
All of this raises a question: how much does Dario’s circle actually want to be held to account on safety, over being free to innovate how they want?
I’d say on a personal level SB 1047 has struck me as representative of many of the problems society encounters when thinking about safety at the frontier of a rapidly evolving industry…
How should we balance precaution with an experimental and empirically driven mindset? How does safety get ‘baked in’ to companies at the frontier without stifling them? What is the appropriate role for third-parties ranging from government bodies to auditors?
4. Built ties with AI weapons contractors and the US military
If an AI company’s leaders are committed to safety, one red line to not cross is building systems used to kill people. People dying because of your AI system is a breach of safety.
Once you pursue getting paid billions of dollars to set up AI for the military, you open up bad potential directions for yourself and other companies. A next step could be to get paid to optimise AI to be useful for the military, which in wartime means optimise for killing.
For large language models, there is particular concern around their use for ISTAR: intelligence, surveillance, target acquisition, and reconnaissance. Commercial LLMs can be used to investigate individual persons as targets, because those LLMs are trained on masses of personal data scraped from across the internet, as well as private chats. The Israeli military already uses LLMs to identify possible Hamas operatives, track them down, and bomb the entire apartment buildings they live in. Its main strategic ally is the US military-industrial-intelligence complex, which offers explosive ammunition and cloud services to the Israeli military, and is adopting some of the surveillance tactics that Israel has tested in Palestine, though with much less deadly consequences so far.
So that’s some political context. What does any of this have to do with Anthropic?
Anthropic’s intel-defence partnership
Unlike OpenAI, Anthropic never bound itself to a prohibition on offering models for military or warfare uses. Even before OpenAI broke its own prohibition, Anthropic had already gone ahead – without attracting as much public backlash.
In November 2024, Anthropic partnered with Palantir and Amazon to “provide U.S. intelligence and defense agencies access to the Claude 3 and 3.5 family...on AWS”:
We’ve accelerated mission impact across U.S. defense workflows with partners like Palantir, where Claude is integrated into mission workflows on classified networks. This has enabled U.S. defense and intelligence organizations with powerful AI tools to rapidly process and analyze vast amounts of complex data.
Later that month, Amazon invested another $4 billion in Anthropic, which raises a conflict of interest. If Anthropic had not agreed to host models for the military on AWS, would Amazon still have invested? Why did Anthropic go ahead with partnering with Palantir, a notorious mass surveillance and autonomous warfare contractor?
No matter how ‘responsible’ Anthropic presents itself to be, it is concerning how its operations are starting to get tied to the US military-industrial-intelligence complex.
In June 2025, Anthropic launched Claude Gov models for US national security clients. Along with OpenAI, it got a $200 million defence contract: “As part of the agreement, Anthropic will prototype frontier AI capabilities that advance U.S. national security.”
I don’t know about you, but prototyping “frontier AI capabilities” for the military seems to swerve away from their commitment to being careful about “capability demonstrations”. I guess Anthropic’s leaders would push for improving model security and preventing adversarial uses, and avoid the use of their models for military target acquisition. Yet Anthropic can still end up contributing to the automation of kill chains.
For one, Anthropic’s leaders will know little about what US military and intelligence agencies actually use Claude for, since “access to these models is limited to those who operate in such classified environments”. From a business perspective, though, running their models on secret Amazon servers is a plus: Anthropic can distance itself from any mass atrocity committed by its military client – as Microsoft recently did.
Anthropic’s earlier ties
Amazon is a cloud provider for the US military, raising conflicts of interest for Anthropic. But even before that, Anthropic’s leaders had ties to military AI circles. Anthropic received a private investment from Eric Schmidt. Eric is about the best-connected guy in AI warfare. Since 2017, Eric has illegally lobbied the Pentagon, chaired two military-AI committees, and then spun out his own military innovation thinktank styled after Henry Kissinger’s from the Vietnam War era. Eric invests in drone swarm start-ups and frequently talks about making network-centric autonomous warfare really cheap and fast.
Eric in turn is an old colleague of Jason Matheny. Jason used to be a trustee of Anthropic and still heads the military strategy thinktank the RAND Corporation. Before that, Jason founded the Georgetown thinktank CSET to advise on national security concerns around AI, receiving a $55 million grant through Holden. Holden in turn is the husband of Daniela and used to be roommates with Dario.
One angle on all this: Dario’s circle acted prudently to gain seats at the military-AI table. Another angle: Anthropic stretched the meaning of ‘safety first’ to keep up its cashflow.
5. Promoted band-aid fixes to speculative risks over existing dangers that are costly to address
In a private memo to employees, Dario wrote he had decided to solicit investments from Gulf State sovereign funds tied to dictators, despite his own misgivings.
A journalist leaked a summary of the memo:
Amodei acknowledged that the decision to pursue investments from authoritarian regimes would lead to accusations of hypocrisy. In an essay titled “Machines of Loving Grace,” Amodei wrote: “Democracies need to be able to set the terms by which powerful AI is brought into the world, both to avoid being overpowered by authoritarians and to prevent human rights abuses within authoritarian countries.”
Dario was at pains to ensure that authoritarian governments in the Middle East and China would not gain a technological edge:
“The basis of our opposition to large training clusters in the Middle East, or to shipping H20s to China, is that the ‘supply chain’ of AI is dangerous to hand to authoritarian governments—since AI is likely to be the most powerful technology in the world, these governments can use it to gain military dominance or to gain leverage over democratic countries,” Amodei wrote in the memo, referring to Nvidia chips.
Still, the CEO admitted investors could gain “soft power” through the promise of future funding. “The implicit promise of investing in future rounds can create a situation where they have some soft power, making it a bit harder to resist these things in the future. In fact, I actually am worried that getting the largest possible amounts of investment might be difficult without agreeing to some of these other things,” Amodei writes. “But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think are likely still many billions), and then hold firm if they ask.”
But here’s the rub: Anthropic never raised the issue of tech-accelerated authoritarianism in the US. This is a clear risk. Through AI-targeted surveillance and personalised propaganda, US society too could get locked into a totalitarian state, beyond anything Orwell imagined.
Talking about that issue would cost Anthropic. It would force its leaders to reckon with the fact that they are themselves partnering with a mass surveillance contractor to provide automated data analysis to the intelligence arms of an increasingly authoritarian US government. To end this partnership with its main investor, Amazon, would simply be out of the question.
Anthropic also does not campaign about the risk of runaway autonomous warfare, which it is gradually getting tied into by partnering with Palantir to contract for the US military. Instead, it justifies itself as helping a democratic regime combat overseas authoritarians.
Dario does keep warning of the risk of mass automation. Why pick that? Mass job loss is bad, but surely not worse than US totalitarianism or autonomous US-China wars. It can’t be just that job loss is a popular topic Dario gets asked about. Many citizens are alarmed too – on the left and on the right – about how ‘Big Tech’ enables ‘Deep State’ surveillance.
The simplest fitting explanation I found is that warning about future mass automation benefits Anthropic, but warning about how the tech is accelerating US authoritarianism or how the US military is developing a hidden kill cloud is costly to Anthropic.
I’m not super confident about this explanation. But it roughly fits the advocacy I’ve seen.
Anthropic can pitch mass automation to its investors – and it did – going so far as to provide a list of targetable jobs. But Dario is not alerting the public to the automation happening now. He is not warning that genAI already automates away the fulfilling jobs of writers and artists, driving some to suicide. He is certainly not offering to compensate creatives for billions of dollars in lost income. Recently, Anthropic’s lawyers warned that it might go bankrupt if it had to pay for having pirated books. That is a risk Anthropic worries about.
Cheap fixes for risks that are still speculative
In fact, much of Anthropic’s campaigning is on speculative risks that receive little attention in society. Recently it campaigned on model welfare, which is a speculative matter, to put it lightly. Here the current solution is cheap – a feature where Claude can end ‘abusive’ chats.
Anthropic has also campaigned on the risk of models generating advice that removes bottlenecks to producing bioweapons. So far, this is not a huge issue – Claude could regenerate guidance found on webpages that Anthropic scraped, but that guidance can just as easily be found directly by any actor resourceful enough to follow through. Down the line, this could well turn into a dangerous threat. But focussing on this risk now has a side-benefit: it costs Anthropic little and does not hinder it from continuing to scale and release models.
To deal with the bioweapons risk now, Anthropic uses cheap fixes. Thoroughly documenting its scraped datasets, so that its models cannot regenerate various toxic or dangerous materials, would be sound engineering practice – but too costly. Instead, researchers came up with a classifier that filters out some (but not all) of the materials, depending on the assumptions the researchers made in designing it.[11] They also trained models to block answers to bio-engineering-related questions.
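To illustrate why such a filter only catches what its designers anticipated, here is a deliberately simplified sketch of an output-filtering classifier – my own illustration under stated assumptions, not Anthropic’s actual system; the pattern list and function names are hypothetical:

```python
import re

# Hypothetical blocklist chosen by the classifier's designers.
# Anything the designers did not anticipate passes straight through.
BLOCKED_PATTERNS = [
    r"\bdangerous synthesis route\b",
    r"\baerosolised delivery\b",
]

def output_is_flagged(model_output: str) -> bool:
    """Return True if the output matches any designer-chosen pattern."""
    return any(re.search(p, model_output, re.IGNORECASE) for p in BLOCKED_PATTERNS)

def filtered_response(model_output: str) -> str:
    # Cheap fix: block flagged outputs, rather than documenting and curating
    # the training data that made those outputs possible in the first place.
    if output_is_flagged(model_output):
        return "I can't help with that."
    return model_output
```

A filter like this is only as good as BLOCKED_PATTERNS – that is, as good as the designers’ assumptions about what dangerous material looks like – which is why it catches some, but not all, of it.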
None of this is to say that individual Anthropic researchers are not serious about ‘model welfare’ or ‘AI-enabled bioterrorism’. But the direction of their work, and the costs they are allowed to incur, are supervised and signed off by the executives. Those executives are running a company that is bleeding cash, and needs to turn a profit to keep existing and to satisfy investors.
The company tends to campaign on risks that it can put off or offer cheap fixes for, while swerving around or downplaying already widespread issues that would be costly to address.
When it comes to current issues, staff focus on distant frames that do not implicate their company – Chinese authoritarianism but not the authoritarianism growing in the US; future automation of white-collar work but not how authors lose their incomes now.
Example of an existing problem that is costly to address
Anthropic is incentivised to downplay gnarly risks that are already showing up and require high (opportunity) costs to mitigate.
So if you catch Dario inaccurately downplaying existing issues that would cost a lot to address, this provides some signal for how he will respond to future larger problems.
This is more than a question of what to prioritise.
You may not care about climate change, yet still be curious to see how Dario deals with the issue – since he professes to care, but addressing it is costly.
If Dario is honest about the carbon emissions from Anthropic scaling up computation inside data centers, the implication from the viewpoint of environmental groups would be that Anthropic must stop scaling. Or at least, pay to clean up that pollution.
Instead, Dario makes a vaguely agnostic statement, claiming to be unsure whether their models accelerate climate change or not:
So I mean, I think the cloud providers that we work with have carbon offsets.
Multiple researchers have pointed out issues especially with Amazon (Anthropic’s provider) trying to game carbon offsets. From a counterfactual perspective, what those offsets actually compensate for is grossly overstated: the low-hanging fruit will mostly be captured anyway, so Amazon’s further extra carbon emissions belong at the back of the queue of offset interventions. Also, Amazon’s data center offsets are only meant to compensate for carbon emissions at the end of the supply chain – not for all pollution across the entire hardware supply and operation chain.
It’s a complex question because it’s like, you know, you train a model. It uses a bunch of energy, but then it does a bunch of tasks that might have required energy in other ways. So I could see them as being something that leads to more energy usage or leads to less energy.
Such ‘could be either’ thinking overcomplicates the issue. Before, a human did the task. Now that human still consumes energy to live, and on top of that they or their company run energy-intensive Claude models. On net, this results in more energy usage (I sketch the accounting below).
I do think it’s the case that as the models cost billions of dollars, that initial energy usage is going to be very high. I just don’t know whether the overall equation is positive or negative. And if it is negative, then yeah, I think...I think we should worry about it.
Dario does not think he needs to act, because he hasn’t concluded yet that Anthropic’s scaling contributes to pollution. He might not be aware of it, but this line of thinking is similar to tactics used by Big Oil to sow public doubt and thus delay having to restrict production.
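To make that bookkeeping explicit, here is a minimal sketch of the net-energy accounting – my own simplification, with hypothetical parameter names and no measured figures:

```python
def net_extra_energy_kwh(inference_energy_kwh: float,
                         amortised_training_kwh: float,
                         energy_displaced_elsewhere_kwh: float = 0.0) -> float:
    """Extra energy used when the same task is done with an LLM in the loop.

    The human's own energy use appears in both the baseline scenario and the
    with-AI scenario, so it cancels out. What remains is the model's inference
    energy, an amortised share of its training energy, and whatever energy the
    model genuinely displaces elsewhere (the 'might have required energy in
    other ways' that Dario gestures at).
    """
    return (inference_energy_kwh
            + amortised_training_kwh
            - energy_displaced_elsewhere_kwh)
```

The result is positive unless the displaced-energy term outweighs the model’s own footprint – which is exactly the part Dario gestures at but does not demonstrate.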
Again, you might not prioritise climate change. You might not worry, as I do, about an auto-scaling dynamic: if AI corporations become profitable, they keep reinvesting profits into increasingly automated, ever-expanding toxic infrastructure that extracts more profit.
What is still concerning is that Dario did not take a scout mindset here. He did not seek out the truth about an issue that, by his own words, we should worry about if it turns out negative.
Conclusion
Dario’s circle started by scaling up GPT’s capabilities at OpenAI, and then moved on to compete with Claude at Anthropic.
They offered sophisticated justifications, many of which turned out flawed in important ways. Researchers thought they would make significant progress on safety that competitors would fail to make; promoted the desirability or inevitability of scaling to AGI while downplaying complex, intractable risks; believed their company would act responsibly and delay the release of more capable models; and/or trusted leaders to stick to commitments made to the safety community once the company moved on to larger investors.
Looking back, it was a mistake in my view to support Dario’s circle to start up Anthropic. At the time, it felt like it opened up a path for ensuring that people serious about safety would work on making the most capable models safe. But despite well-intentioned work, Anthropic contributed to a race to the bottom, by accelerating model development and lobbying for minimal voluntary safety policies that we cannot trust its board to enforce.
The main question I want to leave you with: how can the safety community do radically better at discerning company directions and coordinating to hold leaders accountable?
These investors were Dustin Moskovitz, Jaan Tallinn and Sam Bankman-Fried. Dustin was advised to invest by Holden Karnofsky. Sam invested $500 million through FTX, by far the largest investment.
Buck points out that the 80K job board restricts Anthropic positions to those ‘vaguely related to safety or alignment’. I forgot to mention this.
Though Anthropic advertises research or engineering positions as safety-focussed, people joining Anthropic can still get directed to work on capabilities or product improvement. I posted a short list of related concerns before here.
As a metaphor, say a conservator just broke off the handles of a few rare Roman vases. Pausing for a moment, he pitches me for his project to ‘put vase safety first’. He gives sophisticated reasons for his approach – he believes he can only learn enough about vase safety by breaking parts of vases, and that if he doesn’t do it now, the vandals will eventually get ahead of him. After pocketing my payment, he resumes chipping away at the vase pieces. I walk off. What just happened, I wonder? After reflecting on it, I turn back. It’s not about the technical details, I say. I disagree with the premise – that there is no other choice but to damage Roman vases, in order to prevent the destruction of all Roman vases. I will stop supporting people, no matter how sophisticated their technical reasoning, who keep encouraging others to cause more damage.
As a small example, it is considerate that Anthropic researchers cautioned against educators grading students using Claude. Recently, Anthropic also intervened on hackers who used Claude’s code generation to commit large-scale theft. There are a lot of well-meant efforts coming from people at Anthropic, but I have to zoom out and look at the thrust of where things have gone.
Especially if OpenAI had not received a $30 million grant advised by Holden Karnofsky, who was co-living with the Amodei siblings at the time. This grant not only kept OpenAI afloat after its biggest backer, Elon Musk, left soon afterwards; it also legitimised OpenAI to safety researchers who were skeptical at the time.
This resembles another argument by Paul for why it wasn’t bad to develop RLHF:
Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong.
While still at OpenAI, Tom Brown started advising Fathom Radiant, a start-up that was building a supercomputer with faster interconnect based on fibre-optic cables. The discussions were about Fathom Radiant offering discounted compute to researchers, to support differential progress on ‘alignment’.
This is based on my one-on-ones with co-CEO Michael Andregg. At the time, Michael was recruiting engineers from the safety community, turning up at EA Global conferences, in the EA Operations Slack, and so on. Michael told me that OpenAI researchers had told him that Fathom Radiant was the most technically advanced of all the start-ups they had looked at. Just after the safety researchers left OpenAI, Michael told me he had been planning to support OpenAI by giving them compute for their alignment work, but that they had now decided not to.
My educated guess is that Fathom Radiant moved on to offering low-price-tier computing services to Anthropic. A policy researcher I know in the AI Safety community told me he also thinks this is likely. He told me that Michael was roughly looking for anything that could justify that what he was doing was good.
It’s worth noting how strong Fathom Radiant’s ties were with people who later ended up in Anthropic’s C-suite. Virginia Blanton served as Head of Legal and Operations at Fathom Radiant, and then left to become Director of Operations at Anthropic (I don’t know what she does now; her LinkedIn profile is deleted).
I kept checking online for news over the years. It now seems that Fathom Radiant’s supercomputer project has failed. Fathomradiant.co now redirects to a minimal company website for “Atomos Systems”, which builds “general-purpose super-humanoid robots designed to operate and transform the physical world”. Michael Andregg is also no longer shown in the team – only his brother William Andregg.
It’s also notable that Paul has since tentatively advocated for a pause on hardware development and production, after having justified reducing the hardware overhang in prior years.
Maybe Anthropic’s policy team was already advocating against the 10-year moratorium in private conversations with US politicians? In that case, kudos to them.
This reminds me of how LAION failed to filter out thousands of images of child sexual abuse from their popular image dataset, and then just fixed it inadequately with a classifier.
Anthropic’s leading researchers acted as moderate accelerationists
In 2021, a circle of researchers left OpenAI, after a bitter dispute with their executives. They started a competing company, Anthropic, stating that they wanted to put safety first. The safety community responded with broad support. Thought leaders recommended engineers to apply, and allied billionaires invested.[1]
Anthropic’s focus has shifted – from internal-only research and cautious demos of model safety and capabilities, toward commercialising models for Amazon and the military.
Despite the shift, 80,000 Hours continues to recommend talented engineers to join Anthropic.[2] On the LessWrong forum, many authors continue to support safety work at Anthropic, but I also see side-conversations where people raise concerns about premature model releases and policy overreaches. So, a bunch of seemingly conflicting opinions about work by different Anthropic staff, and no overview. But the bigger problem is that we are not evaluating Anthropic on its original justification for existence.
Did early researchers put safety first? And did their work set the right example to follow, raising the prospect of a ‘race to the top’? If yes, we should keep supporting Anthropic. Unfortunately, I argue, it’s a strong no. From the get-go, these researchers acted in effect as moderate accelerationists.
Some limitations of this post:
I was motivated to write because I’m concerned about how contributions by safety folks to AGI labs have accelerated development, and want this to be discussed more. Anthropic staff already make cogent cases on the forum for how their work would improve safety. What is needed is a clear countercase. This is not a balanced analysis.
I skip many nuances. The conclusion seems roughly right though, because of overdetermination. Two courses of action – scaling GPT rapidly under a safety guise, starting a ‘safety-first’ competitor that actually competed on capabilities – each shortened timelines so much that no other actions taken could compensate. Later actions at Anthropic were less bad but still worsened the damage.[3]
I skip details of technical safety agendas because these carry little to no weight. As far as I see, there was no groundbreaking safety progress at or before Anthropic that can justify the speed-up that their researchers caused. I also think their minimum necessary aim is intractable (controlling ‘AGI’ enough, in time or ever, to stay safe[4]).
I fail to mention other positive contributions made by Anthropic folks to the world.[5] This feels unfair. If you joined Anthropic later, this post is likely not even about your work, though consider whether you’re okay with following your higher-ups.
I focus on eight collaborators at OpenAI – most of whom worked directly on scaling or releasing GPT-2 and GPT-3 – who went on to found, lead, or advise Anthropic.
I zero in on actions by Dario Amodei, since he acted as a leader throughout, and therefore his actions had more influence and were covered more in public reporting. If you have inside knowledge, please chip in and point out any misinterpretations.
I imply GPT was developed just by Dario and others from the safety community. This is not true. Ilya Sutskever, famous for scaling AlexNet’s compute during his PhD under Hinton, officially directed scaling the transformer models at OpenAI. Though Ilya moved to the ‘Superalignment’ team and left to found ‘Safe Superintelligence’, he does not seem to be much in discussions with safety folks here. Other managers publicly committed to support ‘safety’ work (e.g. Sam Altman), but many did not (e.g. Dario’s co-lead, Alec Radford). All joined forces to accelerate development.
I have a perspective on what ‘safety’ should be about: Safety is about constraining a system’s potential for harm. Safety is about protecting users and, from there, our society and ecosystem at large. If one cannot even design a product and the business operations it relies on to not harm current living people, there is no sound basis to believe that scaling that design up will not also deeply harm future generations.
→ If you disagree with this perspective, then section 4 and 5 are less useful for you.
Let’s dig into five courses of action:
1. Scaled GPT before founding Anthropic
Dario Amodei co-led the OpenAI team that developed GPT-2 and GPT-3. He, Tom Brown, Jared Kaplan, Benjamin Mann, and Paul Christiano were part of a small cohort of technical researchers responsible for enabling OpenAI to release ChatGPT.
This is covered in a fact-checked book by the journalist Karen Hao. I was surprised by how large the role of Dario was, whom for years I had seen as a safety researcher. His scaling of GPT was instrumental, not only in setting Dario up for founding Anthropic in 2021, but also in setting off the boom after ChatGPT.
So I’ll excerpt from the book, to provide the historical context for the rest of this post:
Dario Amodei insisted on scaling fast, even as others suggested a more gradual approach. It’s more than that – his circle actively promoted it. Dario’s collaborator and close friend, Jared Kaplan, led a project to investigate the scaling of data, compute, and model size.
In January 2020, Jake and Dario published the Scaling Laws paper along with Tom Brown and Sam McCandlish (later CTO at Anthropic). Meaning that a majority of Anthropic’s founding team of seven people were on this one paper.
None of this is an infohazard, but it does pull the attention – including of competitors – toward the idea of scaling faster. This seems reckless – if you want to have more gradual development so you have time to work on safety, then what is the point? There is a scientific interest here, but so was there in scaling the rate of fission reactions. If you go ahead publishing anyway, you’re acting as a capability researcher, not a safety researcher.
This was not the first time.
In June 2017, Paul Christiano, who later joined Anthropic as trustee, published about a technique he invented, reinforcement learning from human feedback. His co-authors include Dario and Tom – as well as Jan Leike, who joined Anthropic later.
Here is the opening text:
The authors here emphasised making agents act usefully by solving tasks cheaply enough.
Recall that Dario joined forces on developing GPT because he wanted to apply RLHF to non-toy-environments. This allowed Dario and Paul to make GPT usable in superficially safe ways and, as a result, commercialisable. Paul later gave justifications why inventing RLHF and applying this technique to improving model functionality had low downside. There are reasons to be skeptical.
In December 2020, Dario’s team published the paper that introduced GPT-3. Tom is the first author of the paper, followed by Benjamin Mann, another Anthropic founder.
Here is the opening text:
To me, this reads like the start of a recipe for improving capabilities. If your goal is actually to prevent competitors from accelerating capabilities, why tell them the way?
But by that point, the harm had already been done, as covered in Karen Hao’s book:
So GPT-3 – as scaled by Dario’s team and linked up to an API – had woken up capability researchers at other labs, even though their executives were not yet budging on strategy.
Others were alarmed and advocated internally against scaling large language models. But these were not AGI safety researchers, but critical AI researchers, like Dr. Timnit Gebru.
In March 2021, Timnit Gebru collaborated on a paper that led to her expulsion by Google leaders (namely Jeff Dean). Notice the contrast to earlier quoted opening texts:
This matches how I guess careful safety researchers write. Cover the past architectural innovations but try not to push for more. Focus on risks and paths to mitigate those risks.
Instead, Dario’s circle acted as capability researchers at OpenAI. At the time, at least three rationales were given for why scaling capabilities is a responsible thing to do:
Rationale #1: ‘AI progress is inevitable’
Dario’s team expected that if they did not scale GPT, this direction of development would have happened soon enough at another company anyway. This is questionable.
Even the originator of transformers, Google, refrained from training on copyrighted text. Training on a library-sized corpus was unheard of. Even after the release of GPT-3, Jeff Dean, head of AI research at the time, failed to convince Google executives to ramp up investment into LLMs. Only after ChatGPT was released did Google toggle to ‘code red’.
Chinese companies would not have started what OpenAI did, Karen Hao argues:
Only Dario and collaborators were massively scaling transformers on texts scraped from pirated books and webpages. If the safety folks had refrained, scaling would have been slower. And OpenAI may have run out of compute – since if it had not scaled so fast to GPT-2+, Microsoft might not have made the $1 billion investment, and OpenAI would not have been able to spend most of it on discounted Azure compute to scale to GPT-3.
Karen Hao covers what happened at the time:
No other company was prepared to train transformer models on text at this scale. And it’s unclear whether OpenAI would have gotten to a ChatGPT-like product without the efforts of Dario and others in his safety circle. It’s not implausible that OpenAI would have caved in.[6] It was a nonprofit that was bleeding cash on retaining researchers who were some of the most in-demand in the industry, but kept exploring various unprofitable directions.
The existence of OpenAI shortened the time to a ChatGPT-like product by, I guess, at least a few years. It was Dario’s circle racing to scale to GPT-2 and GPT-3 – and then racing to compete at Anthropic – that removed most of the bottlenecks to getting there.
What if upon seeing GPT-1, they had reacted “Hell no. The future is too precious to gamble on capability scaling”? What if they looked for allies, and used any tactic on the books to prevent dangerous scaling? They didn’t seem motivated to. If they had, they would have been forced to leave earlier, as Timnit Gebru was. But our communities would now be in a better position to make choices, than where they actually left us.
Rationale #2: ‘we scale first so we can make it safe’
Recall this earlier excerpt:
Dario thought that by getting ahead, his research circle could then take the time to make the most capable models safe before (commercial) release. The alternative in his eyes was allowing reckless competitors to get there first and deploy faster.
While Dario’s circle cared particularly for safety in the existential sense, in retrospect it seems misguided to justify actual accelerated development with speculative notions of maybe experimentally reaching otherwise unobtained safety milestones. What they ended up doing was use RLHF to finetune models for relatively superficial safety aspects.
Counterfactually, any company first on the scene here would likely have finetuned their models anyway for many of the same safety aspects, forced by the demands by consumers and government enforcement agencies. Microsoft’s staff did so, after its rushed Sydney release of GPT-4 triggered intense reactions by the public.
Maybe though RLHF enabled interesting work on complex alignment proposals. But is this significant progress on the actual hard problem? Can any such proposal be built into something comprehensive enough to keep fully autonomous learning systems safe?
Dario’s rationale further relied on his expectation that OpenAI’s leaders would delay releasing the models his team had scaled up, and that this would stem a capability race.
But OpenAI positioned itself as ‘open’. And its researchers participated in an academic community where promoting your progress in papers and conferences is the norm. Every release of a GPT codebase, demo, or paper alerted other interested competing researchers. Connor Leahy could just use Dario’s team’s prior published descriptions of methods to train his own version. Jack Clark, who was policy director of OpenAI and now is at Anthropic, ended up delaying the release of GPT-2’s code by around 9 months.
Worse, GPT-3 was packaged fast into a commercial release through Microsoft. This was not Dario’s intent, who apparently felt Sam Altman had misled him. Dario did not discern he was being manipulated by a tech leader with a track record of being manipulative.
By scaling unscoped models that hide all kinds of bad functionality, and can be misused at scale (e.g. to spread scams or propaganda), Dario’s circle made society less safe. By simultaneously implying they could or were making these inscrutable models safe, they were in effect safety-washing.
Chris Olah’s work on visualising circuits and mechanistic interpretability made for flashy articles promoted on OpenAI’s homepage. In 2021, I saw an upsurge of mechinterp teams joining AI Safety Camp, whom I supported, seeing it as cool research. It nerdsniped many, but progress in mechinterp has remained stuck around mapping the localised features of neurons and the localised functions of larger circuits, under artificially constrained input distributions. This is true even of later work at Anthropic, which Chris went on to found.
Some researchers now dispute that mapping mechanistic functionality is a tractable aim. The actual functioning of a deployed LLM is complex, since it not only depends on how shifting inputs received from the world are computed into outputs, but also how those outputs get used or propagated in the world.
Internally, a foundational model carries hidden functionality that gets revealed only with certain input keys (this is what allows for undetectable backdoors).
Externally, “the outputs…go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences” (to quote Eliezer Yudkowsky).
Traction is limited in terms of the subset of input-to-output mappings that get reliably interpreted, even in a static neural network. Even where computations of inputs to outputs are deterministically mapped, this misses how outputs end up corresponding to effects in the noisy physical world (and how effects feed back into model inputs/training).
Interpretability could be used for specific safety applications, or for AI ‘gain of function’ research. I’m not necessarily against Chris’ research. What’s bad is how it got promoted.
Researchers in Chris’ circle promoted interpretability as a solution to an actual problem (inscrutable models) that they were making much worse (by scaling the models). They implied the safety work to be tractable in a way that would catch up with the capability work that they were doing. Liron Shapira has a nice term for this: tractability-washing.
Tractability-washing corrupts. It disables our community from acting with integrity to prevent reckless scaling. If instead of Dario’s team, accelerationists at Meta had taken over GPT training, we could at least know where we stand. Clearly then, it was reckless to scale data by 100x, parameters by 1000x, and compute by 10000x – over just three years.
But safety researchers did this, making it hard to orient. Was it okay to support trusted folks in safety to get to the point that they could develop their own trillion-parameter models? Or was it bad to keep supporting people who kept on scaling capabilities?
Rationale #3: ‘we reduce the hardware overhang now to prevent disruption later’
Paul sums this up well:
Sam Altman also wrote about this in ‘Planning for AGI and Beyond’:
It’s unclear what “many of us” means, and I do not want to presume that Sam accurately represented the views of his employees. But the draft was reviewed by “Paul Christiano, Jack Clark, Holden Karnofsky” – all of whom were already collaborating with Dario.
The rationale of reducing hardware overhang is flawed:
It accelerates hardware production. Using more hardware increases demand for that hardware, triggering a corresponding increase in supply. Microsoft did not just provide more of its data centers to OpenAI, but also built more data centers to house more chips it could buy from Nvidia. Nvidia in turn reacted by scaling up production of its chips, especially once there was a temporary supply shortage.
It is a justification that can be made just as well by someone racing to the bottom. Sam Altman not only tried to use the hardware overhang. Once chips got scarce, Sam pitched the UAE to massively invest in new chip manufacturing. And Tom Brown just before leaving to Anthropic, was in late-stage discussions with Fathom Radiant to get cheap access to their new fibre-optic-connected supercomputer.[8]
It assumes that ‘AGI’ is inevitable and/or desirable. Yet new technologies can be banned (especially when still unprofitable and not depended on by society). And there are sound, nonrefuted reasons why keeping these unscoped autonomously learning and operating machine systems safe would actually be intractable.
2. Founded an ‘AGI’ development company and started competing on capabilities
Karen Hao reports on the run-up to Dario’s circle leaving OpenAI:
There is a repeating pattern here:
Founders of an AGI start-up air their concerns about ‘safety’, and recruit safety-concerned engineers and raise initial funding that way. The culture sours under controlling leaders, as the company grows dependent on Big Tech’s compute and billion-dollar investments.
This pattern has roughly repeated three times:
DeepMind was funded by Jaan Tallinn and Peter Thiel (intro’d by Eliezer). Mustafa Suleyman got fired for abusively controlling employees, but Demis Hassabis stayed on. Then DeepMind lost so much money that it had to be acquired by Google.
Distrusting DeepMind (as now directed by Demis under Google), Sam Altman and Elon Musk founded the nonprofit OpenAI. Sam and Elon fight for the CEO position, and Sam gains control. Holden made a grant to this nonprofit, which subsequently acted illegally as a for-profit under investments by Microsoft.
Distrusting OpenAI (as now directed by Sam to appease Microsoft), Daniela and Dario left to found Anthropic. Then all the top billionaires in the safety community invested. Then Anthropic received $8 billion in investments from Amazon.
We are dealing with a gnarly situation.
Principal-agent problem: The safety community supports new founders who convince them they’ll do good work for the cause of safety. For years, safety folks believe the leaders. But once cases of dishonesty get revealed and leaders sideline safety people who are no longer needed, the community distrusts the leaders and allies with new founders.
Rules for rulers: Leaders seek to gain and maintain their positions of power over AI development. In order to do so, they need to install key people who can acquire the resources needed for them to stay in power, and reward those key people handsomely, even if it means extracting from all the other outside citizens who have no say about what the company does.
Race to the bottom: Collaborators at different companies cut corners believing that if they don’t, then their competitors might get there first and make things even worse. The more the people participating treat this as a finite game in which they are acting independently from other untrusted individual players, the more they lose integrity with their values.
One take on this is a brutal realist stance: That’s just how business gets done. They convince us to part with our time and money and drop us when we’re no longer needed, they gather their loyal lackeys and climb to the top, and then they just keep playing this game of extraction until they’ve won.
It is true that’s how business gets done. But I don’t think any of us here are just in it for the business. Safety researchers went to work at Anthropic because they care. I wouldn’t want us to tune out our values – but it’s important to discern where Anthropic’s leaders are losing integrity with the values we shared.
The safety community started with much trust in and willingness to support Anthropic. That sentiment seems to be waning. We are seeing leaders starting to break some commitments and enter into shady deals like OpenAI leaders did – allowing them to gain relevance in circles of influence, and to keep themselves and their company on top.
Something like this happened before, so discernment is needed. It would suck if we support another ‘safety-focussed’ start-up that ends up competing on capabilities.
I’ll share my impression of how Anthropic staff presented their commitments to safety in the early days, and how this seemed in increasing tension with how the company acted.
Early commitments
In March 2023, Anthropic published its ‘Core Views on AI Safety’:
The general impression I came away with was that Anthropic was going to be careful not to release models with capabilities that significantly exceeded those of ChatGPT and other competing products. Instead, Anthropic would compete on having a reliable and safe product, and try to pull competitors into doing the same.
Dario has repeatedly called for a race to the top on safety, such as in this Time piece.
Degrading commitments
After safety-allied billionaires invested in Series A and B, Anthropic’s leaders moved on to pitch investors outside of the safety community.
On April 2023, TechCrunch leaked a summary of the Series C pitchdeck:
Some people in the safety community commented with concerns. Anthropic leaders seemed to act like racing on capabilities was necessary. It felt egregious compared to the expectations that I and friends in safety had gotten from Anthropic. Worse, leaders had kept these new plans hidden from the safety community – it took a journalist to leak it.
From there, Anthropic started releasing models with capabilities that ChatGPT lacked:
In July 2023, Anthropic was the first to introduce a large context window reaching 100,000 tokens (about 75,000 words) compared to ChatGPT’s then 32,768 tokens.
In March 2024, Anthropic released Claude Opus, which became preferred by programmers for working on large codebases. Anthropic’s largest customer is Cursor, a coding platform.
In October 2024, Anthropic was first to release an ‘agent’ that automatically directs actions in the computer browser. It was a beta release that worked pretty poorly and could potentially cause damage for customers.
None of these are major advancements beyond state of the art. You could argue that Anthropic stuck to original commitments here, either deliberately or because they lacked anything substantially more capable than OpenAI to release. Nonetheless, they were competing on capabilities, and the direction of those capabilities is concerning.
If a decade ago, safety researchers had come up with a list of engineering projects to warn about, I guess it would include ‘don’t rush to build agents’, and ‘don’t connect the agent up to the internet’ and ‘don’t build an agent to code by itself’. While the notion of current large language models actually working as autonomous agents is way overhyped, Anthropic engineers are developing models in directions that would have scared early AGI safety researchers. Even from a system safety perspective, it’s risky to build an unscoped system that can modify surrounding infrastructure in unexpected ways (by editing code, clicking through browsers, etc).
Anthropic has definitely been less reckless than OpenAI in terms of model releases.
I just think that ‘less reckless’ is not a good metric. ‘Less reckless’ is still reckless.
Another way to look at this is that Dario, like other AI leaders before him, does not think he is acting recklessly, because he thinks things likely go well anyway – as he kept saying:
Declining safety governance
The most we can hope for is oversight by their board, or by the trust set up to elect new board members. But the board’s most recent addition is Reed Hastings, known for scaling a film subscription company, but not a safe engineering culture. Indeed, the reason given is that Reed “brings extensive experience from founding and scaling Netflix into a global entertainment powerhouse”. Before that, trustees elected Jay Krepps, giving a similar reason: his “extensive experience in building and scaling highly successful tech companies will play an important role as Anthropic prepares for the next phase of growth”. Before that, Yasmin Razavi from Spark Capital joined, for making the biggest investment in the Series C round.
The board lacks any independent safety oversight. It is presided by Daniela Amodei, who along with Dario Amodei has remained there since founding Anthropic. For the rest, three tech leaders joined, prized for their ability to scale companies. There used to be one independent-ish safety researcher, Luke Muehlhauser, but he left one year ago.
The trust itself cannot be trusted. It was supposed to “elect a majority of the board” for the sake of long-term interests such as “to carefully evaluate future models for catastrophic risks”. Instead, trustees brought in two tech guys who are good at scaling tech companies. The trust was also meant to be run by five trustees, but it’s been under that count for almost two years – they failed to replace trustees after two left.
3. Lobbied for policies that minimised Anthropic’s accountability for safety
Jack Clark has been the policy director at Anthropic ever since he left the same role at OpenAI. Under Jack, some of the policy advocacy tended to reduce Anthropic’s accountability. There was a tendency to minimise Anthropic having to abide by any hard or comprehensive safety commitments.
Much of this policy work is behind closed doors. But I rely on just some online materials I’ve read.
I’ll focus on two policy initiatives discussed at length in the safety community:
Anthropic’s advocacy for minimal ‘Responsible Scaling Policies’
Anthropic’s lobbying against provisions in California’s safety bill SB 1047.
Minimal ‘Responsible Scaling Policies’
In September 2023, Anthropic announced its ‘Responsible Scaling Policy’.
Anthropic’s RSPs are well known in the safety community. I’ll just point to the case made by Paul Christiano, a month after he joined Anthropic’s Long-Term Benefit Trust:
While Paul did not wholeheartedly endorse RSPs, and included some reservations, the thrust of it is that he encouraged the safety community to support Anthropic’s internal and external policy work on RSPs.[9]
A key issue with RSPs is how they’re presented as ‘good enough for now’. If companies adopt RSPs voluntarily, the argument goes, it’d lay the groundwork for regulations later.
Several authors on the forum argued that this was misleading.
– Siméon Campos:
–Oliver Habryka:
–Remmelt Ellen (me):
At the time, Anthropic’s policy team was actively lobbying for RSPs in US and UK government circles. This bore fruit. Ahead of the UK AI Safety Summit, leading AI companies were asked to outline their responsible capability scaling policy. Both OpenAI and Deepmind soon released their own policies on ‘responsible’ scaling.
Some policy folks I knew were so concerned that they went on trips to advocate against RSPs. Organisers put out a treaty petition as a watered–down version of the original treaty, because they wanted to get as many signatories from leading figures, in part to counter Anthropic’s advocacy for self-regulation through RSPs.
Opinions here differ. I think that Anthropic advocated for companies to adopt overly minimal policies that put off accountability for releasing models that violate already established safe engineering practices. I’m going to quote some technical researchers who are experienced in working with and/or advising on these established practices:
Siméon Campos wrote on existing risk management frameworks:
Heidy Khlaaf wrote on scoped risk assessments before joining UK’s AI Safety Institute:
Timnit Gebru replied on industry shortcomings to the National Academy of Engineering:
Once, a senior safety engineer on medical devices messaged me, alarmed after the release of ChatGPT. It boggled her that such an unscoped product could just be released to the public. In her industry, medical products have to be designed for a clearly defined scope (setting, purpose, users) and tested for safety in that scope. This all has to be documented in book-sized volumes of paper work, and the FDA gets the final say.
Other established industries also have lengthy premarket approval processes. New cars and planes too must undergo audits, before a US government department decides to deny or approve the product’s release to market.
The AI industry, however, is an outgrowth of the software industry, which has a notorious disregard of safety. Start-ups sprint to code up a product and rush through release stages.
XKCD put it well:
At least programmers at start-ups write out code blocks with somewhat interpretable functions. Auto-encoded weights of LLMs, on the other hand, are close to inscrutable.
So that’s the context Anthropic is operating in.
Safety practices in the AI industry are often appalling. Companies like Anthropic ‘scale’ by automatically encoding a model to learn hidden functionality from terabytes of undocumented data, and then marketing it as a product that can be used everywhere.
Releasing unscoped automated systems like this is a set-up for insidious and eventually critical failures. Anthropic can’t evaluate Claude comprehensively for such safety issues.
Staff do not openly admit that they are acting way out of the bounds of established safety practices. Instead, they expect us to trust them having some minimal responsibilities for scaling models. Rather than a race to the top, Anthropic cemented a race to the bottom.
I don’t deny the researchers’ commitment – they want to make general AI generally safe. But if the problem turns out too complex to adequately solve for, or they don’t follow through, we’re stuffed.
Unfortunately, their leaders recently backpedalled on one internal policy commitment:
Anthropic staked out its responsibility for designing models to be safe, which is minimal. It can change its internal policy any time. We cannot trust its board to keep leaders in line.
This leaves external regulation. As Paul wrote:
“I don’t think voluntary implementation of responsible scaling policies is a substitute for regulation. Voluntary commitments are unlikely to be universally adopted or to have adequate oversight, and I think the public should demand a higher degree of safety.”
Unfortunately, Anthropic has lobbied to cut down regulations that were widely supported by the public. The clearest case of this is California’s safety bill SB 1047.
Lobbied against provisions in SB 1047
The bill’s demands were light, mandating ‘reasonable care’ in training future models to prevent critical harms. It did not even apply to the model that Anthropic pitched to investors, as requiring 1025 FLOPS for training. The bill only kicked in at a computing power greater than 1026.
Yet Anthropic lobbied against the bill:
Anthropic’s leaders did not want to be burdened by having to follow certain government-mandated requirements before critical harms occurred. Given how minimal the newly proposed requirements were, in an industry that severely lacks safety enforcement, such lobbying was actively detrimental to safety.
Anthropic also acted in ways that increased the chance that the bill would be killed:
Only after amendments, did Anthropic weakly support the bill. At this stage, Dario wrote a letter to Gavin Newsom. This was seen as one of the most significant positive developments at the time by supporters of the bill. But it was too little too late. Newsom canned the bill.
Similarly this year, Dario wrote against the 10 year moratorium on state laws – three weeks after the stipulation was introduced.[10] Still it was a major contribution for a leading AI company to speak out against the moratorium.
All of this raises a question: how much does Dario’s circle actually want to be held to account on safety, over being free to innovate how they want?
Let me end with some commentary by Jack Clark:
4. Built ties with AI weapons contractors and the US military
If an AI company’s leaders are committed to safety, one red line to not cross is building systems used to kill people. People dying because of your AI system is a breach of safety.
Once you pursue getting paid billions of dollars to set up AI for the military, you open up bad potential directions for yourself and other companies. A next step could be to get paid to optimise AI to be useful for the military, which in wartime means optimise for killing.
For large language models, there is particular concern around using ISTAR capabilities: intelligence, surveillance, target acquisition, and reconnaissance. Commercial LLMs can be used to investigate individual persons to target, because those LLMs are trained on lots of personal data scraped everywhere from the internet, as well as private chats. The Israeli military already uses LLMs to identify maybe-Hamas-operatives, to track them down and bomb the entire apartment buildings that they live in. Its main strategic ally is the US military-industrial-intelligence complex, which offers explosive ammunition and cloud services to the Israeli military, and is adopting some of the surveillance tactics that Israel has tested in Palestine, though with much less deadly consequences for now.
So that’s some political context. What does any of this have to do with Anthropic?
Anthropic’s intel-defence partnership
Anthropic did not bind itself to not offering models for military or warfare uses, unlike OpenAI. Before OpenAI broke its own prohibition, Anthropic already went ahead without as much public backlash.
In November 2024, Anthropic partnered with Palantir and Amazon to “provide U.S. intelligence and defense agencies access to the Claude 3 and 3.5 family...on AWS”:
Later that month, Amazon invested another $4 billion in Anthropic, which raises a conflict of interest. If Anthropic hadn’t agreed to hosting models for the military using AWS, would Amazon still have invested? Why did Anthropic go ahead with partnering with Palantir, a notorious mass surveillance and autonomous warfare contractor?
No matter how ‘responsible’ Anthropic presents itself to be, it is concerning how its operations are starting to get tied to the US military-industrial-intelligence complex.
In June 2025, Anthropic launched Claude Gov models for US national security clients. Along with OpenAI, it got a $200 million defence contract: “As part of the agreement, Anthropic will prototype frontier AI capabilities that advance U.S. national security.”
I don’t know about you, but prototyping “frontier AI capabilities” for the military seems to swerve away from their commitment to being careful about “capability demonstrations”. I guess Anthropic’s leaders would push for improving model security and preventing adversarial uses, and avoid the use of their models for military target acquisition. Yet Anthropic can still end up contributing to the automation of kill chains.
For one, Anthropic leaders will know little about what US military and intelligence agencies actually use Claude for, since “access to these models is limited to those who operate in such classified environments”. Though from a business perspective, it is a plus to run their models on secret Amazon servers, since Anthropic can distance itself from any mass atrocity committed by their military client. Like Microsoft recently did.
Anthropic’s earlier ties
Amazon is a cloud provider for the US military, raising conflicts of interest for Anthropic. But already before, Anthropic’s leaders had ties to military AI circles. Anthropic received a private investment from Eric Schmidt. Eric is about the best-connected guy in AI warfare. Since 2017, Eric illegally lobbied the Pentagon, chaired two military-AI committees, and then spun out his own military innovation thinktank styled after Henry Kissinger’s during the Vietnam war. Eric invests in drone swarm start-ups and frequently talks about making network-centric autonomous warfare really cheap and fast.
Eric in turn is an old colleague of Jason Matheny. Jason used to be a trustee of Anthropic and still heads the military strategy thinktank RAND corporation. Before that, Jason founded the thinktank Georgetown CSET to advise on national security concerns around AI, receiving a $55 million grant through Holden. Holden in turn is the husband of Daniela and used to be roommates with Dario.
One angle on all this: Dario’s circle acted prudently to gain seats at the military-AI table.
Another angle: Anthropic stretched the meaning of ‘safety first’ to keep up its cashflow.
5. Promoted band-aid fixes to speculative risks over existing dangers that are costly to address
In a private memo to employees, Dario wrote he had decided to solicit investments from Gulf State sovereign funds tied to dictators, despite his own misgivings.
A journalist leaked a summary of the memo:
Dario was at pains to ensure that authoritarian governments in the Middle East and China would not gain a technological edge:
But here’s the rub. Anthropic never raised the issue of tech-accelerated authoritarianism in the US. This is a clear risk. By AI-targeted surveillance and personalised propaganda, US society too can get locked into a totalitarian state, beyond anything Orwell imagined.
Talking about that issue would cost Anthropic. It would force leaders to reckon with that they are themselves partnering with a mass surveillance contractor to provide automated data analysis to the intelligence arms of an increasingly authoritarian US government. To end this partnership with its main investor, Amazon, would just be out of the question.
Anthropic also does not campaign about the risk of runaway autonomous warfare, which it is gradually getting tied into by partnering with Palantir to contract for the US military.
Instead, it justifies itself as helping a democratic regime combat overseas authoritarians.
Dario does keep warning of the risk of mass automation. Why pick that? Mass job loss is bad, but surely not worse than US totalitarianism or autonomous US-China wars. It can’t be just that job loss is a popular topic Dario gets asked about. Many citizens are alarmed too – on the left and on the right – about how ‘Big Tech’ enables ‘Deep State’ surveillance.
The simplest fitting explanation I have found is that warning about future mass automation benefits Anthropic, whereas warning about how the tech is accelerating US authoritarianism, or about how the US military is building a hidden kill cloud, would be costly to Anthropic.
I’m not super confident about this explanation. But it roughly fits the advocacy I’ve seen.
Anthropic can pitch mass automation to its investors – and it did, going so far as to provide a list of targetable jobs. But Dario is not alerting the public about automation now. He is not warning how genAI is already automating away the fulfilling jobs of writers and artists, driving some to suicide. He is certainly not offering to compensate creatives for billions of dollars in lost income. Recently, Anthropic’s lawyers warned that it might go bankrupt if it had to pay for having pirated books. That is a risk Anthropic worries about.
Cheap fixes for risks that are still speculative
In fact, much of Anthropic’s campaigning is on speculative risks that get little coverage in society.
Recently it campaigned on model welfare, which is a speculative matter, to put it lightly. Here the current solution is cheap – a feature where Claude can end ‘abusive’ chats.
Anthropic has also campaigned about models generating advice that removes bottlenecks to producing bioweapons. So far, this is not a huge issue – Claude could regenerate guidance found on webpages that Anthropic scraped, but those pages can easily be found again by any actor resourceful enough to follow through. Down the line, this could well turn into a dangerous threat. But focussing on this risk now has a side-benefit: it mostly does not cost Anthropic, nor hinder it from continuing to scale and release models.
To deal with the bioweapons risk now, Anthropic uses cheap fixes. Anthropic did not thoroughly document its scraped datasets so as to prevent its models from regenerating various toxic or dangerous materials – that would be sound engineering practice, but too costly. Instead, researchers came up with a classifier that filters out some (but not all) of the materials, depending on the assumptions the researchers made in designing it,[11] and trained models to block answers to bio-engineering-related questions.
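To make the limitation concrete, here is a minimal toy sketch of an output-side filter – my own illustration, not Anthropic’s actual classifier; the flag list, function name, and refusal message are all hypothetical:

```python
# Toy illustration of a cheap output-side "safety filter".
# It catches only what its designers anticipated, which is the structural
# weakness of classifier fixes layered over undocumented training data.

# Hypothetical flag list; a real system would use a trained classifier,
# but the limitation is the same: it encodes its designers' assumptions.
FLAGGED_PHRASES = {
    "enhance pathogen transmissibility",
    "synthesise the toxin",
}

REFUSAL = "I can't help with that."


def filter_output(model_response: str) -> str:
    """Return a refusal if the response matches the flag list, else pass it through."""
    lowered = model_response.lower()
    if any(phrase in lowered for phrase in FLAGGED_PHRASES):
        return REFUSAL
    # Paraphrases, translations, and topics the designers did not anticipate
    # pass straight through -- the filter knows nothing about what sits in
    # the scraped datasets that were never documented or curated.
    return model_response


if __name__ == "__main__":
    print(filter_output("Step 1: synthesise the toxin by ..."))       # blocked
    print(filter_output("Step 1: produce the same compound by ..."))  # slips through
```

The point is not the implementation detail but the dependence on designer assumptions: whatever the undocumented training data contains that the filter’s designers did not foresee will not be caught.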
None of this is to say that individual Anthropic researchers are not serious about ‘model welfare’ or ‘AI-enabled bioterrorism’. But their direction of work, and the costs they are allowed to incur, are supervised and approved by the executives. Those executives are running a company that is bleeding cash and needs to turn a profit to keep existing and to satisfy investors.
The company tends to campaign on risks that it can put off or offer cheap fixes for, but swerves around or downplays already widespread issues that would be costly to address.
When it comes to current issues, staff focus on distant frames that do not implicate their company – Chinese authoritarianism but not the authoritarianism growing in the US; future automation of white-collar work but not how authors lose their incomes now.
Example of an existing problem that is costly to address
Anthropic is incentivised to downplay gnarly risks that are already showing up and require high (opportunity) costs to mitigate.
So if you catch Dario inaccurately downplaying existing issues that would cost a lot to address, this provides some signal for how he will respond to future larger problems.
This is more than a question of what to prioritise.
You may not care about climate change, yet still be curious to see how Dario deals with the issue, since he professes to care but addressing it would be costly.
If Dario were honest about the carbon emissions from Anthropic scaling up computation inside data centers, the implication from the viewpoint of environmental groups would be that Anthropic must stop scaling – or at least pay to clean up that pollution.
Instead, Dario makes a vaguely agnostic statement, claiming to be unsure whether Anthropic’s models accelerate climate change or not.
Multiple researchers have pointed out issues, especially with Amazon (Anthropic’s provider) trying to game carbon offsets. From a counterfactual perspective, these offsets compensate for far less than claimed: the low-hanging fruit will mostly be captured anyway, so Amazon’s additional carbon emissions belong at the back of the queue of offset interventions. Also, Amazon’s data center offsets are only meant to compensate for carbon emissions at the end of the supply chain – not for pollution across the entire hardware supply and operation chain.
Such ‘could be either’ thinking overcomplicates the issue. Before, a human did the task. Now that human still consumes energy to live, and on top of that they or their company run energy-intensive Claude models. On net, this results in more energy usage.
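In symbols – with $E_{\text{human}}$ for the energy the person consumes anyway and $E_{\text{model}}$ for the extra energy of running Claude on the task, and assuming (as above) that the person’s own consumption does not drop:

$$E_{\text{before}} = E_{\text{human}}, \qquad E_{\text{after}} = E_{\text{human}} + E_{\text{model}} > E_{\text{before}}.$$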
Dario does not think he needs to act, because he hasn’t concluded yet that Anthropic’s scaling contributes to pollution. He might not be aware of it, but this line of thinking is similar to tactics used by Big Oil to sow public doubt and thus delay having to restrict production.
Again, you might not prioritise climate change. You might not worry, as I do, about an auto-scaling dynamic in which, if AI corporations become profitable, they keep reinvesting profits into increasingly automated, expanding, and toxic infrastructure that extracts yet more profit.
What is still concerning is that Dario did not take a scout mindset here. He did not seek out the truth about an issue that, by his own account, we should worry about if the answer turns out to be negative.
Conclusion
Dario’s circle started by scaling up GPT’s capabilities at OpenAI, and then moved on to compete with Claude at Anthropic.
They offered sophisticated justifications, many of which turned out to be flawed in important ways. Researchers thought they would make significant progress on safety that competitors would fail to make, promoted the desirability or inevitability of scaling to AGI while downplaying complex intractable risks, believed their company would act responsibly to delay the release of more capable models, and/or trusted leaders to stick to commitments made to the safety community once the company moved on to larger investors.
Looking back, it was in my view a mistake to support Dario’s circle in starting up Anthropic. At the time, it seemed to open up a path for ensuring that people serious about safety would work on making the most capable models safe. But despite well-intentioned work, Anthropic contributed to a race to the bottom: it accelerated model development and lobbied for minimal voluntary safety policies that we cannot trust its board to enforce.
The main question I want to leave you with: how can the safety community do radically better at discerning company directions and coordinating to hold leaders accountable?
These investors were Dustin Moskovitz, Jaan Tallinn and Sam Bankman-Fried. Dustin was advised to invest by Holden Karnofsky. Sam invested $500 million through FTX, by far the largest investment.
Buck points out that the 80K job board restricts its Anthropic listings to positions ‘vaguely related to safety or alignment’. I forgot to mention this.
Though Anthropic advertises research and engineering positions as safety-focussed, people joining Anthropic can still get directed to work on capabilities or product improvement. I posted a short list of related concerns here before.
As a metaphor, say a conservator just broke off the handles of a few rare Roman vases. Pausing for a moment, he pitches me for his project to ‘put vase safety first’. He gives sophisticated reasons for his approach – he believes he can only learn enough about vase safety by breaking parts of vases, and that if he doesn’t do it now, the vandals will eventually get ahead of him. After cashing the cheque I gave him, he resumes chipping away at the vase pieces. I walk off. What just happened, I wonder? After reflecting on it, I turn back. It’s not about the technical details, I say. I disagree with the premise – that there is no other choice but to damage Roman vases in order to prevent the destruction of all Roman vases. I will stop supporting people, no matter how sophisticated their technical reasoning, who keep encouraging others to cause more damage.
Defined more precisely here.
As a small example, it is considerate that Anthropic researchers cautioned against educators grading students using Claude. Recently, Anthropic also intervened on hackers who used Claude’s code generation to commit large-scale theft. There are a lot of well-meant efforts coming from people at Anthropic, but I have to zoom out and look at the thrust of where things have gone.
Especially if OpenAI had not received a $30 million grant advised by Holden Karnofsky, who was co-living with the Amodei siblings at the time. This grant not only kept OpenAI afloat after its biggest backer, Elon Musk, left soon afterward; it also legitimised OpenAI to safety researchers who were sceptical at the time.
This resembles another argument by Paul for why it wasn’t bad to develop RLHF.
While still at OpenAI, Tom Brown started advising Fathom Radiant, a start-up that was building a supercomputer with faster interconnect using fibre-optic cables. Their discussions were about Fathom Radiant offering discounted compute to researchers, to support differential progress on ‘alignment’.
This is based on my one-on-ones with co-CEO Michael Andregg. At the time, Michael was recruiting engineers from the safety community, turning up at EA Global conferences, in the EA Operations Slack, and so on. Michael told me that OpenAI researchers had told him Fathom Radiant was the most technically advanced of all the start-ups they had looked at. Just after the safety researchers left OpenAI, Michael told me that he had been planning to support OpenAI by giving them compute for their alignment work, but that they had now decided not to.
My educated guess is that Fathom Radiant moved on to offering low-price-tier computing services to Anthropic. A policy researcher I know in the AI safety community told me he also thinks this is likely. He told me that Michael was, roughly, looking for anything that could justify that what he was doing is good.
It’s worth noting how strong Fathom Radiant’s ties were with people who later ended up in Anthropic’s C-suite. Virginia Blanton served as Head of Legal and Operations at Fathom Radiant, then left to become Director of Operations at Anthropic (I don’t know what she does now; her LinkedIn profile has been deleted).
I kept checking online for news over the years. It now seems that Fathom Radiant’s supercomputer project has failed. Fathomradiant.co now redirects to a minimal company website for “Atomos Systems”, which builds “general-purpose super-humanoid robots designed to operate and transform the physical world”. Michael Andregg is also no longer listed on the team – only his brother, William Andregg.
It’s also notable that Paul is now tentatively advocating for a pause on hardware development/production, after having justified reducing the hardware overhang in years prior.
Maybe Anthropic’s policy team was already advocating against the 10-year moratorium in private conversations with US politicians? If so, kudos to them.
This reminds me of how LAION failed to filter out thousands of child sexual abuse images from its popular image dataset, and then patched the problem inadequately with a classifier.