My understanding of Anthropic strategy
This post is the first half of a series about my attempts to understand Anthropic’s current strategy and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic’s ops team.) As part of my research, I read a number of Anthropic’s published papers, and spoke to people within and outside of Anthropic.
This post contains “observations” only, which I wanted to write up as a reference for anyone considering similar questions. I will make a separate post about the inferences and conclusions I’ve reached personally about working at Anthropic, based on the info I’m sharing here.
Anthropic is planning to grow. They’re aiming to be one of the “top players”, competitive with OpenAI and DeepMind, working with a similar level of advanced models. They have received outside investment, because keeping up with the state of the art is expensive, and is going to get more so. They’ve recently been hiring for a product team, in order to get more red-teaming of models and eventually have more independent revenue streams.
I think Anthropic believes that this is the most promising route to making AGI turn out well for humanity, so it’s worth taking the risk of being part of the competition and perhaps contributing to accelerating capabilities. Alternatively stated, Anthropic leadership believes that you can’t solve the problem of aligning AGI independently from developing AGI.
My current sense is that this strategy makes sense under a particular set of premises:
1. There is not, currently, an obviously better plan or route to solving alignment that doesn’t involve keeping up with state-of-the-art large models. Yes, it’s a plan with some risks, but we don’t have any better ideas yet.
2. We don’t understand deep learning systems, and we don’t have a theoretical approach; we’re at the point where actually just running experiments on current models and observing the results is the best way to get information.
   - This could at some point lead to a more general theory or theories of alignment.
   - Or there may just be practical/empirical evidence of something like an “alignment attractor basin” and knowledge of how to practically stay in it.
3. There’s a high enough probability that whatever method ends up getting us to AGI will be, basically, an extension and further exploration of current deep learning, rather than a completely new kind of architecture that doesn’t even share the same basic building blocks.
   - Note: there’s an argument that in worlds where Anthropic’s research is less useful, Anthropic is also contributing much less to actually-dangerous race dynamics, since faster progress in LLMs won’t necessarily lead to shorter timelines if LLMs aren’t a route to AGI.
4. There is, additionally, a high enough probability that behaviors observed in current-generation models will also be a factor for much more advanced models.
   - (This isn’t a claim that understanding how to align GPT-3 is enough – we’ll need to understand the new and exciting behaviors and alignment challenges that start to emerge at higher levels too – but the knowledge base being fleshed out now will be at least somewhat applicable.)
5. It’s possible, in principle, to implement this strategy such that the additional progress on alignment-related questions and positive influence on norms in the field will more than cancel out the cost of accelerating progress – that even if it brings the point at which we hit AGI-level capabilities a few months or years earlier, in expectation it will move the point at which we have an alignment solution or process for reaching one earlier by a larger factor.
6. This relies on carefully tracking what will or won’t counterfactually accelerate capabilities development, and if necessary being willing to make genuine tradeoffs – in terms of profit from deploying products, or hiring brilliant researchers who don’t care about safety, or pleasing investors – but Anthropic, specifically, is in a position to carry through on that, and will continue to be in that position, avoiding future mission drift despite the potential risk of pressure from investors. A lot of care has been put into ensuring that investors have very little influence over internal decisions and priorities.
7. Anthropic will also continue to be in a position where if the landscape changes – if a better idea does appear, or if mission drift becomes super obvious to Dario and Daniela, or if for whatever reason Anthropic’s current strategy no longer seems like a good idea – then they’ll be able to pivot, and switch to a strategy that doesn’t require keeping up as one of the “top players” with all of the attendant risks.
I think someone could disagree or have doubts on any of these points, and I would like to know more about the range of opinions on 1-4 from people who have more technical AI safety background than I do. I’m mainly going to focus on 5, 6, and 7.
Implications for Anthropic’s structure and processes
The staff whom I spoke to believe that Anthropic’s leadership, and the Anthropic team as a whole, have thought very hard about this; that the leadership team applied considerable effort to setting the company up to avoid mission drift, and continue to be cautious and thoughtful around deploying advanced systems or publishing research.
Staff at Anthropic list the following as protective factors, some historical and some ongoing:
Anthropic’s founding team consists of, specifically, people who formerly led safety and policy efforts at OpenAI, and (I am told) there’s been very low turnover since then. To the extent that Anthropic’s plan relies on the leadership being very committed to prioritizing alignment, this is evidence in that direction.
Anthropic’s corporate structure is set up to try to mitigate some of the incentive problems with being a for-profit company that takes investment (and thus has fiduciary duties, and social pressure, to focus on profitable projects). They do take investment and have a board of stakeholders, and plan to introduce a structure to ensure the mission continues to be prioritized over profit.
Anthropic tries to strongly filter new hires for culture fit and taking the potential risks of AI seriously (though at some point this may be in tension with growing the team, and it may already be). This means that they can have a strong internal culture of prioritizing safety and flagging concerns.
Anthropic’s internal culture supports all of its staff in expressing and talking about their doubts, and questioning whether deploying an advanced system or publishing a particular paper might be harmful, and these doubts are taken seriously.
There’s an argument to be made that OpenAI was already intending to push capabilities development as fast as possible, and so adding a new competitor to the ecosystem wasn’t going to give them any additional motive to go faster in order to stay ahead. (Though there are separate concerns about second-order effects, like generally “raising awareness” about the potential economic value of state-of-the-art models, and increasing investor “hype” in AI labs in general.)
While I don’t think anyone is ignoring the importance of second-order effects, there’s an argument that the first-order effect of Anthropic “competing” with OpenAI is that they might draw away investors and customers who would otherwise have funded OpenAI.
In this post I’ve done my best to neutrally report the information I have about Anthropic’s strategy, reasoning, and structure as relayed to me by staff and others who were kind enough to talk to me, and tried to avoid injecting my own worldview.
In my upcoming post (“Personal musings on Anthropic and incentives”), I intend to talk less neutrally about my reactions to the above and how it plays into my personal decision-making.
Note: I believe Anthropic thinks that large-scale, state-of-the-art models are necessary for their current work on constitutional AI and using AI-based reinforcement learning to train LLMs to be “helpful, harmless, and honest”, and that while some initial progress can be made on their mechanistic interpretability work on transformers using smaller models, they also believe this will need to be scaled up in future to get the full value.
I am told that Anthropic has had three doublings of headcount in two years, which is closer to 3x year-over-year growth, and may stay at more like 2x year-over-year, and that this is nothing like OpenAI’s early growth rate of 8x (where purportedly no filtering for cultural fit/alignment interest was applied).
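As a quick check of the arithmetic in this note, here is a minimal sketch that converts a number of headcount doublings over a period into an equivalent year-over-year multiplier. It assumes growth compounds at a constant rate across the period, which is my simplification and not something I was told; the function name is my own.

```python
def annualized_growth(doublings: int, years: float) -> float:
    """Constant year-over-year multiplier implied by `doublings` over `years`."""
    total_multiple = 2 ** doublings        # three doublings -> 8x overall
    return total_multiple ** (1 / years)   # spread evenly across the years


rate = annualized_growth(doublings=3, years=2)
print(f"{rate:.2f}x per year")  # prints "2.83x per year"
```

Three doublings in two years is 8x overall, and the square root of 8 is about 2.83, which is where "closer to 3x year-over-year" comes from.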