Software Engineer interested in AI and AI safety.
Stephen McAleese
Also that’s an average statistic and the distribution could be very uneven with some key projects having little AI generated code. For example, 90% of code written by AI could mean that there are nine straightforward web apps where AI is writing 100% of code and then a single algorithms codebase that contains the most valuable algorithms (e.g. tokenization, attention calculations) that is mostly hand written.
Andrej Karpathy recently appeared on the Dwarkesh podcast where he said that although he uses AI heavily for web apps, his new Nano Chat project was written with just AI autocomplete without agents.
That’s a good question. One approach I took is to look at the research agendas and outputs (e.g. Google DeepMinds AI safety research agenda) and estimate the number of FTEs based on those.
I would say that I’m including teams that are working full-time on advancing technical AI safety or interpretability (e.g. the GDM Mechanistic Interpretability Team).
To the best of my knowledge, there are a few teams like that at Google DeepMind and Anthropic though I could be underestimating given that these organizations have been growing rapidly over the past few years.
A weakness of this approach is that there could be large numbers of staff who sometimes work on AI safety and significantly increase the effective number of AI safety FTEs at the organization.
Good observation, thanks for sharing.
One possible reason is that I’ve included more organizations in this updated post and this would raise many estimates.
Another reason is that in the old post, I used a linear model that assumed that an organization started with 1 FTE when founded and linearly increased until the current number (example: an organization has 10 FTEs in 2025 and was founded in 2015. Assume 1 FTE in 2015, 2 FTEs in 2016 … 10 in 2025).
The new model is simpler and just assumes the current number for all years (e.g. 10 in 2015 and 10 in 2025) so it’s estimates for earlier years are higher than the previous model. See my response to Daniel above.
I think it’s hard to pick a reference class for the field of AI safety because the number of FTEs working on comparable fields or projects can vary widely.
Two extremes examples:
- Apollo Program: ~400,000 FTEs
- Law of Universal Gravitation: 1 FTE (Newton)Here are some historical challenges which seem comparable to AI safety since they are technical, focused on a specific challenge, and relatively recent [1]:
Pfizer-BioNTech vaccine (2020): ~2,000 researchers and ~3,000 FTEs for manufacturing and logistics
Human genome project (1990 − 2003): ~3,000 researchers across ~20 major centers
ITER fusion experiment (2006 - present): ~2,000 engineers and scientists, ~5000 FTEs in total
CERN and LHC (1994 - present): ~3000 researchers working onsite, ~15,000 collaborators arouond the world.
I think these projects show that it’s possible to make progress on major technical problems with a few thousand talented and focused people.
- ^
These estimates were produced using ChatGPT with web search.
I’m pretty sure that’s just a mistake. Thanks for spotting it! I’ll remove the duplicated row.
For each organization, I estimated the number of FTEs by looking at the team members page, LinkedIn, and what kinds of outputs have been produced by the organization and who is associated with them. Then the final estimate is an intuitive guess based on this information.
Thanks for your helpful feedback Daniel. I agree that the estimate for 2015 (~50 FTEs) is too high. The reason why is that the simple model assumes that the number of FTEs is constant over time as soon as the organization is founded.
For example, the FTE value associated with Google DeepMind is 30 today and the company was founded in 2010 so the value back then is probably too high.
Perhaps a more realistic model would assume that the organization has 1 FTE when founded and linearly increases. Though this model would be inaccurate for organizations that grow rapidly and then plateau in size after being founded.
AI Safety Field Growth Analysis 2025
Thanks for the post. It covers an important debate: whether mechanistic interpretability is worth pursuing as a path towards safer AI. The post is logical and makes several good points but I find it’s style too formal for LessWrong and it could be rewritten to be more readable.
Thank you for taking the time to comment and for pointing out some errors in the post! Your attention to detail is impressive. I updated the post to reflect your feedback:
I removed the references to S1 and S2 in the IOI description and fixed the typos you mentioned.
I changed “A typical vector of neuron activations such as the residual stream...” to “A typical activation vector such as the residual stream...”
Good luck with the rest of the ARENA curriculum! Let me know if you come across anything else.
While I disagree with a lot of this post, I thought it was interesting and I don’t think it should have negative karma.
I haven’t heard anything about RULER on LessWrong yet:
RULER (Relative Universal LLM-Elicited Rewards) eliminates the need for hand-crafted reward functions by using an LLM-as-judge to automatically score agent trajectories. Simply define your task in the system prompt, and RULER handles the rest—no labeled data, expert feedback, or reward engineering required.
✨ Key Benefits:
2-3x faster development—Skip reward function engineering entirely
General-purpose—Works across any task without modification
Strong performance—Matches or exceeds hand-crafted rewards in 3⁄4 benchmarks
Easy integration—Drop-in replacement for manual reward functions
Apparently it allows LLM agents to learn from experience and significantly improves reliability.
These talks are fascinating. Thanks for sharing.
Understanding LLMs: Insights from Mechanistic Interpretability
Great post, it explained some of the economics of job automation in simple terms and clarified my thinking on the subject which is not easy to do. This post has fewer upvotes than it should have.
An alternative idea is to put annual quotas on GPU production. The oil and dairy industries already do this to control prices and the fishing industry does it to avoid overfishing.
Thank you for the reply!
Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:
It’s often the case that evaluation is easier than generation which would give the classifier an edge over the generator.
It’s possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
It seems like there are quite a few examples of learned classifiers working well in practice:
It’s hard to write spam that gets past an email spam classifier.
It’s hard to jailbreak LLMs.
It’s hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.
That said, from what I’ve read, researchers doing RL with verifiable rewards with LLMs (e.g. see the DeepSeek R1 paper) have only had success so far with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
So I think we’ll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
In the post you say that human programmers will write the AI’s reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
But it seems likely to me that programmers won’t know what code to write for the reward function since it would be hard to encode complex human values. In Superintelligence, Nick Bostrom calls this manual approach “direct specification” of values and argues that it’s naive. Instead, it seems likely to be that programmers will continue to use reward learning algorithms like RLHF where:
The human programmers have a dataset of correct behaviors or a natural language description of what they want and they use this information to create a reward function or model automatically (e.g. Text2Reward).
This learned reward model or generated code is used to train the policy.
If this happens then I think the evolution analogy would apply where there is some outer optimizer like natural selection that is choosing the reward function and then the reward function is the inner objective that is shaping the AI’s behavior directly.
Edit: see AGI will have learnt reward functions for an in-depth post on the subject.
I think it depends on the context. It’s the norm for employees in companies to have managers though as @Steven Byrnes said, this is partially for motivational purposes since the incentives of employees are often not fully aligned with those of the company. So this example is arguably more of an alignment than a capability problem.
I can think of some other examples of humans acting in highly autonomous ways:
To the best of my knowledge, most academics and PhD students are expected to publish novel research in a highly autonomous way.
Novelists can work with a lot of autonomy when writing a book (though they’re a minority).
There are also a lot of personal non-work goals like saving for retirement or raising kids which require high autonomy over a long period of time.
Small groups of people like a startup can work autonomously for years without going off the rails like a group of LLMs probably would after a while (e.g. the Claude bliss attractor).
Excellent post, thank you for taking the time to articulate your ideas in a high-quality and detailed way. I think this is a fantastic addition to LessWrong and the Alignment Forum. It offers a novel perspective on AI risk and does so in a curious and truth-seeking manner that’s aimed at genuinely understanding different viewpoints.
Here are a few thoughts on the content of the first post:
I like how it offers a radical perspective on AGI in terms of human intelligence and describes the definition in an intuitive way. This is necessary as increasingly AGI is being redefined as something like “whatever LLM comes out next year”. I definitely found the post illuminating and resulted in a perspective shift because it described an important but neglected vision of how AGI might develop. It feels like the discourse around LLMs is sucking the oxygen out of the room, making it difficult to seriously consider alternative scenarios.
I think the basic idea in the post is that LLMs are built by applying an increasing amount of compute to transformers trained via self-supervised or imitation learning but LLMs will be replaced by a future brain-like paradigm that will need much less compute while being much more effective.
This is a surprising prediction because it seems to run counter to Rich Sutton’s bitter lesson which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I’m just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.
On the other hand, there is an ongoing trend of algorithmic progress and increasing computational efficiency which could smoothly lead to the future described in this post (though the post seems to describe a more discontinuous break between current and future AI paradigms).
If the post’s prediction comes true, then I think we might see a new “biological lesson”: brain-like algorithms will replace deep learning which replaced GOFAI.
I think you’re overstating how difficult it is for the government to regulate AI. With the exception of SB 53 in California, the reason not much has happened yet is that there have been barely any attempts by governments to regulate AI. I think all it would take is for some informed government to start taking this issue seriously (in a way that LessWrong people already do).
I think this may be because the US government tends to take a hands off approach and assume the market knows best which is usually true.
I think it will be informative to see how China handles this because they have a track record of heavy-handed government interventions like banning Google, the 2021 tech industry crackdown, extremely strict covid lockdowns and so on.
From some quick research online, the number of private tutoring institutions and the revenue of the private tutoring sector fell by ~80% when the Chinese government banned for-profit tutoring in 2021 despite education having pretty severe arms race dynamics similar to AI.