Automated / strongly-augmented safety research.
Bogdan Ionut Cirstea
Lukewarm take: the risk of the US sliding into autocracy seems high enough at this point that I think it’s probably more impactful now for EU citizens to work on pushing sovereign EU AGI capabilities, than on safety for US AGI labs.
This talk seems like an interesting and novel proposal for testing for artificial consciousness, and for uploading, based on split-brain-related phenomena: https://www.youtube.com/watch?v=xNcOgYOvE_k
Seems like it might also be tractable to turn something like it, and reporting, into legislation which could meaningfully reduce x-risks.
I don’t think ‘catastrophe’ is the relevant scary endpoint; e.g., COVID was a catastrophe, but unlikely to have been x-risky. Something like a point-of-no-return (e.g. humanity getting disempowered) seems more relevant.
I’m pretty confident it’s feasible to at the very least 10x prosaic AI safety research through AI augmentation without increasing x-risk by more than 1% yearly (and that would probably be a conservative upper bound). For some intuition, consider the low levels of x-risk that current AIs pose, while already having software engineering 50%-time-horizons of around 4 hours and already getting IMO gold medals. Both of these skills (coding and math) seem among the most useful for strongly augmenting AI safety research, especially since LLMs already seem like they might be human-level at (ML) research ideation.
Also, AFAICT, there is a lot of low-hanging fruit for making current AIs safer, some of which I’d suspect is barely being picked at all (and even with this relative recklessness, current AIs are still surprisingly safe and aligned, to the point where I think Claudes are probably already more beneficial and more prosocial companions than the median human). Things like unlearning / filtering out the most dangerous and most antisocial data, production evaluations, trying harder to preserve CoT legibility through rephrasing or other forms of regularization, or, more speculatively, trying to use various forms of brain data for alignment.
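As a toy illustration of how low-tech some of this fruit is, here’s a minimal sketch of the data-filtering idea; the blocked phrases, function names, and keyword-matching approach are purely illustrative assumptions, not any lab’s actual pipeline:

```python
# Minimal sketch of pretraining-data filtering; the keyword list is a purely
# illustrative placeholder, not a real lab's criteria (which would more
# plausibly use a trained classifier than literal phrase matching).
DANGEROUS_PHRASES = {"synthesize a nerve agent", "enrich uranium", "build a bioweapon"}

def is_dangerous(document: str) -> bool:
    """Flag a document if it contains any blocked phrase."""
    text = document.lower()
    return any(phrase in text for phrase in DANGEROUS_PHRASES)

def filter_corpus(documents: list[str]) -> list[str]:
    """Drop flagged documents before pretraining."""
    return [doc for doc in documents if not is_dangerous(doc)]

if __name__ == "__main__":
    corpus = [
        "How to bake sourdough bread at home.",
        "A step-by-step guide to enrich uranium in a garage.",  # would be dropped
    ]
    print(len(filter_corpus(corpus)))  # -> 1
```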
I doubt this would be the ideal moment for a pause, even assuming it were politically tractable, which it obviously isn’t right now.
Very likely you’d want to pause after you’ve automated AI safety research, or at least after you’ve strongly (e.g. 10x) accelerated prosaic AI safety research (none of which has happened yet), given how small the current human AI safety workforce is, and how much more numerous (and very likely cheaper per equivalent hour of labor) an automated workforce would be.
There’s also this paper and benchmark/eval, which might provide some additional evidence: The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements.
‘Shenzhen team completed a working prototype of an EUV machine in early 2025, sources say
The lithography machine, built by former ASML engineers, fills a factory floor, sources say
China’s EUV machine is undergoing testing, and has not produced working chips, sources say
Government is targeting 2028 for working chips, but sources say 2030 is more likely’
This seems to assume that the quality of labor from a small, highly selected number of researchers can matter more than a much larger amount of somewhat lower-quality labor from a much larger number of participants. That seems like a pretty dubious assumption, especially given that other strategies seem possible: e.g. using a larger pool of participants to produce more easily verifiable, more prosaic AI safety research now, even at the risk of lower quality, so as to allow for better alignment + control of the kinds of AI models which will, in the future, for the first time be able to automate the higher-quality and maybe less verifiable (e.g. conceptual) research that fewer people might be able to produce today. Put more briefly: quantity can have a quality of its own, especially in more verifiable research domains.
Some of the claims around the quality of early rationalist / EA work also seem pretty dubious. E.g. a lot of the Yudkowsky-and-friends worldview is looking wildly overconfident and likely wrong.
GPT-5.1-Codex-Max being (only) on trend on METR’s task-horizon eval, despite being ‘trained on agentic tasks across software engineering, math, research’ and despite being recommended for (less general) use ‘only for agentic coding tasks in Codex or Codex-like environments’, seems like very significant further evidence against expecting trend breaks from quickly and massively scaling up RL on agentic software engineering.
I think the WBE intuition is probably the more useful one, and even more so when it comes to the also-important question of ‘how many powerful human-level AIs should there be around, soon after AGI’, given e.g. estimates of computational requirements like in https://www.youtube.com/watch?v=mMqYxe5YkT4. Basically, WBEs set a bit of a lower bound, given that they’re both a proof of existence and that, in many ways, the physical instantiations (biological brains) are already there, lying in wait for better tech to access them in the right format and digitize them. Also, that better tech might be coming soon, especially as AI starts accelerating science and automating tasks more broadly; see e.g. https://www.sam-rodriques.com/post/optical-microscopy-provides-a-path-to-a-10m-mouse-brain-connectome-if-it-eliminates-proofreading.
I think these projects show that it’s possible to make progress on major technical problems with a few thousand talented and focused people.
I don’t think it’s impossible that this would be enough, but it seems much worse to risk undershooting than overshooting in terms of the resources allocated and the speed at which this happens, especially when, at least in principle, the field could be deploying even its available resources much faster than it currently is.
1. There’s likely to be lots of AI safety money becoming available in 1–2 years
I’m quite skeptical of this. As far as I understand, some existing entities (e.g. OpenPhil) could probably already be spending 10x more than they are today, without liquidity being a major factor. So the bottlenecks seem to be elsewhere (I personally suspect overly strong risk aversion and incompetence at scaling up grantmaking as major factors), and I don’t see any special reason why they’d be resolved in 1-2 years in particular, rather than being about as resolvable next month, or in 5 years, or never.
Based on updated data and estimates from 2025, I estimate that there are now approximately 600 FTEs working on technical AI safety and 500 FTEs working on non-technical AI safety (1100 in total).
I think it’s suggestive to compare with e.g. the number of FTEs related to addressing climate change, for a hint at how puny the numbers above are:
Using our definition’s industry approach, UK employment in green jobs was an estimated 690,900 full-time equivalents (FTEs) in 2023. (https://www.ons.gov.uk/economy/environmentalaccounts/bulletins/experimentalestimatesofgreenjobsuk/july2025)
Jobs in renewable energy reached 16.2 million globally in 2023 (https://www.un.org/en/climatechange/science/key-findings)
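For a rough sense of the gap, here is the simple arithmetic on the figures quoted above (numbers taken directly from the estimates cited; the comparison is obviously loose):

```python
# Rough ratios based on the figures quoted above.
ai_safety_ftes = 1_100               # technical + non-technical AI safety, 2025 estimate
uk_green_jobs_ftes = 690_900         # UK green jobs, 2023 (ONS)
global_renewables_jobs = 16_200_000  # global renewable energy jobs, 2023 (UN)

print(uk_green_jobs_ftes / ai_safety_ftes)      # ~628x: UK green jobs alone
print(global_renewables_jobs / ai_safety_ftes)  # ~14,700x: global renewables jobs
```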
Spicy take: the ‘ultimate EA’ thing to do might soon be volunteering to get implanted with a few ultrasound BCIs (instead of e.g. donating a kidney), for lo-fi WBE data-gathering reasons:
’The probe’s small size enables potential subcranial implantation between skull and dura with PDMS encapsulation (46), providing chronic hemodynamic access where repeated monitoring is valuable.’
‘The complete system captures brain activity up to 5-8 cm depth across a 60° × 60° field of view (FOV) at 1-10 Hz temporal resolution, while maintaining an 11.52 × 8.64 mm footprint suitable for integration into surgical workflows and future intracranial implantation.’
https://www.medrxiv.org/content/10.1101/2025.08.19.25332261v1.full-text
For some perspective:
‘New data centers put Stargate ahead of schedule to secure full $500 billion, 10-gigawatt commitment by end of 2025.’ https://openai.com/index/five-new-stargate-sites/
‘One estimate puts total funding for AI safety research at only $80-130 million per year over the 2021-2024 period.’ https://www.schmidtsciences.org/safetyscience/#:~:text=One%20estimate%20puts%20total%20funding,period%20(LessWrong%2C%202024)
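To make the gap concrete, here is the rough arithmetic on the two quoted figures (noting it compares a multi-year capital commitment to yearly research funding, so it’s only an order-of-magnitude gesture):

```python
# Rough ratio of the quoted Stargate commitment to estimated yearly AI safety funding.
stargate_commitment = 500e9                  # $500 billion (multi-year commitment)
ai_safety_funding_per_year = (80e6, 130e6)   # $80-130 million per year, 2021-2024 estimate

for yearly in ai_safety_funding_per_year:
    print(stargate_commitment / yearly)      # ~6,250x and ~3,850x respectively
```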
NVIDIA might be better positioned to first get to/first scale up access to AGIs than any of the AI labs that typically come to mind.
They’re already the world’s highest-market-cap company, have huge and increasing quarterly revenue (and profit) streams, and can get access to the world’s best AI hardware at literally the best price (the production cost they pay). Given that access to hardware seems to be a far more constraining input than e.g. algorithms or data, once AI becomes much more valuable because it can replace larger portions of human workers, they should be highly motivated to use large numbers of GPUs themselves and train their own AGIs, rather than e.g. sell their GPUs and buy AGI access from competitors. This holds especially since poaching talented AGI researchers would probably (still) be much cheaper than building up the hardware required for the training runs (e.g. see Meta’s recent hiring spree), and since access to compute is already an important factor in algorithmic progress and AIs will likely increasingly be able to substitute for top human researchers in driving algorithmic progress. Similarly, since AI software is a complementary good to the hardware they sell, they should be highly motivated to produce their own in-house and sell it as a package with their hardware (rather than having to rely on AGI labs to build the software that makes their hardware useful).

This possibility seems to me wildly underconsidered/underdiscussed, at least in public.
I don’t have a strong opinion about how good or bad this is.
But it seems like potentially additional evidence of how difficult it is to predict/understand people’s motivations/intentions/susceptibility to value drift, even with decades of track record, and thus of how counterfactually low the bar is for AIs to be more transparent to their overseers than human employees/colleagues.
The faster 2024-2025 agentic software engineering time horizon (see figure 19 in METR’s paper) has a 4 month doubling time.
Aren’t the SWE-Bench figure and doubling-time estimate from the blog post even more relevant here than fig. 19 from the METR paper?
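For intuition on what the quoted 4-month doubling time implies, a bit of compounding arithmetic:

```python
# What a 4-month doubling time in task horizon implies over a year.
doubling_time_months = 4
growth_per_year = 2 ** (12 / doubling_time_months)
print(growth_per_year)  # 8.0, i.e. time horizons growing ~8x per year on this trend
```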
I think I agree directionally with the post.
But I’ve been increasingly starting to wonder whether software engineering might be surprisingly easy to automate once the right data/environments are used at much larger scale, e.g. GitHub issues (see e.g. D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff) or semi-automated pipelines to build SWE RL environments (see e.g. Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs), both of which seem potentially surprisingly easy to scale up automatically. It now seems much more plausible to me that this could be a data-scaling problem rather than a compute-scaling problem, and that progress might be fast. Also, it seems likely that there might be some flywheel effect of better AIs → better automated collection + filtering of SWE environments/data → better AIs, etc. And ‘Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs’ has already shown such data scaling laws.
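As a toy illustration of what fitting such a data scaling law looks like, here is a minimal sketch; the (dataset size, resolve rate) points below are invented for illustration and are not taken from the Skywork-SWE paper:

```python
# Toy power-law fit, performance ≈ a * N^b, done as a linear fit in log-log space.
# The (num_trajectories, resolve_rate) points are invented for illustration only.
import numpy as np

num_trajectories = np.array([1_000, 2_000, 4_000, 8_000])  # hypothetical dataset sizes
resolve_rate = np.array([0.20, 0.24, 0.29, 0.35])          # hypothetical benchmark scores

# Linear fit on log-log axes: log(rate) = b * log(N) + log(a)
b, log_a = np.polyfit(np.log(num_trajectories), np.log(resolve_rate), 1)
print(f"fitted exponent b ≈ {b:.2f}")
print(f"naive extrapolation to 16k trajectories: {np.exp(log_a) * 16_000 ** b:.2f}")
```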
Also, my impression is that SWE is probably the biggest bottleneck in automating AI R&D, based on results like those in Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, and especially based on the length of the time horizons involved in the SWE part vs. other parts of the AI R&D cycle.
From https://pulltheplug.uk/:
I think this would probably be a disaster, given how misinformed and unwise large parts of the broad public have been on many other scientific issues (e.g. vaccines, GMOs, nuclear power).
The rest of their views don’t inspire much confidence in their epistemics either: