Bogdan Ionut Cirstea

Karma: 1,733

Automated / strongly-augmented safety research.

Bogdan Ionut Cirstea 18 Oct 2025 15:37 UTC
7 points
2
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I think the WBE intuition is probably the more useful one, and even more so when it comes to the also important question of ‘how many powerful human-level AIs should there be around, soon after AGI’, given e.g. estimates of computational requirements like in https://www.youtube.com/watch?v=mMqYxe5YkT4. Basically, WBEs set a bit of a lower bound, given that they’re both a proof of existence and that, in many ways, the physical instantiations (biological brains) are already there, lying in wait for better tech to access them in the right format and digitize them. Also, that better tech might be coming soon, especially as AI starts accelerating science and automating tasks more broadly; see e.g. https://www.sam-rodriques.com/post/optical-microscopy-provides-a-path-to-a-10m-mouse-brain-connectome-if-it-eliminates-proofreading.

Bogdan Ionut Cirstea 1 Oct 2025 10:15 UTC
2 points
0
in reply to: Stephen McAleese’s comment on: AI Safety Field Growth Analysis 2025
I think these projects show that it’s possible to make progress on major technical problems with a few thousand talented and focused people.
I don’t think it’s impossible that this would be enough, but it seems much worse to risk undershooting than overshooting in terms of the resources allocated and the speed at which this happens; especially when, at least in principle, the field could be deploying even its available resources much faster than it currently is.

Bogdan Ionut Cirstea 29 Sep 2025 11:00 UTC
9 points
1
on: Reasons to sell frontier lab equity to donate now rather than later
1. There’s likely to be lots of AI safety money becoming available in 1–2 years
I’m quite skeptical of this. As far as I understand, some existing entities (e.g. OpenPhil) could probably already be spending 10x more than they are today, without liquidity being a major factor. So the bottlenecks seem to be somewhere else (I personally suspect overly strong risk aversion and incompetence at scaling up grantmaking as major factors), and I don’t see any special reason why they’d be resolved in 1-2 years in particular (without them being about as resolvable next month, or in 5 years, or never).

Bogdan Ionut Cirstea 27 Sep 2025 19:59 UTC
5 points
−2
on: AI Safety Field Growth Analysis 2025
Based on updated data and estimates from 2025, I estimate that there are now approximately 600 FTEs working on technical AI safety and 500 FTEs working on non-technical AI safety (1100 in total).
I think it’s suggestive to compare with e.g. the number of FTEs related to addressing climate change, for a hint at how puny the numbers above are:
Using our definition’s industry approach, UK employment in green jobs was an estimated 690,900 full-time equivalents (FTEs) in 2023. (https://www.ons.gov.uk/economy/environmentalaccounts/bulletins/experimentalestimatesofgreenjobsuk/july2025)
Jobs in renewable energy reached 16.2 million globally in 2023 (https://www.un.org/en/climatechange/science/key-findings)
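As a rough sketch of the comparison above, the ratios can be computed directly from the quoted figures. The variable names are illustrative, and the global figure counts jobs rather than FTEs, so that ratio is only an order-of-magnitude indication:

```python
# Figures as quoted in the comment above; variable names are illustrative.
ai_safety_ftes = 600 + 500            # technical + non-technical AI safety FTEs
uk_green_ftes = 690_900               # UK green jobs, FTEs, 2023 (ONS estimate)
global_renewable_jobs = 16_200_000    # global renewable energy jobs, 2023 (jobs, not FTEs)

uk_ratio = uk_green_ftes / ai_safety_ftes
global_ratio = global_renewable_jobs / ai_safety_ftes

print(f"UK green FTEs per AI safety FTE: ~{uk_ratio:.0f}")
print(f"Global renewable jobs per AI safety FTE: ~{global_ratio:.0f}")
```

On these numbers, UK green jobs alone outnumber worldwide AI safety FTEs by a factor of several hundred.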

Bogdan Ionut Cirstea 25 Sep 2025 12:46 UTC
2 points
0
on: Bogdan Ionut Cirstea’s Shortform
spicy take: the ‘ultimate EA’ thing to do might soon be volunteering to get implanted with a few ultrasound BCIs (instead of e.g. donating a kidney), for lo-fi WBE data gathering reasons:
‘The probe’s small size enables potential subcranial implantation between skull and dura with PDMS encapsulation (46), providing chronic hemodynamic access where repeated monitoring is valuable.’
‘The complete system captures brain activity up to 5-8 cm depth across a 60° × 60° field of view (FOV) at 1-10 Hz temporal resolution, while maintaining an 11.52 × 8.64 mm footprint suitable for integration into surgical workflows and future intracranial implantation.’
https://www.medrxiv.org/content/10.1101/2025.08.19.25332261v1.full-text

Bogdan Ionut Cirstea 24 Sep 2025 20:02 UTC
15 points
0
on: Bogdan Ionut Cirstea’s Shortform
For some perspective:
‘New data centers put Stargate ahead of schedule to secure full $500 billion, 10-gigawatt commitment by end of 2025.’ https://openai.com/index/five-new-stargate-sites/
‘One estimate puts total funding for AI safety research at only $80-130 million per year over the 2021-2024 period.’ https://www.schmidtsciences.org/safetyscience/#:~:text=One%20estimate%20puts%20total%20funding,period%20(LessWrong%2C%202024)
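A minimal sketch of the gap between the two quoted figures. Note the assumption baked in: the Stargate number is a total commitment while the safety number is annual funding, so the ratio is only a rough sense of scale, not a like-for-like comparison. Variable names are illustrative.

```python
# Figures as quoted above; variable names are illustrative.
stargate_commitment_usd = 500_000_000_000   # total commitment
safety_funding_low_usd = 80_000_000         # per year, low end of the estimate
safety_funding_high_usd = 130_000_000       # per year, high end of the estimate

# One total commitment compared against one year of safety funding.
ratio_vs_low = stargate_commitment_usd / safety_funding_low_usd
ratio_vs_high = stargate_commitment_usd / safety_funding_high_usd

print(f"Stargate commitment vs. one year of safety funding: "
      f"~{ratio_vs_high:.0f}x to ~{ratio_vs_low:.0f}x")
```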

Bogdan Ionut Cirstea 5 Sep 2025 13:59 UTC
43 points
−4
on: Bogdan Ionut Cirstea’s Shortform
NVIDIA might be better positioned to first get to/first scale up access to AGIs than any of the AI labs that typically come to mind.
They’re already the world’s highest-market-cap company, have huge and increasing quarterly income (and profit) streams, and can get access to the world’s best AI hardware at literally the best price (the production cost they pay). Given that access to hardware seems to be a far more constraining input than e.g. algorithms or data, when AI becomes much more valuable because it can replace larger portions of human workers, they should be highly motivated to use large numbers of GPUs themselves and train their own AGIs, rather than e.g. sell their GPUs and buy AGI access from competitors. Especially since poaching talented AGI researchers would probably (still) be much cheaper than building up the hardware required for the training runs (e.g. see Meta’s recent hiring spree); and since access to compute is already an important factor in algorithmic progress and AIs will likely increasingly be able to substitute for top human researchers in driving algorithmic progress. Similarly, since the AI software is a complementary good to the hardware they sell, they should be highly motivated to be able to produce their own in-house, and sell it as a package with their hardware (rather than have to rely on AGI labs to build the software that makes the hardware useful).
This possibility seems to me wildly underconsidered/underdiscussed, at least in public.

Bogdan Ionut Cirstea 5 Sep 2025 10:01 UTC
4 points
−2
in reply to: peterbarnett’s comment on: peterbarnett’s Shortform
I don’t have a strong opinion about how good or bad this is.
But it seems like potentially additional evidence of how difficult it is to predict/understand people’s motivations/intentions/susceptibility to value drift, even with decades of track record, and thus of how counterfactually low the bar is for AIs to be more transparent to their overseers than human employees/colleagues.

Bogdan Ionut Cirstea 3 Sep 2025 18:17 UTC
LW: 2 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
The faster 2024-2025 agentic software engineering time horizon (see figure 19 in METR’s paper) has a 4 month doubling time.
Isn’t the SWE-Bench figure and doubling time estimate from the blogpost even more relevant here than fig. 19 from the METR paper?

Bogdan Ionut Cirstea 20 Aug 2025 18:52 UTC
21 points
7
on: My AGI timeline updates from GPT-5 (and 2025 so far)
I think I agree directionally with the post.
But I’ve been increasingly starting to wonder if software engineering might not be surprisingly easy to automate when the right data/environments are used at much larger scale, e.g. Github issues (see e.g. D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff) or semi-automated pipelines to build SWE RL environments (see e.g. Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs), which seem potentially surprisingly easy to automatically scale up. It now seems much more plausible to me that this could be a scaling data problem than a scaling compute problem, and that progress might be fast. Also, it seems likely that there might be some flywheel effect of better AIs → better automated collection + filtering of SWE environments/data → better AIs, etc. And ‘Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs’ has already shown data scaling laws:

Also, my impression is that SWE is probably the biggest bottleneck in automating AI R&D, based on results like those in Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
and especially based on the length of the time horizons involved in the SWE part vs. other parts of the AI R&D cycle.

Bogdan Ionut Cirstea 5 Aug 2025 12:59 UTC
2 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
It’s probably also very synergistic with various d/acc approaches and measures more broadly.
Intuitively, the higher the (including global) barriers to AI takeover, human extinction, etc., the more likely the AI will have to do a lot of reasoning and planning to have a decent probability of success; so the more difficult it will be to successfully achieve that opaquely and not be caught by CoT (and inputs, and outputs, and tools, etc.) monitors.

Bogdan Ionut Cirstea 2 Aug 2025 12:03 UTC
2 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
And this suggests 100x acceleration in research cycles if ideation + implementation were automated, and humans were relegated to doing peer reviewing of AI-published papers:
https://x.com/BogdanIonutCir2/status/1940100507932197217

Bogdan Ionut Cirstea 31 Jul 2025 16:05 UTC
3 points
0
on: Can LLMs learn Steganographic Reasoning via RL?
I think it might really be wise to use a canary string, or some other mechanism to hide this kind of knowledge from future (pre)training runs, e.g. https://turntrout.com/dataset-protection

Bogdan Ionut Cirstea 29 Jul 2025 15:34 UTC
2 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Based on current trends from https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/, this could already have happened by sometime between 2027 and 2030:

Bogdan Ionut Cirstea 28 Jul 2025 18:23 UTC
3 points
0
on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Excited about this research agenda, including its potential high synergy with unlearning from weights:
https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm.
Intuitively, if one (almost-verifiably) removes the most dangerous kinds of information from the weights, the model could then only access it in-context. If that’s in the (current) form of text, it should fit perfectly with CoT monitorability.
My guess is that on the current ‘default’ ML progress trajectory, this combo would probably buy enough safety all the way to automation of ~all prosaic AI safety research. Related, this other comment of mine on this post:
https://www.lesswrong.com/posts/7xneDbsgj6yJDJMjK/chain-of-thought-monitorability-a-new-and-fragile?commentId=eqidvTEaom4bmmSkv

Bogdan Ionut Cirstea 28 Jul 2025 16:59 UTC
4 points
2
in reply to: Daniel Kokotajlo’s comment on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
at least delay it for a while
Notably, even just delaying it until we can (safely) automate large parts of AI safety research would be both a very big deal, and intuitively seems quite tractable to me. E.g. the task-time-horizons required seem to be (only) ~100 hours for a lot of prosaic AI safety research:
https://x.com/BogdanIonutCir2/status/1948152133674811518

Bogdan Ionut Cirstea 16 Jun 2025 10:06 UTC
2 points
0
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
I would love to see an AI safety R&D category.
My intuition is that quite a few crucial AI safety R&D tasks are probably much shorter-horizon than AI capabilities R&D, which should be very helpful for automating AI safety R&D relatively early. E.g. the compute and engineer-hours spent on pretraining (where most capabilities [still] seem to be coming from) are a few OOMs larger than those spent on fine-tuning (where most intent-alignment seems to be coming from).

Bogdan Ionut Cirstea 9 Jun 2025 18:48 UTC
7 points
0
on: METR’s Observations of Reward Hacking in Recent Frontier Models
Seems pretty minor for now though:

The actual cheating behavior METR has observed seems relatively benign (if annoying). While it’s possible to construct situations where this behavior could cause substantial harm, they’re rather contrived. That’s because the model reward hacks in straightforward ways that are easy to detect. When the code it writes doesn’t work, it’s generally in a way that’s easy to notice by glancing at the code or even interacting with the program. Moreover, the agent spells out the strategies it’s using in its output and is very transparent about what methods it’s using.
Inasmuch as reward hacking is occurring, we think it’s good that the reward hacking is very obvious: the agents accurately describe their reward hacking behavior in the transcript and in their CoT, and the reward hacking strategies they use typically cause the programs they write to fail in obvious ways, not subtle ones. That makes it an easier-to-spot harbinger of misalignment and makes it less likely the reward hacking behavior (and perhaps even other related kinds of misalignment) causes major problems in deployment that aren’t noticed and addressed.

Bogdan Ionut Cirstea 27 May 2025 18:42 UTC
8 points
2
on: Does the Universal Geometry of Embeddings paper have big implications for interpretability?
Yes, I do think this should be a big deal, and even more so for monitoring (than for understanding model internals). It should also have been at least somewhat predictable, based on theoretical results like those in I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? and in All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling.

Bogdan Ionut Cirstea 11 May 2025 17:11 UTC
8 points
2
on: A “Bitter Lesson” Approach to Aligning AGI and ASI
I suspect you’d be interested in this paper, which seems to me like a great proof of concept: Safety Pretraining: Toward the Next Generation of Safe AI.
