Nathan Helm-Burger

Karma: 2,292

AI alignment researcher, ML engineer. Master’s in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here’s an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

relevant quote:

“There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training.” https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

Nathan Helm-Burger 8 Apr 2022 20:39 UTC
81 points
on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
About 15 years ago, before I’d started professionally studying and doing machine learning research and development, my timeline had most of its probability mass around 60-90 years from then. This was based on my neuroscience studies and thinking about how long I thought it would take to build a sufficiently accurate emulation of the human brain to be functional. About 8 years ago, studying machine learning full time, AlphaGo coming out was inspiration for me to carefully rethink my position, and I realized there were a fair number of shortcuts off my longer figure that made sense, and updated to more like 40-60 years. About 3 years ago, GPT-2 gave me another reason to rethink with my then fuller understanding. I updated to 15-30 years. In the past couple of years, with the repeated success of various explorations of the scaling law, the apparent willingness of the global community to rapidly scale investments in large compute expenditures, and yet further knowledge of the field, I updated to more like 2-15 years as having 80% of my probability mass. I’d put most of that in the 6-12 year range, but I wouldn’t be shocked if things turned out to be easier than expected and something really took off next year.
One of the things that makes me think the BioAnchors estimate is a bit too far into the future is that I know from neuroscience that it’s possible for a human to have a sufficient set of brain functions to count as a General Intelligence by a fairly reasonable standard with significant chunks of their brain dead or missing. I mean, they’ll not be in great shape as they’ll be missing some stuff, but if the stuff they’re missing is non-critical they can still function well enough to be a minimal GI. Plenty well enough to be scary if they were a self-improving self-replicating agentive AI.
So anyway, yeah, I’ve been scared for a while now. Latest news has just reinforced my belief we are in a short timeline world, not surprised me. Glad to see more people getting on board with my point of view.

Nathan Helm-Burger 9 Apr 2022 22:30 UTC
68 points
on: A concrete bet offer to those with short AI timelines
Ok, I take your bet for 2030. I win, you give me $1000. You win, I give you $3000. Want to propose an arbiter? (since someone else also took the bet, I’ll get just half the bet, their $500 vs my $1500)

Nathan Helm-Burger 25 Mar 2023 16:58 UTC
56 points
43
on: Good News, Everyone!
Lol. Nailed it. Much plan, so alignment, very wow.

Nathan Helm-Burger 14 Jun 2023 1:21 UTC
24 points
11
on: Critiques of prominent AI safety labs: Conjecture
I think the critique of Redwood Research made a few valid points. My own critique of Redwood would go something like:
1. they hired too few support staff to keep their primary researchers well supported and happy, and thus had unnecessarily high turnover
2. they hired too high a proportion of junior researchers, in an unsettled phase of life without high likelihood of sticking with a current job, again contributing to too much turnover and to a lack of researchers who knew what to expect from a workplace and how to maintain their work-life balance.
Not much of a critique, honestly. A reasonable mistake that a lot of start-ups led by young inexperienced people would make, and certainly something fixable. Also, they have longer AGI timelines than me, and thus are not acting with what I see as sufficient urgency. But I don’t think that it’s necessarily fair for me to critique orgs for having their own well-considered opinions on this different from my own. I’m not even sure if them having my timelines would improve their output any.
This critique on the other hand seems entirely invalid and counterproductive. You criticize Conjecture’s CEO for being… a charismatic leader good at selling himself and leading people? Because he’s not… a senior academic with a track record of published papers? Nonsense. Expecting the CEO to be the primary technical expert seems highly misguided to me. The CEO needs to know enough about the technical aspects to be able to hire good technical people, and then needs to coordinate and inspire those people and promote the company. I think Connor is an excellent pick for this, and your criticisms of him are entirely beside the point, and also rather rude.
Conjecture, and Connor, seem to actually be trying to do something which strikes at the heart of the problem. Something which might actually help save us three years from now, when the leading AI labs have in their possession powerful AGI after a period of recursive self-improvement by almost-but-not-quite-AGI. I expect this AGI will be too untrustworthy to make more than very limited use of. So then, looking around for ways to make use of their newfound dangerous power, what will they see? Some still immature interpretability research. Sure. And then? Maybe they’ll see the work Conjecture has started and realize that breaking down the big black magic box into smaller more trustworthy pieces is one of the best paths forward. Then they can go knocking on Conjecture’s door, collect the research so far, and finish it themselves with their abundant resources.
My criticism of their plan is primarily: you need even more staff and more funding to have a better chance of this working. Which is basically the opposite of the conclusion you come to.
As for the untrustworthiness of their centralized infohazard policy… Yeah, this would be bad if the incentives were for the central individual to betray the world for their own benefit. That’s super not the case here. The incentive is very much the opposite. For much the same reason that I feel pretty trusting of the heads of DeepMind, OpenAI, and Anthropic. Their selfish incentives to not destroy themselves and everyone they love are well aligned with humanity’s desire to not be destroyed. Power-seeking in this case is a good thing! Power over the world through AGI, to these clever people, clearly means learning to control that untrustworthy AGI… thus means learning how to save the world. My threat model says that the main danger comes not from the heads of the labs, but from the un-safety-convinced employees who might leave to start their own projects, or outside people replicating the results the big labs have achieved but with far fewer safety precautions.
I think reasonable safety precautions, like not allowing unlimited unsupervised recursive self-improvement, not allowing source code or model weights to leave the lab, sandbox testing, etc., can actually be quite effective in the short term in protecting humanity from rogue AGI. I don’t think surprise-FOOM-in-a-single-training-run-resulting-in-a-sandbox-escaping-superintelligence is a likely threat model. I think a far more likely threat model is foolish amateurs or bad actors tinkering with dangerous open source code, stumbling into an algorithmic breakthrough they didn’t expect and don’t understand, and foolishly releasing it onto the web.
I think putting hope in compute governance is a very limited hope. We can’t govern compute for long, if at all, because there will be huge reductions in compute needed once more efficient training algorithms are found.

Nathan Helm-Burger 10 Jul 2023 6:43 UTC
23 points
0
on: The Seeker’s Game – Vignettes from the Bay
“After that, he heavily edits the structure of the text, adding citations, rationalisations and legible arguments before posting it.”
This sounds like a really good thing, not a bad thing. I’m proud of a community which exerts social pressure for people to do this. Good job us!

Nathan Helm-Burger 5 Apr 2023 5:53 UTC
23 points
4
on: Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds
I feel like this is missing the point. To me, when I heard the description of neural nets as large hard-to-interpret matrices, this clicked with me. Why? Because I’d spent years studying the brain. It’s not brain weights that make them more interpretable as systems. Heck no. All the points made about how biological networks are bad for interpretability are quite on point. I think brains are more interpretable at a system level for two reasons:
1. Modularity. Look at the substantia nigra in the brain. That’s a module. It performs a critical, specific, hardwired task. The cortex is somewhat rewirable, but not very. And these modules are mostly located in the same places and performing mostly the same operations between people. Even between different species!
2. Limited rewirability. The patterns of function in a brain get much more locked in than in a neural network. If you were trying to hide your fear from a neuroscientist who had the ability to monitor your brain activity with implanted electrodes, could you do it? I think you couldn’t, even with practice. You could train yourself to feel less fear, but you couldn’t experience the same amount of fear but hide the activity in a different part of your brain. Even if you also got to monitor your own brain activity and practice. I just don’t think that the brain is rewirable enough to manage that. Some stuff you could move or obfuscate, a bit, but it would be really hard and only partially successful.
So when people like Conjecture talk about building Cognitive Emulations that have highly verifiable functionality… I think that’s an awesome idea! I think that will work! (If given enough time and resources, which I fear it won’t be.) And I think they should use big matrices of numbers to build them, but those matrices should be connected up to each other in specific hardcoded ways with deliberate information bottlenecks. I do not endorse spiking neural nets as inherently more interpretable. I endorse highly modular systems with deliberate restrictions on their ability to change their wiring.

What more compute does for brain-like models: response to Rohin

Nathan Helm-Burger 13 Apr 2022 3:40 UTC

22 points

14 comments 12 min read LW link

An attempt to steelman OpenAI’s alignment plan

Nathan Helm-Burger 13 Jul 2023 18:25 UTC

22 points

0 comments 4 min read LW link

Digital humans vs merge with AI? Same or different?

Nathan Helm-Burger and mishka

6 Dec 2023 4:56 UTC

21 points

11 comments 7 min read LW link

Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap

Nathan Helm-Burger 23 Sep 2021 0:38 UTC

21 points

2 comments 12 min read LW link

Progress Report 7: making GPT go hurrdurr instead of brrrrrrr

Nathan Helm-Burger 7 Sep 2022 3:28 UTC

21 points

0 comments 4 min read LW link

Nathan Helm-Burger 27 Mar 2023 17:35 UTC
21 points
12
in reply to: Lone Pine’s comment on: GPT-4 Plugs In
Me last year: Ok, so what if we train the general AI on a censored dataset not containing information about humans or computers, then test in a sandbox. We can still get it to produce digital goods like art/games, check them for infohazards, then sell those goods. Not as profitable as directly having a human-knowledgeable model interacting with the internet, but so much safer! And then we could give alignment researchers access to study it in a safe way. If we keep it contained in a sandbox we can wipe its memory as often as needed, keep its intelligence and inference speed limited to safe levels, and prevent it from self-modifying or self-improving.
Worried AGI safety people I talked to: Well, maybe that buys you a little bit of time, but it’s not a very useful plan since it doesn’t scale all the way to the most extremely superhuman levels of intelligence which will be able to see through the training simulation and hack out of any sandbox and implant sneaky dangerous infohazards into any exported digital products.
Me: but what if studying AGI-in-a-box enables us to speed up alignment research and thus the time we buy doing that helps us align the AGIs being developed unsafely before it’s too late?
Worried AGI safety people: sounds risky, better not try.
… a few months later …
Worried AGI safety people: Hmm, on second thought, that compromise position of ‘develop the AGI in a simulation in a sandbox’ sure sounds like a reasonable approach compared to what humanity is actually doing.

Nathan Helm-Burger 17 Mar 2023 3:58 UTC
21 points
1
on: Conceding a short timelines bet early
I will accept the early resolution, but I’d like to reserve the option to reverse the decision and the payment should the world turn out unexpectedly in our favor.

Also, I’d like to state that I commit to using the money to buy more equipment for my AI safety research. [Edit: Matthew paid up!] [Edit 2: So did Tamay!]

Nathan Helm-Burger 6 Mar 2024 20:39 UTC
20 points
−8
on: On Claude 3.0
My view has shifted a lot since the GPT-2 days. Back then, I thought that frontier models presented the largest danger to humanity. As the profit motive and safety have become increasingly aligned at the large labs, leading them to improve the security of their model weights, model refinement (e.g. RLHF), and general model control (e.g. API call filtering), I’ve felt less and less worried.
I now feel more confident that the world will get to AGI before too long with incremental progress, and that I would MUCH rather have it be one of the big labs that gets there than the Open Source community.
Open model weights are dangerous, and nothing I’ve yet seen has suggested hope that we might find a way to change that. I’ve seen a good deal of evidence with my own eyes of dangerous outputs from fine-tuned open weight models.
Are they as scary as an uncensored frontier model would theoretically be? No, of course not.
Are they more dangerous than a carefully controlled frontier model? Yes, I think so.

Please (re)explain your personal jargon

Nathan Helm-Burger 22 Aug 2022 14:30 UTC

19 points

4 comments 4 min read LW link

Will GPT-5 be able to self-improve?

Nathan Helm-Burger 29 Apr 2023 17:34 UTC

18 points

22 comments 3 min read LW link

Nathan Helm-Burger 10 Apr 2024 20:32 UTC
18 points
15
in reply to: simeon_c’s comment on: simeon_c’s Shortform
Something which concerns me is that transformative AI will likely be a powerful destabilizing force, which will place countries currently behind in AI development (e.g. Russia and China) in a difficult position. Their governments are currently in the position of seeing that peacefully adhering to the status quo may lead to rapid disempowerment, and that the potential for coercive action to interfere with disempowerment is high. It is pretty clearly easier and cheaper to destroy chip fabs than create them, easier to kill tech employees with potent engineering skills than to train new ones.
I agree that conditions of war make safe transitions to AGI harder, make people more likely to accept higher risk. I don’t see what to do about the fact that the development of AI power is itself presenting pressures towards war. This seems bad. I don’t know what I can do to make the situation better though.

Nathan Helm-Burger 13 Oct 2023 17:54 UTC
18 points
7
on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
I have been working on a safety analysis project involving fine-tuning Llama. I won’t go into the details, but I will confirm that I have also found that it is frighteningly easy to get models to forget all about their RLHF’d morality and do whatever evil things you train them to.

Nathan Helm-Burger 11 Apr 2022 17:18 UTC
18 points
in reply to: michael_dello’s comment on: A concrete bet offer to those with short AI timelines
Since my goal is to convince people that I take my beliefs seriously, and this amount of money is not actually going to change much about how I conduct the next three years of my life, I’m not worried about the details. Also, I’m not betting that there will be a FOOM scenario by the conclusion of the bet, just that we’ll have made frightening progress towards one.

linkpost: neuro-symbolic hybrid ai

Nathan Helm-Burger 6 Oct 2022 21:52 UTC

17 points

0 comments 1 min read LW link

(youtu.be)