On a related note, I recently had the thought, “Wow, I feel like the quality of TurnTrout’s writing/thinking has noticeably improved over the course of 2022. Nice.” So there’s at least one independent outside observer noticing effects related to the internal changes you discuss here.
Related: just for your amusement, here’s a link to a bet about AI timelines that I won, but which I incorrectly believed I would not win before the end of 2022. In other words, evidence of me being surprised by the high rate of AI progress… Interesting, eh? https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b#pzSuEYIhRiXoIFSjPQz2
Exciting thoughts here! One initial thought I have is that broadness might be visualizable as a topographic (contour) map over the loss landscape, with some threshold of ‘statistically indistinguishable’ loss defining the contour lines. The final loss would then sit in a ‘basin’ with a measurable area, and that area would give a broadness metric.
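Here’s a minimal sketch of the kind of metric I’m imagining, on a toy loss function; the loss, the ‘indistinguishable’ threshold, and the sampled radii below are all placeholders I made up for illustration, not anything principled:

```python
# Rough sketch of a "basin broadness" estimate: sample random perturbations
# around a trained parameter vector and measure how large a neighborhood
# stays within a "statistically indistinguishable" loss threshold.
import numpy as np

def loss(theta):
    # Toy stand-in for a real model's loss landscape.
    return np.sum(theta ** 2) + 0.1 * np.sum(np.sin(3 * theta) ** 2)

def basin_broadness(theta_star, threshold, radii, n_samples=1000, seed=0):
    """For each radius, estimate the fraction of random perturbations whose
    loss stays within `threshold` of the loss at theta_star. The largest
    radius where that fraction stays near 1 is one crude 'broadness' score."""
    rng = np.random.default_rng(seed)
    base = loss(theta_star)
    scores = {}
    for r in radii:
        directions = rng.normal(size=(n_samples, theta_star.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        perturbed = theta_star + r * directions
        losses = np.array([loss(p) for p in perturbed])
        scores[r] = np.mean(losses - base < threshold)
    return scores

theta_star = np.zeros(10)  # pretend this is the trained minimum
print(basin_broadness(theta_star, threshold=0.05, radii=[0.01, 0.1, 0.5, 1.0]))
```

Sweeping the threshold would then give you the contour-map picture, with each threshold defining one contour line around the basin.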
Addendum: I do believe that there are potentially excellent synergies between various strategies. While I think the convert-nn-to-labelled-bayes-net strategy might be worth just 5/1000 on its own, it might combine multiplicatively with several other strategies, each worth a similar amount alone. So if you do have an idea for how to accomplish this conversion strategy, please don’t let this discussion deter you from posting that.
As often happens when Paul C and Nate or Eliezer debate a concept, I find myself believing that the most accurate description of the world lies somewhere between their two viewpoints. The path described by Paul is, in my mind, something like the average or modal path. And I think that’s good, because there’s some hope that more resources will get funneled into alignment research and that sufficient progress will be made there that things aren’t doomed when we hit the critical threshold. I still think there’s some non-trivial chance of a foom-ier path. It seems plausible for a safety-oblivious or malicious researcher to deliberately set up an iteratively self-improving ML system, and such a system might reach its own critical performance threshold, and foom, sooner than the ‘normal’ path would. I don’t have any solution in mind for that scenario other than to hope that the AI governance folks can convince governments to crack down on that sort of experimentation.
Progress Report 6: get the tool working
I largely agree with all these points; my minor points of disagreement are insufficient to change the overall conclusions. One point I feel should be emphasized more is that our best hope for saving humanity lies in maximizing the non-linearly-intelligence-weighted researcher hours invested in AGI safety research before the advent of the first dangerously powerful unaligned AGI. To maximize this key metric, we need to get more and smarter people doing this research, and we need to slow down AGI capabilities research. Insofar as AI governance is a tactic worth pursuing, it must pursue one or both of these specific aims. Once a dangerously powerful unaligned AGI has been launched, it’s too late for politics or social movements or anything slower than, perhaps, decisive military action prepped ahead of time (e.g. the secret AGI-prevention department hitting the detonation switch for all the secret prepared explosives in all the world’s data centers).
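To make that metric slightly more concrete, the quantity I have in mind is roughly (my own loose formalization, nothing standard): $R = \sum_i w(c_i)\, h_i$, where $h_i$ is the hours researcher $i$ puts into safety research before the first dangerously powerful unaligned AGI arrives, $c_i$ is their research capability, and $w$ is convex, so the weighting is superlinear in capability. That’s what pushes toward recruiting more (and especially smarter) researchers and toward pushing the arrival date back.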
I agree, but I personally suspect that neuralink+ is way more research hours & dollars away than unaligned dangerously powerful AGI. Not sure how to switch society over to the safer path.
I feel like there is a valid point here about how one aspect of interpretability is “Can the model report low-confidence (or no confidence) vs high-confidence appropriately?”
My intuition is that this failure mode is a bit more likely-by-default in a deep neural net than in a hand-crafted logic model. That doesn’t seem like an insurmountable challenge, but certainly something we should keep in mind.
Overall, this article and the discussion in the comments seem to boil down to: “yeah, deep neural nets are (complexity held constant) probably not a lot harder, just somewhat harder, to interpret than big Bayes net blobs.”
I think this is probably true, but it misses a critical point. The critical point is that the expansion of compute hardware and the improvement of machine learning algorithms have allowed us to generate deep neural nets that can make useful decisions in the world, but which also carry a HUGE amount of complexity.
The value of what John Wentworth is saying here, in my eyes, is that we wouldn’t have solved the interpretability problem even if we could magically transform our deep neural net into a nicely labelled billion-node Bayes net, even if every node had an accompanying plain-text description a few paragraphs long which allowed us to translate the values of that particular node fairly closely into real-world observations (i.e. it was well symbol-grounded). We’d still be overwhelmed by the complexity. Would it be ‘more’ interpretable? I’d say yes, so I’d disagree with the strong claim of ‘exactly as interpretable, complexity held constant’. Would it be enough more interpretable that it would make sense to blindly trust this enormous flowchart with critical decisions involving the fate of humanity? I’d say no.
So there are several different valid aspects of interpretability being discussed across the comments here:
Alex Khripin’s discussion of robustness (perhaps paraphrasable as ‘trustworthy outputs over all possible inputs, no matter how far out-of-training-distribution’?)
Ash Gray’s discussion of symbol grounding. I think it’s valid to say that there’s an implication that a hand-crafted or well-generated Bayes net will be reasonably well symbol-grounded. If it weren’t, I’d say it was poor quality. A deep neural net doesn’t give you this by default, but it isn’t implausible to generate that symbol grounding. That is additional work that needs to be done, though, and an additional potential point of failure. So, addressable? Probably yes, but...
DragonGod and John Wentworth discussing “complexity held the same, is the Bayes net / decision flowchart a bit more interpretable?” I’d say probably yes, but...
Steven Byrnes’ point that, challenge-level of the task held constant, a slightly less complex (fewer parameters/nodes) Bayes net could probably accomplish an equivalent quality of result. I’d say probably yes, but...
And the big ‘but’ here is that mind-bogglingly huge amount of complexity: the remaining interpretability gap between models simple enough to wrap our heads around and SOTA models well beyond our comprehension threshold. I don’t think we are anywhere near understanding these very large models well enough to trust them on s-risk (much less x-risk) level issues even on-distribution, much less to declare them ‘robust’ enough for off-distribution use. Which is a significant problem, since the big problems humanity faces tend to be inherently off-distribution: they’re about planning actions for the future, and the future is at least potentially off-distribution.
I think if we had 1000 abstract units of ‘interpretability gap’ to close before it was safe to proceed with using big models for critical decisions, my guess is that transforming the deep neural net into a fully labelled, well symbol-grounded, slightly (10%? 20%?) less complex, slightly more interpretable Bayes net would get us something like 1-5 units closer. The ‘hard assertion’ of John Wentworth’s original article (which, based on his responses to comments, I don’t think is what he intends) would say 0 units closer. The soft assertion, which I think John Wentworth would endorse and which I agree with, is something more like ‘that change alone would make only a trivial difference, even if implemented perfectly’.
Some more relevant discussion of abstraction in ML here: https://www.deepmind.com/publications/abstraction-for-deep-reinforcement-learning
Here’s some work by folks at DeepMind looking at models’ relational understanding (verbs) vs. subjects and objects. It’s kind of relevant to the type of misunderstanding CLIP tends to exhibit. https://www.deepmind.com/publications/probing-image-language-transformers-for-verb-understanding
Dang, I’ve been missing out on juicy Gwern comments! I’d better follow them on Reddit...
Thanks for this! I’ve had similar things on my mind and haven’t had a good way to communicate them to the people I’m talking with. I think this cluster of ideas around ‘autonomy’ is pointing at an important point, and one which I’m quite glad isn’t being actively explored on the forefront of ML research, actually. I do think that this is a critical piece of AGI, and that deliberate attempts at this under-explored topic would probably turn up some low-hanging fruit. I also think it would be bad if we ‘stumbled’ onto such an agent without realizing we’d done so.
I feel like ‘autonomy’ is a decent but not quite right name. Exploring for succinct ways to describe the cluster of ideas, my first thoughts are ‘temporally coherent consequentialism in pursuit of persistent goals’, or ‘long-term goal pursuit across varying tasks with online-learning of meta-strategies’?
Anyway, I think this is exactly what we, as a society, shouldn’t pursue until we’ve got a much better handle on AI alignment.
Thanks Rohin. I also feel that interviewing after my 3 more months of independent work is probably the correct call.
This is great! I agree with a lot of what you’re saying here and am glad someone is writing these ideas up. Two points of possible disagreement (or misunderstanding on my part perhaps) are:
Highly competitive, pragmatic, no-nonsense culture
I think that competitiveness can certainly be helpful, but too much can be detrimental. Specifically, I think competitiveness needs to go hand-in-hand with cooperation and transparency. Work needs to be shared, and projects need to encompass groups of people. Competing for the most ‘effective research points’ among your colleagues while you work together on a group project and communicate clearly and effectively with each other: great. Hiding your work, waiting until you have something unique and impressive before sharing anything at all, and then reporting only the minimal amount of data needed to prove you did the impressive thing instead of sharing all the details and dead-ends you encountered along the way: not good.
Long-run research track records are necessary for success
I’m not sure how ‘long term’ you mean, but I think we do need a lot of new people coming into the field, and that we don’t have multiple decades for reputations to be gradually established. In particular, I think a failing that academia is vulnerable to is outdated paradigms getting stuck in power until the high-reputation, long-established professors finally retire and no longer reject the new paradigm’s viewpoints in reviews and grant requests.
Coming from academia to industry was in a lot of ways a breath of fresh air for me, because I was working with a team that actually wanted the project (making money for our company) to succeed, rather than with individual professors who wanted their names on papers in fancy journals. That is near, but not quite matching, what the goal should be: ‘we want as much accurate knowledge about this science topic to be known as soon as possible by everyone, whether or not we get credit for it.’
Anyway, some thought as to how to get the best of both worlds (industry and academia) may be worthwhile.
I’m potentially interested in the Research Engineer position on the Alignment Team, but I’m currently 3 months into a 6-month grant from the LTFF to reorient my career from general machine learning to AI safety specifically. My current plan is to keep doing solo work until the last month of my grant period and then begin applying to AI safety work at places like Anthropic, Redwood Research, OpenAI, and DeepMind.
Do you think there’s a significant advantage to applying soon vs 3 months from now?
Yeah!
[Question] How to balance between process and outcome?
Seems to me that what CLIP needs is a secondary training regime, where the image is a 2D render of a 3D scene produced by a simulator which can also generate one correct caption and several incorrect ones. Like: red vase on a brown table (correct), blue vase on a brown table (incorrect), red vase on a green table (incorrect), red vase under a brown table (incorrect). Then do the CLIP training with the text set deliberately including these near-miss text samples in addition to the usual random incorrect caption samples. I saw this idea in a paper a few years back; I’m not sure how to find that paper now, since I can’t seem to guess the right keywords to get Google to come up with it. But there’s a lot of work related to this idea out there, for example: https://developer.nvidia.com/blog/sim2sg-generating-sim-to-real-scene-graphs-for-transfer-learning/
Do you think that would fix CLIP’s not-precisely-the-right-object-in-not-precisely-the-right-positional-relationship problem? Maybe also, if the simulated data contained labelled text, it would fix the incoherent-text problem too?
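Here’s a minimal sketch of the training setup I have in mind; the toy scene generator, the caption templates, and the stand-in encoders are all placeholders I made up for illustration (a real version would render actual images and use CLIP’s image/text towers):

```python
# Rough sketch of the hard-negative caption idea: a toy "simulator" emits a
# scene description, we build a correct caption plus structured incorrect
# ones (wrong color, wrong object, wrong relation), and compute a contrastive
# loss so the image embedding must pick the correct caption out of its own
# hard negatives, not just out of random other captions.
import random
import torch
import torch.nn.functional as F

COLORS = ["red", "blue", "green", "brown"]
OBJECTS = ["vase", "cup", "bowl"]
RELATIONS = ["on", "under", "beside"]

def sample_scene():
    return {
        "color": random.choice(COLORS),
        "obj": random.choice(OBJECTS),
        "rel": random.choice(RELATIONS),
        "surface_color": random.choice(COLORS),
    }

def caption(scene):
    return f'{scene["color"]} {scene["obj"]} {scene["rel"]} a {scene["surface_color"]} table'

def hard_negatives(scene):
    # One near-miss caption per attribute: wrong color, wrong object, wrong relation.
    negs = []
    for key, options in [("color", COLORS), ("obj", OBJECTS), ("rel", RELATIONS)]:
        wrong = random.choice([o for o in options if o != scene[key]])
        negs.append(caption({**scene, key: wrong}))
    return negs

# Stand-in encoders; a real setup would use CLIP's towers and rendered images.
image_encoder = torch.nn.Linear(16, 32)
text_encoder = torch.nn.Linear(16, 32)

def fake_image_features(scene):   # placeholder for render + preprocess
    torch.manual_seed(hash(caption(scene)) % 10_000)
    return torch.randn(16)

def fake_text_features(text):     # placeholder for tokenization
    torch.manual_seed(hash(text) % 10_000)
    return torch.randn(16)

def loss_for_scene(scene):
    """Cross-entropy over [correct caption] + [structured hard negatives]."""
    img = F.normalize(image_encoder(fake_image_features(scene)), dim=-1)
    texts = [caption(scene)] + hard_negatives(scene)
    txt = F.normalize(text_encoder(torch.stack(
        [fake_text_features(t) for t in texts])), dim=-1)
    logits = img @ txt.T / 0.07          # temperature as in CLIP
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

print(loss_for_scene(sample_scene()))
```

The point is just that each image’s softmax is taken over its own structured hard negatives (wrong color, wrong object, wrong relation) in addition to the usual random captions from the batch.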
I like your comment and think it’s insightful about why/when to wirehead (or not).
Nitpick about your endorsed-skills point: people don’t always have high overlap between what they know and what they wish they knew or endorse others knowing. I’ve had a lifelong obsession with learning, especially with acquiring skills. Unfortunately, my next-thing-to-learn selection is very unguided, so it has been a thematic struggle in my life to stay focused on learning the things I judge to be objectively valuable. I have a huge list of skills/hobbies I think are mostly or entirely impractical or useless (e.g. artistic woodworking, paleontology), and also lots of things I’ve been thinking for years that I ought to learn better (e.g. linear algebra). I’ve been wishing for years that I had a better way to reward myself for studying things I reflectively endorse knowing, rather than wasting time/energy studying unendorsed things. In other words, I’d love a method (like Max Harms’ fictional Zen Helmets) to better align my system 1 motivations with my system 2 motivations. The hard part is figuring out how to implement this change without corrupting the system 2 values or their value-discovery-and-updating processes.