My primary email is seth dot herd at gee mail dot com (or message me here where I won’t miss it amongst spam).
In short, I’m applying my conclusions from 23 years of research in the computational cognitive neuroscience of complex human thought to the study of AI alignment.
I’m exhilarated and a bit frightened by it.
Research overview:
Alignment is the study of how we can make sure our AIs’ goals are aligned with humanity’s goals. So far, AIs haven’t really had goals, nor been smart enough to be worth worrying about, so this can sound like paranoia or science fiction. But recent breakthroughs in AI make it quite possible that we’ll have genuinely smarter-than-human AIs, with their own goals, sooner than we’re ready for them. If their goals don’t align well enough with ours, they’ll probably outsmart us and get their way, possibly much to our chagrin. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we’re most likely to develop first.
That doesn’t mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won’t get many tries. If it were up to me I’d Shut It All Down, but I don’t see how we could actually accomplish that for all of humanity. So I focus on finding alignment solutions.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they’re more autonomous and competent than humans. We’d use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving them a central goal of following instructions. This still leaves the huge problem of a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
Bio
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I’ve focused on the emergent interactions that are needed to explain complex thought. Here’s a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my best theories. I’m incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
More on approach
I think that the field of AGI alignment is “pre-paradigmatic”: we don’t know what we’re doing yet. We don’t have anything like a consensus on what problems need to be solved, or how to solve them. So I spend a lot of my time thinking about this, in relation to specific problems and approaches. Solving the wrong problems seems like a waste of time we can’t afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with zero episodic memory and very little executive function for planning and goal-directed self-control. Adding those capabilities and others might expand LLMs into working cognitive architectures with human-plus abilities in all relevant areas. My work since then has convinced me that we could probably also align such AGI/ASI to keep following human instructions, by putting such a goal at the center of their decision-making process and therefore their “psychology”, and then using the aligned proto-AGI as a collaborator in keeping it aligned as it grows smarter.
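As a toy illustration of what I mean by a language model cognitive architecture (not a real system; every name below is a placeholder, and the llm() stub stands in for any actual model call), the basic loop might look something like this: a plain LLM call wrapped with an episodic memory store and an executive planning step, with human approval of each step serving as the central instruction-following gate.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Toy episodic memory: a list of (step, result) records."""
    episodes: list = field(default_factory=list)

    def store(self, step: str, result: str) -> None:
        self.episodes.append((step, result))

    def recall(self, query: str, k: int = 3) -> list:
        # Toy relevance: most recent episodes sharing any word with the query.
        words = set(query.lower().split())
        hits = [ep for ep in reversed(self.episodes)
                if words & set(ep[0].lower().split())]
        return hits[:k]


def llm(prompt: str) -> str:
    # Placeholder standing in for a real language model call.
    return f"[model output for: {prompt[:60]}...]"


def run_agent(instruction: str, max_steps: int = 3) -> None:
    memory = EpisodicMemory()
    for step_num in range(max_steps):
        context = memory.recall(instruction)
        # Executive step: plan the next action given the instruction and recalled episodes.
        plan = llm(f"Instruction: {instruction}\nRelevant memory: {context}\n"
                   "Propose the next step.")
        # Central gate: defer to the human before acting (corrigibility as instruction-following).
        if input(f"Step {step_num}: {plan}\nApprove? [y/n] ").strip().lower() != "y":
            print("Stopping per human feedback.")
            return
        result = llm(f"Carry out: {plan}")
        memory.store(plan, result)
    print(f"Done; {len(memory.episodes)} episodes stored.")


if __name__ == "__main__":
    run_agent("Summarize today's alignment reading list")
```

The point of the sketch is just the shape: the memory and executive pieces stand in for the capabilities current LLMs lack, and the approval gate is where an instruction-following goal would sit at the center of the loop.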
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won’t “think” in English. Thus far, I haven’t been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they’d have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), the odds that we don’t survive long-term as a species, is in the 50% range: too complex to call. That’s despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is probably far overstating their certainty.
After reading all the comment threads, I think there’s some framing that hasn’t been analyzed adequately:
Why would humans be testing AGIs this way if they have the resources to create a simulation that will fool a superintelligence?
Also, the risk of humanity being wiped out seems different and worse while that ASI is attempting a takeover; during that time the humans are probably an actual threat.
Finally, leaving humans around would seem to pose a nontrivial risk that they’ll eventually spawn a new ASI that could threaten the original.
The Dyson sphere is just a tiny part of the universe, so using that as the fractional cost seems wrong. Other considerations in both directions would seem to dominate it.