Seth Herd

Research overview:

If you don’t already know the arguments for why aligning AGI is probably the most important and pressing question of our time, please see this excellent intro. There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that are readily implementable for the types of AGI we’re most likely to develop first.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they’re more autonomous and competent than humans. We’d use a stacked suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by making following instructions the central goal. This still leaves the huge problem of a multipolar scenario with multiple humans in charge of ASIs, but that problem might be navigated, too.

Bio

I did computational cognitive neuroscience research from completing my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I’ve focused on the emergent interactions needed to explain complex thought. Here’s a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my best theories. I’m incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute. My primary email is seth dot herd at gee mail dot com.

More on approach

I think that the field of AGI alignment is “pre-paradigmatic”: we don’t know what we’re doing yet. We don’t have anything like a consensus on what problems need to be solved, or how to solve them. So I spend a lot of my time thinking about this, in relation to specific problems and approaches. Solving the wrong problems seems like a waste of time we can’t afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with zero episodic memory and very little executive function for planning and goal-directed self-control. Adding those capabilities and others might expand LLMs into working cognitive architectures with human-plus abilities in all relevant areas. My work since then has convinced me that we could probably also align such AGI/ASI to keep following human instructions, by putting such a goal at the center of their decision-making process and therefore their “psychology”, and then using the aligned proto-AGI as a collaborator in keeping it aligned as it grows smarter.
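As a very rough illustration of what I mean by a language model cognitive architecture, here is a minimal sketch of an agent loop that wraps an LLM with an episodic memory store and a simple planning step, with following the operator’s instructions as the central goal. The `call_llm` helper, the keyword-overlap retrieval, and the component names are illustrative stand-ins of my own, not a description of any particular system.

```python
# Illustrative sketch: an LLM wrapped with episodic memory and a planning
# step, where the only goal handed to the planner is "follow the instruction".
# `call_llm` is a hypothetical stand-in for any chat-model API.

from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model API."""
    return "..."


@dataclass
class EpisodicMemory:
    """Stores past episodes and retrieves the ones most relevant to a query."""
    episodes: list[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword overlap; a real system would use embeddings.
        scored = sorted(
            self.episodes,
            key=lambda ep: len(set(ep.lower().split()) & set(query.lower().split())),
            reverse=True,
        )
        return scored[:k]


@dataclass
class Agent:
    memory: EpisodicMemory = field(default_factory=EpisodicMemory)

    def step(self, instruction: str) -> str:
        context = "\n".join(self.memory.retrieve(instruction))
        # Executive-function-like step: decompose the task before acting.
        plan = call_llm(
            f"Instruction: {instruction}\nRelevant memories:\n{context}\n"
            "Write a short plan whose only goal is to follow the instruction."
        )
        result = call_llm(f"Carry out this plan:\n{plan}")
        # Record the episode so later steps can build on it.
        self.memory.store(f"Instruction: {instruction}\nPlan: {plan}\nResult: {result}")
        return result
```

The point of the sketch is only that memory, planning, and the instruction-following goal are separable pieces layered on top of the model; whether any particular loop like this scales to AGI is exactly the open question.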

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won’t “think” in English. Thus far, I haven’t been able to elicit enough careful critique of my ideas to know whether this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.