Seth Herd

Karma: 3,407

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making. I’ve focused on the emergent interactions that are needed to explain complex thought.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute. My primary email is seth dot herd at gee mail dot com.

Here’s a brief summary of my alignment work to date, including my current broad take and important publications:

I think that the field of AGI alignment is “pre-paradigmatic”, or in plain English, we don’t know what we’re doing yet. We don’t have anything like a consensus on what problems need to be solved, or how to solve them. So I spend a lot of my time thinking about this, and wish more people had time and interest to think really hard about the strategic picture, in connection with the important details.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels, while still following human instructions. Current LLMs are like humans with zero episodic memory and very little executive function for planning and goal-directed self-control. Adding those capabilities and others might expand LLMs into working cognitive architectures with human-plus abilities in all relevant areas. I increasingly suspect we should be actively working in this direction as our best hope of survival, since we won’t convince the whole world to pause AGI efforts.

My work roughly asks: can the strengths of LLMs (with respect to understanding values and following directions) be leveraged into working AGI alignment?

My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.

Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief: we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by making the central goal following instructions. This still leaves the huge problem of a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.