Seth Herd

Karma: 8,530

I studied complex human thought using neural network models of brain function for about 23 years. Now I’m working on alignment for AI as it is made more able to “think for itself.” Below is an index to my work. Message me here, I’ll respond!

I work on technical alignment, but doing that has led me to also work on alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.

Principal articles:

Research Overview:

Alignment is the study of how to align the goals of advanced AI with the goals of humanity, so we’re not in competition with our own creations. This is tricky because we are creating AI by training it, not programming it. So it’s a bit like trying to train a dog to eventually run the world. It might work, but we wouldn’t want to just hope.

Large language models like ChatGPT constitute a breakthrough in AI. We might have AIs more competent than humans in every way, fairly soon. Such AI will outcompete us quickly or slowly. We can’t expect to stay around long unless we carefully build AI so that it cares a lot about our well-being or at least our instructions. See this excellent intro video if you’re not familiar with the alignment problem.

There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we’re most likely to develop first.

That doesn’t mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won’t get many tries. If it were up to me I’d Shut It All Down, but I don’t see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they’re about as autonomous and competent as a human, but then it gets really dicey. We’d use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did research in what we called “computational cognitive neuroscience” from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I’ve focused on the interactions between different brain neural networks that are needed to explain complex thought. Here’s a list of my publications.

Since about 2004 I’ve been concerned with AGI applications of the research, and increasingly reluctant to publish my full theories lest they be used to accelerate AI progress. I’m excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.

More on My Approach

The field of AGI alignment is “pre-paradigmatic.” So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can’t afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans’ cognitive capacities—a “real” artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it’s a definite maybe!

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

I’m trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions. Competition and race dynamics make the problem much harder, and conflicting incentives and group polarization create motivated reasoning that distorts beliefs.

I think it’s fairly likely that alignment isn’t impossibly hard, but also not easy enough that developers will get it right on their own despite all of their biases and incentives. So a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).

One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AI a definition of ethics it can’t misunderstand or re-interpret (value alignment mis-specification), we’ll probably continue with the alignment target developers currently focus on: instruction-following.

It’s counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there’s no logical reason this can’t be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering the AGI to collaborate) while it’s still in our control, if we’ve gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that’s the target we’ll pursue.

I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align, since they won’t “think” in English chains of thought or be easy to scaffold and train for System 2 Alignment backstops. So far, I haven’t been able to attract enough careful critique of my ideas to know whether this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they’d have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range; our long-term survival as a species is too complex to call. That’s despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios.

I think we need to resist motivated reasoning and accept the uncomfortable truth that we collectively don’t understand the alignment problem as we actually face it well enough yet. But we might understand it well enough in time if we work together and strategically.

Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities

Seth Herd · 12 Feb 2026 19:38 UTC · 44 points · 15 comments · 18 min read · LW link

Broadening the training set for alignment

Seth Herd · 5 Jan 2026 17:30 UTC · 40 points · 11 comments · 9 min read · LW link

A country of alien idiots in a datacenter: AI progress and public alarm

Seth Herd · 7 Nov 2025 16:56 UTC · 93 points · 15 comments · 11 min read · LW link

LLM AGI may reason about its goals and discover misalignments by default

Seth Herd · 15 Sep 2025 14:58 UTC · 75 points · 7 comments · 38 min read · LW link

Problems with instruction-following as an alignment target

Seth Herd · 15 May 2025 15:41 UTC · 56 points · 14 comments · 10 min read · LW link

Anthropomorphizing AI might be good, actually

Seth Herd · 1 May 2025 13:50 UTC · 35 points · 6 comments · 3 min read · LW link

LLM AGI will have memory, and memory changes alignment

Seth Herd · 4 Apr 2025 14:59 UTC · 78 points · 15 comments · 9 min read · LW link

Whether governments will control AGI is important and neglected

Seth Herd · 14 Mar 2025 9:48 UTC · 29 points · 2 comments · 9 min read · LW link

[Question] Will LLM agents become the first takeover-capable AGIs?

Seth Herd · 2 Mar 2025 17:15 UTC · 37 points · 10 comments · 1 min read · LW link

OpenAI releases GPT-4.5

Seth Herd · 27 Feb 2025 21:40 UTC · 34 points · 12 comments · 3 min read · LW link
(openai.com)

System 2 Alignment: Deliberation, Review, and Thought Management

Seth Herd · 13 Feb 2025 19:17 UTC · 39 points · 0 comments · 22 min read · LW link

Seven sources of goals in LLM agents

Seth Herd · 8 Feb 2025 21:54 UTC · 23 points · 3 comments · 2 min read · LW link

OpenAI releases deep research agent

Seth Herd · 3 Feb 2025 12:48 UTC · 78 points · 21 comments · 3 min read · LW link
(openai.com)

Yudkowsky on The Trajectory podcast

Seth Herd · 24 Jan 2025 19:52 UTC · 71 points · 39 comments · 2 min read · LW link
(www.youtube.com)

Gratitudes: Rational Thanks Giving

Seth Herd · 29 Nov 2024 3:09 UTC · 29 points · 2 comments · 4 min read · LW link

Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI

Seth Herd · 12 Nov 2024 18:23 UTC · 19 points · 2 comments · 4 min read · LW link

Intent alignment as a stepping-stone to value alignment

Seth Herd · 5 Nov 2024 20:43 UTC · 37 points · 8 comments · 3 min read · LW link

“Real AGI”

Seth Herd · 13 Sep 2024 14:13 UTC · 20 points · 20 comments · 3 min read · LW link

Conflating value alignment and intent alignment is causing confusion

Seth Herd · 5 Sep 2024 16:39 UTC · 50 points · 18 comments · 5 min read · LW link

If we solve alignment, do we die anyway?

Seth Herd · 23 Aug 2024 13:13 UTC · 81 points · 130 comments · 4 min read · LW link