Seth Herd

Karma: 6,862

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I’m applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.

If you’re new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.

Important posts:

On the strategic overview of AGI risk:
- TBA, next post
- If we solve alignment, do we die anyway?
  - Risks of human-controlled AGI
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On technical alignment of LLM-based AGI agents:
- Seven sources of goals in LLM agents: brief problem statement
- System 2 Alignment: on how developers will try to align LLM agent AGI
On LLM-based agents as a route to takeover-capable AGI:
- Brief argument for short timelines being plausible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
On AGI alignment more broadly:
- Instruction-following AGI is easier and more likely than value aligned AGI
- Goals selected from learned knowledge: an alternative to RL alignment
On communicating AGI risks:
- Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
- AI scares and changing public beliefs

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we’re not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we’ll have smarter-than-human AIs soon. So we’d better get ready. If their goals don’t align well enough with ours, they’ll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we’re most likely to develop first.

That doesn’t mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won’t get many tries. If it were up to me I’d Shut It All Down, but I don’t see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they’re more autonomous and competent than humans. We’d use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I’ve focused on the interactions between different brain neural networks that are needed to explain complex thought. Here’s a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I’m incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.

More on approach

The field of AGI alignment is “pre-paradigmatic.” So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can’t afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans’ cognitive capacities—a “real” artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can’t misunderstand or re-interpret (value alignment mis-specification), we’ll continue with the alignment target developers currently use: instruction-following. It’s counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there’s no logical reason this can’t be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won’t “think” in English. Thus far, I haven’t been able to attract enough careful critique of my ideas to know if this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they’d have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range: our long-term survival as a species is too complex to call. That’s despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · May 15, 2024, 7:38 PM · 80 points · 28 comments · 12 min read

Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd · Jan 15, 2024, 9:52 PM · 42 points · 18 comments · 7 min read

After Alignment — Dialogue between RogerDearnaley and Seth Herd

RogerDearnaley and Seth Herd · Dec 2, 2023, 6:03 AM · 15 points · 2 comments · 25 min read

Corrigibility or DWIM is an attractive primary goal for AGI

Seth Herd · Nov 25, 2023, 7:37 PM · 19 points · 4 comments · 1 min read

Sapience, understanding, and “AGI”

Seth Herd · Nov 24, 2023, 3:13 PM · 15 points · 3 comments · 6 min read

Altman returns as OpenAI CEO with new board

Seth Herd · Nov 22, 2023, 4:04 PM · 6 points · 3 comments · 1 min read

OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns

Seth Herd · Nov 20, 2023, 2:20 PM · 52 points · 28 comments · 1 min read · (www.wired.com)

We have promising alignment plans with low taxes

Seth Herd · Nov 10, 2023, 6:51 PM · 44 points · 9 comments · 5 min read

Seth Herd’s Shortform

Seth Herd · Nov 10, 2023, 6:52 AM · 6 points · 59 comments

Shane Legg interview on alignment

Seth Herd · Oct 28, 2023, 7:28 PM · 66 points · 20 comments · 2 min read · (www.youtube.com)

The (partial) fallacy of dumb superintelligence

Seth Herd · Oct 18, 2023, 9:25 PM · 38 points · 5 comments · 4 min read

Steering subsystems: capabilities, agency, and alignment

Seth Herd · Sep 29, 2023, 1:45 PM · 31 points · 0 comments · 8 min read

AGI isn’t just a technology

Seth Herd · Sep 1, 2023, 2:35 PM · 18 points · 12 comments · 2 min read

Internal independent review for language model agent alignment

Seth Herd · Jul 7, 2023, 6:54 AM · 55 points · 30 comments · 11 min read

Simpler explanations of AGI risk

Seth Herd · May 14, 2023, 1:29 AM · 8 points · 9 comments · 3 min read

A simple presentation of AI risk arguments

Seth Herd · Apr 26, 2023, 2:19 AM · 19 points · 0 comments · 2 min read

Capabilities and alignment of LLM cognitive architectures

Seth Herd · Apr 18, 2023, 4:29 PM · 88 points · 18 comments · 20 min read

Agentized LLMs will change the alignment landscape

Seth Herd · Apr 9, 2023, 2:29 AM · 160 points · 102 comments · 3 min read · 1 review

AI scares and changing public beliefs

Seth Herd · Apr 6, 2023, 6:51 PM · 46 points · 21 comments · 6 min read

The alignment stability problem

Seth Herd · Mar 26, 2023, 2:10 AM · 35 points · 15 comments · 4 min read