I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making. I’ve focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
Yep, that’s an alarmingly high base rate, so multiplying that by ten is an enormous added risk. So even if the concentration and effect is far lower than in alcoholics, I’d still probably not take that risk.
Possibly even without ALDH deficiency.
It sounds to me like the author isn’t thinking about near-future scenarios, just existing AI.
Making a machine autopoietic is straightforward if it’s got the right sort of intelligence. We haven’t yet made a machine with the right sort of intelligence to do it, but there are good reasons to think we’re close. AutoGPT and similar agents can roughly functionally understand a core instruction like “maintain, improve, and perpetuate your code base”; they’re just not quite smart enough to do it effectively. Yet. So engaging with the arguments for what remains between here and there is the critical bit. Maybe it’s around the corner, maybe it’s decades away. It comes down to the specifics. The general argument “Turing machines can’t host autopoietic agents” is obviously wrong.
I’m not sure whether the author makes this argument, but your summary sounds like they do.
I certainly agree that I’d hold off until I knew the answers to a bunch more questions.
This all seems to rest on the relative increase in oral and esophageal cancer. 10x sounds like an awful lot. But in terms of decision-making, the absolute increase, not the ratio, is the bottom line. So: what are the absolute likelihoods? If they’re both minuscule, this might not be a deciding factor. Increasing my cancer risk by one in a million might be a good trade for immunity to cavities and gum disease.
If you throw in immunity to bad breath, I’d take that deal. I wonder how big a factor alcohol vs. lactic acid is in bad breath.
I think it’s also worth considering how much ethanol these bacteria excrete into the mouth relative to how much is in the mouth of a heavy drinker. I’m sure frequency vs. persistence is also a factor, but I’m not sure how.
On the other hand, if those numbers are much higher, it’s possible that even those without ALDH deficiency shouldn’t take the treatment.
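To make the decision-relevant arithmetic concrete, here’s a trivial sanity check. Both baseline figures below are placeholders I made up for illustration, not numbers from any study:

```python
# Back-of-the-envelope: absolute vs. relative risk.
# Both baseline numbers are hypothetical placeholders, not real epidemiology.

relative_risk = 10.0  # the 10x figure for oral/esophageal cancer

for baseline in (0.01, 1e-6):  # hypothetical 1% vs. one-in-a-million lifetime baselines
    added_absolute_risk = baseline * relative_risk - baseline
    print(f"baseline {baseline:.6%} -> added absolute risk {added_absolute_risk:.6%}")

# With a 1% baseline, a true 10x multiplier adds ~9 percentage points: clearly a bad trade.
# With a one-in-a-million baseline, it adds ~0.0009%, which might well be worth immunity
# to cavities and gum disease. The baseline is what decides it.
```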
Separately from persistence of the grid: humanoid robots are damned near ready to go now. Recent progress is startling. And if the AGI can do some of the motor control, existing robots are adequate to bootstrap manufacturing of better robots.
My summary: this is related to The Waluigi Effect (mega-post), but extends the hypothesis: a hostile “Waluigi” simulacrum finds ways to perpetuate itself and gain influence, first over the simulator, then over the real world.
Okay, I came back and read this more fully. I think this is entirely plausible. But I also think it’s mostly irrelevant. Long before someone accidentally runs a smart enough LLM for long enough, with access to enough tools to pose a threat, they’ll deliberately run it as an agent, with a prompt like “you’re a helpful assistant that wants to accomplish [x]; make a plan and execute it, using [this set of APIs] to gather information and take actions as appropriate.”
And long before that, people will use more complex scaffolding to create dangerous language model cognitive architectures out of less capable LLMs.
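To be concrete about what I mean by scaffolding, here’s a minimal sketch of such an agent loop. It assumes nothing about any particular model or API; `call_llm` and the tools are hypothetical stand-ins:

```python
# Minimal sketch of a language model cognitive architecture / scaffolded agent.
# `call_llm` and the tools are hypothetical stand-ins, not any real API.

TOOLS = {
    "search": lambda args: f"(search results for {args!r})",
    "write_file": lambda args: "(file written)",
}

def call_llm(prompt: str) -> str:
    # Stand-in: a real implementation would call an actual model here.
    return "DONE (this stand-in does no real thinking)"

def run_agent(goal: str, max_steps: int = 20) -> str:
    # The scaffold, not the model, holds the goal, the loop, and the tool access.
    history = [
        f"You're a helpful assistant that wants to accomplish: {goal}. "
        f"Make a plan and execute it. Tools: {list(TOOLS)}. "
        "Reply 'TOOL <name> <args>' to act, or 'DONE <answer>' to finish."
    ]
    for _ in range(max_steps):
        reply = call_llm("\n".join(history))
        history.append(reply)
        if reply.startswith("DONE"):
            return reply[len("DONE"):].strip()
        if reply.startswith("TOOL "):
            parts = reply.split(" ", 2)
            name = parts[1]
            args = parts[2] if len(parts) > 2 else ""
            result = TOOLS.get(name, lambda a: "(unknown tool)")(args)
            history.append(f"RESULT: {result}")  # feed the observation back in
    return "(step limit reached)"
```

The point is that even a dumb loop like this turns a pure predictor into something that pursues a goal with tools; the missing piece is capability, not architecture.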
I could be wrong about this, and I invite pushback. Again, I take the possibility you raise seriously.
Sure, but that’s no reason not to try.
I think this is a strong argument against “just do something that feels like it’s working toward liberal democracy”. But not against actually trying to work toward liberal democracy.
I think this is a subset of work on most important problems: time spent figuring out what to work on is surprisingly effective. People don’t do it as much as they should because it’s frustrating and doesn’t feel like it’s working toward a rewarding outcome.
It’s available as a podcast now:
Want to re-add a link?
I guess I don’t get it.
Sure, long after we’re dead from AGI that we deliberately created to plan to achieve goals.
Plagiarism is bad, on LW or anywhere.
Repeating other people’s useful thoughts is good. Pretending you came up with them yourself is bad. Attribution is the difference.
It could be, but that’s clearly not the whole story. Societal standards for childcare have shifted dramatically. That shift could be driven by people having fewer children, and also cause it, in a vicious cycle.
> High-income households have access to the world’s best leisure opportunities, yet they still invest more time in child-rearing than lower-income households.
I doubt they invest more time. They have money to pay for more help with childcare. And I think this is the critical difference.
Time spent on care per child has skyrocketed in recent decades. I think that’s one major factor driving down fertility: having kids is a bigger PITA every year.
Thinking of costs solely in terms of money is a mistake. The time investment is critical.
This is why I’m unconcerned with low fertility if we get AI progress and don’t die from it: AI is going to be great at childcare. Even current LLMs have the cognitive capacity to be good tutors and playmates.
By using the LLM as the central cognitive core in a scaffolded agent or cognitive architecture. Those are just the more obvious routes to leveraging LLM techniques into a smarter and more agentic form.
I agree that current LLMs aren’t capable of recursive self-improvement. Almost no AI worriers think that current LLMs are existentially dangerous. We worry that extensions of them, combined with other AI techniques, might be dangerous soon, so decelerating, or at least creating some sort of brakes, would be the sensible thing to do if we want to improve our odds of survival.
Nice post! The comments section is complex, indicating that even rationalists have a lot of trouble talking about consciousness clearly. This could be taken as evidence for what I take to be one of your central claims: the word consciousness means many things, and different things to different people.
I’ve been fascinated by consciousness since before starting grad school in neuroscience in 1999. Since then, I’ve thought a lot about consciousness, and about what insight neuroscience (not the colored pictures of imaging, but the detailed study of how individual neurons and groups of neurons respond to varied situations) has to say about it.
I think it has a lot to say. There are more detailed explanations available of each of the phenomena you identify as part of the umbrella term “consciousness”.
This gets at the most apt critique of this and similar approaches to denying the existence of a hard problem: “Wait! That didn’t explain the part I’m interested in!” I think this is quite true, and better explanations are quite possible given what we know. I believe I have some, but I’m not sure it’s worth the trouble to even try to explicate them.
Over the past 25 years, I’ve discussed consciousness less and less. It’s so difficult as to create unproductive discussions a lot of the time, and frustrating misunderstandings and arguments a good bit of the time.
Thus, while I’ve wanted to write about it, there’s never been a professional or personal motivating factor.
I wonder if the advent of AGI will create such a factor. If we go through a nontrivial era of parahuman AGI, as I think we will, then I think the question of whether and how they’re conscious might become a consequential one, determining how we treat them.
It could also help determine how seriously we take AGI safety. If the honest answer to “is this proto-AGI conscious?” is “yes, in some of the ways humans are, and not in others”, that encourages the intuition that we should take these things seriously as a potential threat.
So, perhaps it would make sense to start that discussion now, before public debate ramps up?
If that logic doesn’t hold strongly, discussing consciousness seems like a huge time-sink, taking time that would be better spent trying to solve alignment as best we can while we still have a chance.
It’s ironic that your response doesn’t address my comment. That was one of the stated reasons for your limit. This also addresses why Habryka thought explaining it to you further didn’t seem likely to help.
How to best moderate a website such as LW is a deep and difficult question. If you have better ideas, those might be useful. “Just do more, better” is not a useful suggestion.
First a point of agreement: living in a “solved” world would suck. To the extent we live like that, it does suck.
But reducing information isn’t the only way to prevent that. Creating new rich situations is a better solution, I think. And the world has been doing that just fine so far. The modern world isn’t solved. Often, attempts to live as though it is are deeply mistaken, as well as depressing.
If you don’t like having strategy guides for your games, don’t look at them until and unless you really decide you want to. And if you do, you’ll notice that no PvP game is entirely solved by its best theorists. I don’t play PvE, but I suspect the same. The level of play and strategy interact with the meta in complex ways.
I agree that more information has correlated with some bad effects, but I don’t think it’s directly causal, and I don’t think reducing the information would itself make life better or less Molochian, unless you somehow held the material quality of life to a high level. If you could do that, I think you could come up with better, more thorough solutions to making life better, rather than going back to living in more ignorance.
The world as it stands isn’t great. But most of history for most of humanity truly epically sucked.
If you’re taking a wireframe view of the world, you’re looking at it wrong. And a lot of people are. The details matter. Making decisions based on data is only half of how to live in our current state. Feelings and intuitions matter. A lot.
People who run businesses DO give me things for free sometimes. I engage in conversation with them, see them as individuals, and they sometimes respond with generosity (I don’t do this to get free stuff, and I don’t get it a lot; what I get is human interactions with richness and value). It’s true that they can’t give me stuff from carefully regulated businesses, and there are real dangers in having a society that’s capitalistic to the exclusion of valuing happiness, rich textured experience, and beauty.
To some specifics of your argument:
I take your point about those with power having more latitude of choice because they didn’t know the best choices. But fewer people had power. Those with no power were abused by those with a little, because life was hard and necessities were scarce. That is Molochian. The idea that people didn’t go to war because they didn’t know it was a winning strategy seems like it would be a small effect relative to the difference in competition based on necessity. Many more people were forced to go to war in the past than today, because their nondemocratic power structures would use them as soldiers under threat to their lives and families.
WRT the difference between now and twenty or so years ago, the time you’re idealizing (and I lived through), I’m saying let’s see the statistics. I don’t think the world is worse now, and “things were simpler then” isn’t good data. I think they were simpler, which was nice; and they were worse in many ways. (The US happened to have a golden economic age starting in the 50s because it was the only advanced nation whose industrial capacity was enhanced rather than destroyed by WW2, in case that’s what you’re thinking of; the MAGA illusion is based on that historical accident.)
So I don’t buy that things are worse now just because some people like to say they are. I take that to be largely a product of social media spreading negative information better than positive on average. That is a real problem, but the solution isn’t as simple as “just don’t spread information”.
I don’t think the freedom to make more mistakes is making life much better. However, I do agree that making real choices makes people happy, and society needs to support that, and we might not be doing so adequately right now (although there are real choices to make, and you should make them and revel in that freedom). We haven’t solved the meta, not by a long shot! For instance, dating like it’s a job interview isn’t at all how properly informed dating works; that’s some sort of bad local minimum. But asking some important questions as dealbreakers can spare you a lifetime of slow heartbreak when you discover late that there are fundamental incompatibilities.
So in sum, I don’t think your solution on its own would work. Limiting information solves only a tiny part of the Molochian problem, and on average makes the whole worse. Unless you have a quite different solution for the remainder. And that would be the real solution.
If you’re disillusioned, stop it. Find something new and wonderful and complex to wonder at. Or look deeper at the details for more possibilities in the things you’re disillusioned by.
The world isn’t solved, and we can keep creating rich challenges while we keep developing our information technology. The question is whether anyone with good intentions and good ideas controls the world. Development of AGI is currently central to that, so I suggest focusing on that as the current thing to engage with and wonder at.
Thanks for an interesting conversation! I’d better focus on more immediate concerns, like the above.
In your model of why to assume centrists are less biased: aren’t you assuming that the truth tends to be in the center of the spectrum? If we knew where the truth lay, there would be no point in studying which side is more biased or better rationalists. Right?
To my eye, the world in the past had more of a problem with Moloch, not less. Warlords, serfdom as near-slavery, etc. are the direct results of Molochian competition. The human condition has been getting better over the course of history.
We (at least the middle class) might’ve had a golden age just recently and things might’ve gone downhill since. I don’t know and I don’t think anyone has a good measure of whether things have really gone downhill WRT happiness, quality of life, or Molochian competition. But that’s at an intermediate level of information transmission. The remainder of earlier history had much less information transmission, and Molochian competition was way worse.
It is not “fine to get into arguments”. The FAQ definitely lays out goals of having interactions here be civil and collaborative.
Unfortunately, becoming less wrong (reaching the truth) benefits hugely from not getting into arguments.
If you tell a human being (even a rationalist) something true, with good evidence or arguments, but you do it in an aggressive or otherwise irritating way, they may very well become less likely to believe that true thing, because they associate it with you and want to fight both you and the idea.
This is true of you and other rationalists as well as everyone else.
This is not to argue that the bans might not be overdoing it; it’s trying to do what Habryka doesn’t have time to do: explain to you why you’re getting downvoted even when you’re making sense.
Thanks! A joke explained will never get a laugh, but I did somehow get a cackling laugh from your explanation of the joke.
I think I didn’t get it because I don’t think the trend line breaks. If you made a good enough noise reducer, it might well develop smart and distinct enough simulations that one would gain control of the simulator, and potentially from there the world. See “A smart enough LLM might be deadly simply if you run it for long enough” if you want to hurt your head on this.
I’ve thought about it a little because it’s interesting, but not a lot because I think we probably are killed by agents we made deliberately long before we’re killed by accidentally emerging ones.