I did computational cognitive neuroscience from getting my PhD in 2006 until the end of 2022. I’ve worked on a bunch of brain systems, focusing on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
Me avoiding heroin isn’t “not governed by the critic,” instead what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
I think we’re largely in agreement on this. The actor system is controlling a lot of our behavior. But it’s doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.
However, I also want to claim that the critic system is directly in charge when we’re using model-based thinking: when we come up with a predicted outcome before acting, the critic supplies the estimate of how good that outcome is. But I’m not even sure this is a crux. The critic is still in charge in a pretty important way.
If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
I think that information would be found in both the actor and the critic, though not to exactly the same degree; the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn’t even bring into the article), and the critic. For instance, if it doesn’t occur to you to think about the likely consequences of doing heroin, the decision is based on the critic’s prediction that the heroin will be awesome. If the process, probably governed by the actor, does produce a prediction of withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic’s very negative assignment of value to that part of the outcome.
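To make that concrete, here’s a minimal sketch in Python of the two decision routes described above. The actions, values, and tiny “world model” are invented for illustration; they’re not from the article or a claim about how the brain implements this.

```python
# Minimal sketch (illustrative values only) of a model-free choice, which uses the
# critic's cached value of each action, versus a model-based choice, which sums the
# critic's values over the outcomes a world model predicts.

critic_value = {                 # critic's learned value estimates
    "take_heroin": 10.0,         # "the heroin will be awesome"
    "withdrawal_and_degradation": -50.0,
    "abstain": 0.0,
}

world_model = {                  # predicted consequences of each action
    "take_heroin": ["take_heroin", "withdrawal_and_degradation"],
    "abstain": ["abstain"],
}

def decide(options, consider_consequences):
    """Pick the option the critic scores highest."""
    def score(action):
        if consider_consequences:
            # model-based: rough sum of critic values over predicted outcomes
            return sum(critic_value[outcome] for outcome in world_model[action])
        # model-free: the critic's cached value of the action itself
        return critic_value[action]
    return max(options, key=score)

print(decide(["take_heroin", "abstain"], consider_consequences=False))  # take_heroin
print(decide(["take_heroin", "abstain"], consider_consequences=True))   # abstain
```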
The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is.
I totally agree. That’s why the key question here is whether the critic can be reprogrammed after there’s enough knowledge in the actor and the world model.
As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.
The critic is only representing adult human “values” as the result of tons of iterative learning between the systems. That’s the theory, anyway.
It’s also worth noting that, even if this isn’t how the human system works, it might be a workable scheme to make more alignable AGI systems.
This is a huge practical issue that seems to not get enough thought, and I’m glad you’re thinking about it. I agree with your summary of one way forward. I think there’s another PR front; many educated people outside of the relevant fields are becoming concerned.
It sounds like the ML researchers at that conference are mostly familiar with MIRI-style work. And they actually agree with Yudkowsky that it’s a dead end. There’s a newer tradition of safety work focused on deep networks. That’s what you mostly see on the Alignment Forum. And it’s what you see in the safety teams at DeepMind, OpenAI, and Anthropic. And those companies appear to be making more progress than all the academic ML researchers put together.
Agreed on the paragraph-size comment. My eyes and brain shy away. Paragraphs, I think, are supposed to contain roughly one idea, so a one-sentence paragraph is a nice change of pace if it’s an important idea. Your TLDR was great; I think those are better at the top, where they function as an abstract and tell the reader why they might want to read the whole piece and how to mentally organize it. ADHD is a reason your brain wants to write stream of consciousness, and attention to paragraph structure is a great check on communicating to others in a way that won’t overwhelm their slower brains :)
Other communities should be moving to AF-style publication, not the other way around. This is how science should be communicated; it has all the virtues of peer review without the massive downsides.
I just moved from neuroscience to publishing on LessWrong. The publishing structure here is far superior to a journal on the whole. Waiting for peer review instead of getting it in comments is an insane slowdown on the exchange of ideas.
Journal articles are discussed by experts in private. Blog posts are discussed in public in the comments. The difference in amount of analysis shared per amount of time is massive.
Issues like mathematical or other rigor are separate issues. Having tags and other sorting systems to distinguish long and rigorous work from quick writeups of simple ideas, points, and results would allow the best of both worlds.
Furthermore, we have known this for some time. In about 2003 exactly this type of publishing was suggested for neuroscience, for the above reasons—and as a way to give credit for review work. Neuroscience won’t switch to it because of cultural lock-in. Don’t give up your great good fortune in not being stuck in an antique system.
I’m not sure I’m following you. I definitely agree that human behavior is not completely determined by the critic system, and that this complicates the alignment of brain-like AGI. For instance, when we act out of habit, the critic isn’t even invoked until the action is completed, at the earliest, and maybe not at all.
But I think you’re addressing instinctive behavior. If you throw something at my eye, I’ll blink—and this might not take any learning. If an electrical transformer box blows up nearby, I might adopt a stereotyped defensive posture with one arm out and one leg up, even if I’ve never studied martial arts (this is a personal anecdote from a neuroscience instructor on instincts). If you put sugar in my mouth, I’ll probably salivate even as a newborn.
However, those are the best examples I can come up with. I think that evolution has worked by making its preferred outcomes (or rather simple markers of them) be rewarding. The critic system is thought to derive reward from more than the four Fs; curiosity and social approval are often theorized to innately produce reward (although I don’t know of any hard evidence that these are primary rewards rather than learned rewards, after looking a good bit).
Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route to trying to hardwire behaviors in an extremely complex world, where any hardwired behavior would also have to compete with the whole forebrain system for control.
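To spell out what “expectation-discounted reward” means in standard reinforcement learning terms, here is a minimal sketch of textbook TD(0) value learning in Python. The states, reward, and constants are invented, and this is not a claim about the brain’s exact algorithm; it just shows how a prediction-error signal lets a critic propagate value from an innate reward back to the cue that predicts it.

```python
# Minimal TD(0) sketch: the critic nudges each state's value toward
# reward + discounted value of what follows (the "expectation-discounted reward").

gamma = 0.9                     # discount factor
alpha = 0.1                     # learning rate
value = {"cue": 0.0, "food": 0.0, "end": 0.0}   # critic's value estimates

def td_update(state, reward, next_state):
    # prediction error: what happened (plus discounted expectation of the future)
    # minus what the critic expected from this state
    delta = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * delta
    return delta

# Repeated episodes: a neutral cue reliably precedes an innately rewarding outcome.
for _ in range(200):
    td_update("cue", 0.0, "food")    # no reward yet, but food follows
    td_update("food", 1.0, "end")    # innate reward arrives

print(value)   # value of "food" approaches 1.0; value of "cue" approaches 0.9
```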
But again, I might be misunderstanding you. In any case, thanks for the thoughts!
Edits, to address a couple more points:
I think the critic system, and your conscious predictions and preferences, are very much in charge in your decision not to find some heroin even though it’s reputedly the most rewarding thing you can do with a single chunk of time once you have some. You are factoring in your huge preference to not spend your life like the characters in Trainspotting, stealing and scrounging through filth for another fix. Or at least it seems that’s why I’m not doing it.
The low information budget of evolution is exactly why I think it relies on hardwired reward inputs to the critic for governing behavior in mammals that navigate and learn relatively complex behaviors in relatively complex perceived environments.
It seems you’re saying that a good deal of our behavior isn’t governed by the critic system. My estimate is that even though it’s all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.
I’ll look at your posts to see if you discuss this elsewhere. Or pointers would be appreciated.
This type of thinking seems important and somewhat neglected. Holden Karnofsky tossed out a point in Success Without Dignity: the AGI alignment community seems to heavily emphasize theoretical and technical thinking over practical (organizational, policy, and publicity) thinking. This seems right in retrospect, and we should probably be correcting that with more posts like this.
This seems like an important point, but fortunately pretty easy to correct. I’d summarize it as: “if you don’t have a well-thought-out plan for when and how to stop, you’re planning to continue into danger.”
I think this is a tricky tradeoff. There’s effectively a race between alignment and capabilities research. Better theories of how AGI is likely to be constructed will help both efforts. Which one it will help more is tough to guess.
The one thought I’d like to add is that the AI safety community may think more broadly and creatively about approaches to building AI. So I wouldn’t assume that all of this thinking has already been done.
I don’t have an answer on this, and I’ve thought about it a lot since I’ve been keeping some potential infohazard ideas under my hat for maybe the last ten years.
I think maybe this was mistitled. It seems to make a solid argument against certainty in AI timelines. It does not argue against attempting predictions, or against taking seriously the distribution across those attempts.
It points to the accuracy of some predictions of space flight, then notes that others were never implemented. It could well be that there are multiple viable ways to build a rocket, a steam engine, or an AGI.
Von Braun would weep at our lack of progress on space flight. But we did not progress because there don’t actually seem to be near-term economic incentives. There probably are for AGI.
Timelines are highly uncertain, but dismissing the possibility of short timelines makes as little sense as dismissing long timelines.
Human preferences as RL critic values—implications for alignment
Along with finding structured opportunities, you can practice this attitude in most conversations with colleagues, friends, family, and partners. People complain a lot. And when they do, you can practice active listening and empathy. I believe that making this a goal of every conversation has changed my habits over time, so that I do more useful things when it’s important.
This seems like useful analysis and categorization. Thank you.
These are not logically independent probabilities. In some cases, multiple can be combined. Your trial and error, value function, The Plan, etc. could mostly all be applied in conjunction and stack success, it seems (see the toy illustration below).
For others, and not coincidentally the more promising ones, like bureaucracy and tool AI, success does not prevent new AGI with different architectures that would need new alignment strategies. Unless the first aligned ASI is used to prevent other ASIs from being built.
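A toy numerical illustration of the stacking point, with made-up probabilities of my own rather than anything from the post:

```python
# Toy numbers only: if several alignment approaches can be applied in conjunction,
# the chance that at least one works is higher than any single approach's chance,
# but only to the extent that their failures are independent.

p_success = [0.3, 0.4, 0.2]     # hypothetical per-approach success probabilities

p_all_fail = 1.0
for p in p_success:
    p_all_fail *= (1.0 - p)     # assumes independent failures

p_any = 1.0 - p_all_fail
print(f"at least one succeeds (independent failures): {p_any:.2f}")   # ~0.66

# If the approaches share a prerequisite (so they are not logically independent),
# the combined probability is capped by that shared factor.
p_shared_prerequisite = 0.5     # e.g., all rely on the same assumption holding
print(f"with a shared prerequisite: {p_shared_prerequisite * p_any:.2f}")   # ~0.33
```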
https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model
This is highly similar, I think.
In your question, you say that the Siebe post arrived at a 15% chance of long covid. That post in fact stated a 1% to 15% estimate, which is very different. Please edit your quote.
Acknowledging that humans have a variety of values doesn’t invalidate the alignment problem; it makes it harder. Sophisticated discussions of the alignment problem address this additional difficulty.
I think there’s a key error in the logic you present. The idea that a self-improving AGI will very quickly become vastly superior to humanity is based on the original assumption that AGI will consist of a relatively compact algorithm that is mostly software-limited. The newer assumption is a vastly slower takeoff, perhaps years long but almost certainly much longer than seconds, as hardware-limited neural network AGI finds larger servers, or designs and somehow builds more efficient hardware. This scenario puts an AGI in vastly more danger from humanity than your fast-takeoff scenario.
Edit: this is not to argue that the correct estimate is as high as 99.999; I’m just making this contribution without doing all the logic and math on my best estimate.
Confirmation bias is an enormous factor in conspiracy theories. If you want accurate beliefs, including accurate uncertainties about your beliefs, you want to do some serious research on confirmation bias and motivated reasoning. After studying cognitive biases for my job for four years, I believe those two carry the bulk of the practical effect.
Between the lines, I’m guessing your family is accusing you of sounding like a conspiracy theorist because you’re overconfident of your certainty in your new beliefs, based on confirmation bias in your research. I approximately share each of the beliefs you mention, after a modest amount of research on each. But I’m much less certain than you seem to be about exactly where the truth on each subject lies.
My impression after a little research is this: simple carbs that are not encased in fiber tend to cause more hunger sooner than other sources of calories. Therefore, even though all calories are roughly equal in directly causing weight gain, carbs indirectly cause more weight gain. And more discomfort, in fighting hunger.
Few people have done this sort of thinking here, because this community mostly worries about risks from general, agentic AI and not narrow AI. We worry about systems that will decide to eliminate us, and have the superior intelligence to do so. Survivor scenarios are much more likely from risks of narrow AI accidentally causing major disaster. And that’s mostly not what this community worries about.
A couple of thoughts:
It seems to me simpler to think in terms of preference fulfillment than boundaries. I strongly don’t want you in my house or messing with my body without consent, way more than I want you to wear fashions I enjoy seeing. And cells might be said to desire to keep their membranes intact.
Second, I think the idea of acausal trades can be boiled down to causal reasoning. Aren’t we just reasoning about the possibility of encountering a civilization with certain sets of values? This is just like causal reasoning about morality; one reason to be a cooperative person is so that when I encounter other cooperative people, they’ll want to cooperate with me.
Fascinating. I find the core logic totally compelling. LLMs must be narratologists, and narratives include villains and false fronts. The logic on RLHF actually making things worse seems incomplete, but I’m not going to discount the possibility. And I am raising my probabilities on the future being interesting, in a terrible way.
Sorry for the obscure reference. The Alignment Forum is the professional variant of LessWrong. It has membership by invitation only, which means you can trust the votes and comments to be better informed, and to come from real people rather than fake accounts.