Seth Herd comments on evhub’s Shortform

Seth Herd 8 Jun 2026 4:00 UTC
13 points
9
Gotcha. That all makes sense. I agree on almost every point. I definitely think we should be advocating for a global stop on AGI research, as well as slowdowns, as well as working on alignment research as fast as we possibly can, while trying to avoid advancing capabilities faster than alignment. These aren’t at all mutually exclusive. Everyone who sees the problem should be doing all of them IMO. I haven’t seen strong arguments against putting effort into all of these at once.

I think it IS obvious why it’s hard. Any international agreement is hard, and even people who think about p(doom) think it’s on the order of 10% and are mostly basically okay pushing ahead on those odds. This brings us to the “mindfucked somehow”. I think that’s probably basically right. I’ve laid out what I think are the relevant mechanisms, in a great deal of detail and with a great deal of empirical backing, in the post I referenced but didn’t explain: Motivated reasoning, confirmation bias, and AI risk theory. I wouldn’t use such a strong term because I don’t think it’s as strong as that implies—and calling people mindfucked is a great way to even further motivate them against you and everything associated with you.

Which brings us to the charge of socratic trolling. I certainly believe you when you say that’s not what you’re trying to do. But your intention has only a weak causal relationship to whether the effect is net-helpful or not. I had Said in mind when that term popped into my head. I followed that drama with great interest, and have very mixed feelings about it, but I think there’s a legitimate concern there. I think three years was way too long, and in the first go-round I didn’t agree with Duncan Sabien’s calls for his banning (despite his amusingly biting the bullet and arguing it was good that Socrates was banished/killed for similar behavior). But I found Habryka’s arguments in Banning Said Achmiz (and broader thoughts on moderation) convincing in the face of Zach MD’s recent objections in Comment on “Banning Said Achmiz”. I’m not sure the time invested in all of those massive posts was worth it, but I feel I did exit with a little wisdom on the dynamics of what I’m going to keep calling socratic trolling, because it mixes those virtues and costs.

After this exchange, I believe you’re being completely sincere, and that you actually had very good reasons for asking that question. I think you’re correct that this assumption deserves to be questioned and debated.

Framing it as you did without context looks somewhat combative to my eye, and also seems like an isolated demand for rigor. So I stand by my initial assessment of “concerning and probably more harm than good”. I think it’s pretty easy to mitigate the downsides while keeping the upsides. Showing proof of effort by laying out the reasons you’re asking makes it clear that you’re not trolling. And deliberately framing the question in a friendly way avoids the look of it being combative or adversarial, and thereby sparking more argument and “mindfucking” polarization (on both sides, obviously, since rationalists unfortunately still have feelings and motivations and halo/horns effects).

Anyway, this has been a useful exchange and I appreciate it. I think we’re all amateurs at what it takes to get a global treaty on pausing AI research; experts on international treaties are not at all experts on the argument that would drive it forward, and vice-versa.

Anthropic did explicitly say in their recent piece on RSI that they would start investigating how to get a pause or slowdown. It seems like other concerned parties like you and me should also chip in a little research here and there, even if it’s not our domain of direct expertise.
- TsviBT 8 Jun 2026 4:11 UTC
  6 points
  0
  Parent
  
  Framing it as you did without context looks somewhat combative to my eye,
  
  I usually have a pretty bad intuitive reaction to people telling me that my writing is “too combative”, but I don’t fully know why, and in a small fraction of the cases I end up agreeing with them later, so I won’t respond further. I do want to say what I just said, though; not sure why, but maybe, for example, I feel that it’s somehow unfair, though since I can’t explain how, this is a very unreliable sense.
  
  and also seems like an isolated demand for rigor
  
  I think it’s a really important and central supposition, and apparently has not been rigorously defended anywhere! This is 100% definitely not what the concept of “isolated demand for rigor” is for! What other such things am I allegedly not demanding rigor for, that I ought to be? Or do you disagree / am I confused somehow?
  
  So I stand by my initial assessment of “concerning and probably more harm than good”.
  
  Ok. (If you wanted to update me personally, you haven’t done that on this point.)
  
  Showing proof of effort by laying out the reasons you’re asking makes it clear that you’re not trolling.
  
  Until this point, I have been genuinely unsure if there’s simply a report / blog post / something that someone might just link me to, explaining the case!! But yes, I think you’re right, I now agree it would be better for a comment like my first to come with a few sentences explaining that it seems like there isn’t such a case, that a global stop seems plausibly feasible, that Anthropic seems super far from appropriately supporting that, and that they should.
  - Seth Herd 8 Jun 2026 4:35 UTC
    5 points
    3
    Parent
    Fair. I’m aware that people react negatively to being tone policed. Somewhat ironically, this can motivate people against the concept of motivated reasoning effects, which I think are pretty crucial to our understanding and therefore our survival. So I’ve avoided doing that for the most part, and regretted doing it. I pressed on because you didn’t immediately react badly. I’m still not sure if bringing this up is harmful or helpful.
    
    But I have heard people from the developer side of the fence say that they find LW a hostile environment and have trouble engaging here even though they feel they should. And the tone of discussions here certainly looks like tribal dynamics and polarization are happening. So I think this is an important topic, although I’m not at all sure I’m raising it the right way. That big MR post was my due diligence on making sure I know what I’m talking about; now I need to decide whether and how to push the issue more forcefully.
    
    I expect I haven’t changed your mind because you haven’t read my careful research and arguments. I wouldn’t expect you to change your mind without evidence. Of course the evidence I present in that post isn’t airtight, but to me it looks awfully likely that being at least somewhat deliberately warm/nice is worthwhile if you want to win people over to your side. That seems particularly likely if there’s anything debatable at all. If the subject is really cut-and-dried I think you can push harder and succeed, but you still slow the rate of progress if your approach is setting off enemy-warning signals in the audience’s brains.
    
    Anyway, thanks for engaging that far. This is an important topic to me, and I haven’t really engaged on it here before. So I appreciate it.
    
    On your other point: I am also unsure if there’s a report or blog post that more thoroughly lays out the case for pause being impossible. I doubt there’s a particularly good one.
    
    The inverse of such a report is a careful argument for how it is possible. I’ve laid out bits and pieces of a case for how slowdown is quite plausible, even possibly as a default on the current trajectory, but not in one coherent place or for a full pause. But the arguments can be extended to the possibility of a pause.
    
    I won’t dive into that further right now, but I think it is a worthy collaborative project for LWers.
    - TsviBT 8 Jun 2026 4:44 UTC
      4 points
      2
      Parent
      
      But I have heard people from the developer side of the fence say that they find LW a hostile environment and have trouble engaging here even though they feel they should. And the tone of discussions here certainly looks like tribal dynamics and polarization are happening.
      
      This makes sense. I will note however that when I (one time) asked an Anthropic employee about inviting someone over to their offices to explain / argue more in depth some crucial point (I forget; I think alignment difficulty), they said something like “last one or two times we tried that, the guest was dismissive”. So like, it looks a whole lot more like the crux is self-insulation, even if there is also undue hostility on LW. But, that is N=1. (I have other Ns that look like self-insulation, though of course that’s almost inherently ambiguous and my total N is small.)
      
      (I do think in the past I’ve at least watched from the sidelines, or even slightly participated in, arguably-undue dogpiley polite arguing, if not hostility.)
      
      I won’t dive into that further right now, but I think it is a worthy collaborative project for LWers.
      
      Definitely agree. (Cf. https://www.lesswrong.com/posts/Sdrzo7z3STzdrnwKW/what-exactly-would-an-international-ai-treaty-say-is-a-bad and https://www.lesswrong.com/posts/X9Z9vdG7kEFTBkA6h/what-could-a-policy-banning-agi-look-like )
      - Seth Herd 8 Jun 2026 16:25 UTC
        8 points
        5
        Parent
        Interesting. Both of those posts have the form of “what would an agreement say” which I think is totally missing the hard part. So I think that points at the answer to your original question, and why others regard it as obvious and you do not.
        
        The answer is “because there’s no political will”. And so the question isn’t what would an agreement say, it’s where would the political will come from.
        
        My answer is that the political will will come from AI progress, particularly from visible job loss and from human-seeming AI systems, which will trip pattern-matching to strange humans, which we intuitively regard as quite dangerous. Xenophobia exists for a reason; strange humans have been among our biggest dangers since the start of evolution. I’ve written about this in A country of alien idiots in a datacenter: AI progress and public alarm and bits and pieces elsewhere.
        
        On the positive side, this answer is that the political will will come, which shifts the question back to having an agreement or treaty ready to offer.
        
        The problem with this answer is that the will might come too late. By the time systems are visibly taking jobs and acting agentically and competently, we might already have a takeover-capable system in development, and it will be too late for anything as slow as international agreements. There still might be time for an executive order and informal power grabs in the face of adequate public (and politician) freakout. And that might be substantially useful, since only China is near the US, and they’d probably take a much more cautious approach to AGI and alignment (see China won’t win the AI race but would it be much worse if it did? and similar). I’ve written about this in Whether governments will control AGI is important and neglected and I now think the answer is just clearly yes, but maybe not in time.
        
        That’s a slowdown, not a pause, but it could perhaps be expanded to a pause if the discussion can move quickly between the US and China. I think this is possible. We’re really not enemies, just competitors. And our researchers are quite well-disposed toward each other.
        
        Anyway, I think the major crux between us and the rest of the world is alignment risk. The average belief even among those that acknowledge the risk (which tbf is now pretty much anyone thinking about AGI) is maybe ten percent or lower. That’s enough to make some people want to pause, but not I think most of them, and not enough to make it a high priority.
        
        So I think the clearest path toward pause (or slowdown) is clearer arguments about alignment risk.
        
        Here I think overclaiming on the technical arguments has done grievous harm to the cause. Yudkowsky’s claim that misalignment is 99%+ likely has drawn much irritation and ire and attention. People, even sophisticated people, routinely argue “alignment is possible” instead of arguing about how possible. I think they’re quite correct that Yudkowsky’s technical argument is full of holes, but quite wrong in the implied leap from there to ~10% risk. Alignment may be quite achievable and still quite difficult to achieve on the current rushed path.
        
        I think arguments for human incompetence on first tries and under pressure are a much better bet. To his credit, EY has shifted hard in this direction, and so have the handful of others making technical arguments for alignment difficulty. My arguments center on model uncertainty: nobody knows how hard alignment is; estimates from people with real time-on-task range from very low to very high; therefore the wisest assumption is that it could be extremely difficult and we are foolish to press ahead with so much unknown.
        
        Here I think we could do vastly better. Optimists reason that current systems seem pretty aligned, so we’re probably on track to align more powerful systems. Pessimists argue that this isn’t useful evidence at all. Identifying cruxes and improving models of likely first AGI seems quite achievable, so that’s what I’m primarily working toward and asking others to engage in.
        
        WRT the Anthropic office visits: This has the general form of “it’s their fault not ours” which is suspicious. In most disagreements, both parties blame the other. And even if it is totally their fault, I’d rather survive than assign blame. Usually the way forward in resolving interpersonal issues is “sorry about that, let’s try again” and then be nicer.
        
        This is when you’re trying to reach mutual agreement with someone, not when you’re trying to negotiate a deal and have some leverage. Discussions about beliefs only resemble negotiations when the evidence is overwhelming. And on the dangers of alignment, it’s just unfortunately not.
        TsviBT 8 Jun 2026 16:47 UTC
        3 points
        0
        Parent
        
        Both of those posts have the form of “what would an agreement say” which I think is totally missing the hard part. So I think that points at the answer to your original question, and why others regard it as obvious and you do not.
        
        You might be overinferring what I think these blog posts indicate? I’m just gesturing that I agree that the overall project of figuring out how the whole thing might be feasible is a worthy project.
        
        The answer is “because there’s no political will”.
        
        I know that this is a thing people say, and I agree there isn’t already automatically political will pre-gathered. But if the implication is that it would be an infeasible task to create and gather the political will for a global stop, that implication is one I strongly question! And so far I hear lots of signs pointing in the opposite direction, and I’m grateful to the people working on that. I just wish that Anthropic would support those efforts.
        
        WRT the Anthropic office visits: This has the general form of “it’s their fault not ours” which is suspicious.
        
        Not blaming, describing. Can’t survive without describing.
        
        (Anyway, just FYI, your time might be somewhat wasted if you want to get me on board with a particular approach / stance, because I’m much more commenting from the sidelines rather than an active participant; I’m focusing on other things, while others are actually working on communicating with the public and political leaders and so on.)
        Seth Herd 8 Jun 2026 22:55 UTC
        3 points
        1
        Parent
        That’s fine, I’ll consider it workshopping.
        
        I’m also primarily occupied with other things. I’m spending some time on communication strategy and the logic of how opinion and policy could change, because it seems like it could be critically important, and not enough people seem to be thinking about it. As you note.
        
        I hope you will too.