We Need Better Plans for Short Timelines
“Plans are worthless, but planning is everything.” – Dwight D. Eisenhower
I believe this wholeheartedly. Planning demands thinking realistically about the situation and the goals, and that dramatically ups the odds of success. Plans won’t perfectly address what actually occurs, but that’s evidence you need more, not less, planning.
The responses to @Marius Hobbhahn’s “What’s the short timeline plan?” convinced me that we need better plans for alignment. Fairly complete plans were given for control and interpretability, but those only aid and backstop alignment; we need better plans that cross the finish line to aligned AGI in the too-likely case that it happens soon. Scenario projects like Takeover in 2 Years and AI 2027 (and other work, perhaps including some of mine) are a good start at thinking realistically about specific routes to AGI on the current path, but we need good plans that turn those disaster scenarios into success.
That’s why I was excited to learn of Peter Gebauer’s project to create a contest for short timeline planning, now up on Manifund. Peter has already done a lot of work thinking through how a contest could be structured to produce good plans, but funding is needed for prize money and for time to collaboratively figure out what contest structure will produce useful plans. Planning how to plan may sound too indirect, but it’s part of how large-scale projects succeed. Another objection is that planning wastes the valuable time of our most valuable researchers. But good planning demands really understanding the technical challenges.
The remainder is a little more on why I think planning for short timelines is so crucial and neglected.
Alignment needs to be more efficient than other sciences, particularly if time is short.
I spent a couple of decades in cognitive neuroscience marveling at the lack of broad-level planning. The field was following local incentive gradients, not following a plan for efficient success as a whole endeavor, because that’s not how science works. In funding structures for science (including alignment), it’s nobody’s job to produce field-wide plans for achieving some goal. (Grantmakers and to a lesser extent grant proposals do think about moving the field in a useful direction, but typically do not produce plans aimed at overall success in a given timeframe.)
While science must be done, technical alignment will ultimately also be an engineering effort, and engineering efforts rely on careful planning. The orgs that will develop the first truly dangerous AGIs will certainly do some planning, but I think you’ll agree that they could probably use some outside help.
The current de facto plan (which I hadn’t really understood before talking to Peter) seems to be:
1. Lay the groundwork by studying current networks.
2. Buy time with governance and control.
3. Make an actual plan when we get closer, so we can plan for the type of AGI we’ve actually got to align.
This is a perfectly reasonable plan, except when you take short timelines seriously. In that case, the time to make an actual plan is yesterday, and we can already see enough of what type of AGI we’ll need to align to start making one.
I think it’s emotionally difficult for most of us to take short timelines seriously. I think it’s a huge ugh field and a source of motivated reasoning. My timelines have broad uncertainties, and I notice I have trouble thinking seriously about the short end of that range. I’ve heard many serious alignment thinkers say “in short timelines we just die.” I don’t think that conclusion is warranted or wise. That thinking just isn’t complete: we haven’t gone through the outs we might play for if we take both the alignment challenges and short timelines seriously.
LLMs seem aligned, so maybe we don’t need a plan?
There’s another, more optimistic reason to think we don’t need better plans for short timelines. In short timelines, AGI will be closely related to current LLMs. LLMs seem pretty well aligned, so current techniques should work for their close descendants, right?
I think this is a forlorn hope. It has been examined far too little to count as a plan. I think it does not hold up, but at the least it deserves careful elucidation. One plan resulting from a contest like the one Peter proposes might be “why current alignment techniques are adequate for short timelines.” I think elucidating such a plan, rather than leaving it as a hope, will reveal that it isn’t true, and that would be an extremely valuable outcome of such a contest. (A realistically optimistic plan for prosaic alignment all the way to real AGI would be even better.)
I’d really love to see more careful thought on how current approaches lead to AGI and on attempts to align it; making good plans requires it. I have been working on essentially that line of thought for the last two years, and I have become increasingly pessimistic about our odds.
I don’t want to stretch your patience or the bounds of a “quick take,” so I won’t try to explain exactly why that is. But I will share the briefest broad picture.
My shift in optimism is described somewhat in “LLM AGI will have memory, and memory changes alignment” and “Problems with instruction-following as an alignment target.” But I haven’t finished writing up my latest turn toward pessimism. In brief: I think incentives to perform long time-horizon tasks will produce models that can reason and learn to some degree, and whether they do that poorly or very well, either way creates entirely new alignment challenges.
Poor reasoning about goals will produce chaos, as agents adopt new goals through bad reasoning. There are mechanisms that could stabilize goals in LLM agents and prevent them from reasoning about their goals, but these seem failure-prone in my limited analysis, which is the most thorough I can find.
Excellent reasoners present a different alignment problem: they will shift from LLMs’ current habitual behavior to fully goal-directed behavior (in neuroscience terms). This is a very different mode of choosing actions, one that can produce completely different behavior in humans and animals. I had previously hoped that this transition would be a continuous shift, so that agents would adopt explicit, reasoned goals consistent with their past habitual behavior (because past behavior and beliefs create an attractor in goal-space). I now consider this unlikely. So I now take very seriously the longstanding concerns about inner misalignment. I think we have very poor theories of how goals work and how training instills them. And I now think that the good behavior of LLMs to date is almost no evidence that the desired goals have truly been instilled in the system in a way that will survive their likely transition to capable, reflective reasoners.
I hope to see these fears proven unfounded, or to see alternative plans for alignment on short timelines that can be realistically implemented. That’s why I’m desperate for more planning efforts. I hope you’ll go fund Peter’s project, start your own, talk to colleagues about short timeline plans, or at least think about how we get more strong planning efforts in short order.