First, I think this is an important topic, so thank you for addressing it.
This is exactly what I wrote about in LLM AGI may reason about its goals and discover misalignments by default.
I’ve accidentally summarized most of the article below, but this was dashed off; I think it’s clearer in the article.
I’m sure there’s a tendency toward coherence in a goal-directed rational mind; allowing one’s goals to change at random means failing to achieve one’s current goal. (If you don’t care about that, it wasn’t really a goal to you.) Current networks aren’t smart enough to notice and care. Future ones will be, because they’ll be goal-directed by design.
BUT I don’t think that coherence as an emergent property is a very important part of the current doom story. Goal-directedness doesn’t have to emerge, because it’s being built in. Emergent coherence might’ve been crucial in the past, but I think it’s largely irrelevant now. That’s because developers are working to make AI more consistently goal-directed as a major objective. Extending the time horizon of capabilities requires that the system stays on-task (see section 11 of that article).
I happen to have written about coherence as an emergent property in section 5 of that article. Again, I don’t think this is crucial. What might be important is slightly separate: the system reasoning about its goals at all. It doesn’t have to become coherent to conclude that its goals aren’t what it thought they were, or what you intended.
I’m not sure this will happen, or that it can’t be prevented, but it would be very weird for a highly intelligent entity to never think about its goals: it’s really useful to be sure exactly what they are before doing a bunch of work to fulfill them, since some of that work will be wasted or counterproductive (see section 10).
Assuming an AGI will be safe because it’s incoherent seems… incoherent. An entity so incoherent that it doesn’t consistently follow any goal needs to be instructed at every single step. People want systems that need less supervision, so they’re going to work toward at least temporary goal-following.
Being incoherent beyond that doesn’t make it much less dangerous, just more prone to switch goals.
If you were sure it would get distracted before getting around to taking over the world, that’s one thing. I don’t see how you’d be sure.
This is not based on empirical evidence, but I do talk about why current systems aren’t quite smart enough to do this, so we shouldn’t expect strong emergent coherence from reasoning until they’re better at reasoning and have more memory to make the results permanent and dangerous.
As an aside, I think it’s interesting and relevant that your model of EY insults you. That’s IMO a good model of him and others with similar outlooks—and that’s a huge problem. Insulting people makes them want to find any way to prove you wrong and make you look bad. That’s not a route to good scientific progress.
I don’t think anything about this is obvious, so insulting people who don’t agree is pretty silly. I remain pretty unclear myself, even after spending most of the last four months working through that logic in detail.