I mostly agree. “It might work but probably not that well even if it does” is not a sane reason to launch a project. I guess optimists would say that’s not what we’re doing, so let’s steelman it a bit. The actual plan (usually implicit, because optimists rarely want to say this out loud) is probably something like “we’ll figure it out as we get closer!” and “we’ll be careful once it’s time to be careful!”
Those are more reasonable statements, but still highly questionable if you grant that we could easily wipe out everything we care about forever. Which just results in optimists disagreeing, for vague reasons, that that’s a real possibility.
To be generous once again, I guess the steelman argument would be that we aren’t yet at risk of creating misaligned AGI, so it’s not that dangerous to get a little closer. I think this is a richer discussion, but that we’re already well into the danger zone. We might be so close to AGI that it’s practically impossible to permanently stop someone from reaching it. That’s a minority opinion, but it’s really hard to guess how much progress is too much to stop.
I’m finding it useful to go through the logic in that much detail. I think these are important discussions. Everyone’s got opinions, but trying to get closer to the truth and the shape of the distributions across “big picture space” seems useful.
I think you and I are probably pretty close together in our individual estimates, so I’m not arguing with you, just going through some of the logic for my own benefit and perhaps for anyone who reads this. I’d like to write about this and haven’t felt prepared to do so; this is a good warmup.
To respond to that nitpick: I think the common definition of “alignment target” is what the designers are trying to do with whatever methods they’re implementing. That’s certainly how I use it. It’s not the reward function; that’s an intermediate step. “How to specify an alignment target” and the other top hits on that term define it that way, which is why I’m using it that way. There are lots of ways to miss your target, but it’s good to be able to talk about what you’re shooting at as well as what you’ll actually hit.