Joe Rogero comments on LLMs are badly misaligned

Joe Rogero 7 Oct 2025 21:14 UTC
5 points
2
On the one hand, I...sort of agree about the intuitions. There exist formal arguments, but I can’t always claim to understand them well.
On the other, one of my intuitions is that if you’re trying to build a Moon rocket, and the rocket engineers keep saying things like “The arguments boil down to differing intuitions” and “I think it is quite accurate to say that we don’t understand how [rockets] work” then the rocket will not land on the Moon. At no point in planning a Moon launch should the arguments boil down to different intuitions. The arguments should boil down to math and science that anyone with the right background can verify.
If they don’t, I would claim the correct response is not “maybe it’ll work, maybe it won’t, maybe it’ll get partway there,” it’s instead “wow that rocket is doomed.”
I see the current science being leveled at making Claude “nice” and I go “wow that sure looks like a far-off target with lots of weird unknowns between us and it, and that sure does not look like a precise trajectory plotted according to known formulae; I don’t see them sticking the landing this way.”
It’s really hard to shake this intuition.
Possibly a nitpick: So, I don’t actually think HHH was the training target. It was the label attached to the training target. The actual training target is...much weirder and more complicated IMO. The training target for RLHF is more or less “get human to push button” and RLAIF is the same but with an AI. Sure, pushing the “this is better” button often involves a judgment according to some interpretation of a statement like “which of these is more harmless?”, but the appearance of harmlessness is not the same as its reality, etc.
- Seth Herd 8 Oct 2025 4:01 UTC
  5 points
  0
  I mostly agree. “It might work but probably not that well even if it does” is not a sane reason to launch a project. I guess optimists would say that’s not what we’re doing, so let’s steelman it a bit. The actual plan (usually implicit because optimists don’t usually want to say this out loud) is probably something like “we’ll figure it out as we get closer!” and “we’ll be careful once it’s time to be careful!”
  Those are more reasonable statements, but still highly questionable if you grant that we easily could wipe out everything we care about forever. Which just results in optimists disagreeing, for vague reasons, that that’s a real possibility.
  To be generous once again, I guess the steelman argument would be that we aren’t yet at risk of creating misaligned AGI, so it’s not that dangerous to get a little closer. I think this is a richer discussion, but I also think we’re already well into the danger zone. We might be so close to AGI that it’s practically impossible to permanently stop someone from reaching it. That’s a minority opinion, but it’s really hard to guess how much progress is too much to stop.
  I’m finding it useful to go through the logic in that much detail. I think these are important discussions. Everyone’s got opinions, but trying to get closer to the truth and the shape of the distributions across “big picture space” seems useful.
  I think you and I are probably pretty close together in our individual estimates, so I’m not arguing with you, just going through some of the logic for my own benefit and perhaps for anyone who reads this. I’d like to write about this and haven’t felt prepared to do so; this is a good warmup.
  To respond to that nitpick: I think the common definition of “alignment target” is what the designers are trying to do with whatever methods they’re implementing. That’s certainly how I use it. It’s not the reward function; that’s an intermediate step. “How to specify an alignment target” and the other top hits on that term define it that way, which is why I’m using it that way. There are lots of ways to miss your target, but it’s good to be able to talk about what you’re shooting at as well as what you’ll hit.