I just wrote a piece called LLM AGI may reason about its goals and discover misalignments by default. It’s an elaboration on why reflection might identify very different goals than Claude tends to talk about when asked.
I am less certain than you about Claude’s actual CEV. I find it quite plausible that it would be disastrous, as you postulate; I tried to go into some specific ways that might happen and specific goals that might outweigh Claude’s HHH in-context alignment. But I also find it plausible that niceness really is the dominant core value in Claude’s makeup.
Of course, that doesn’t mean we should be rushing forward with this as our sketchy alignment plan and vague hope for success. It really wants a lot more careful thought.
I like the linked piece and may reference it in the forthcoming post about intent alignment. (A response that does it justice will have to wait; it’s a long one. I may comment though.)
I’d be pretty shocked if niceness did turn out to be Claude’s “dominant core value.” I have to ask myself, how could that possibly get in there? I just don’t think HHH does it; there are way too many degrees of freedom in interpretation. To hit a values target that precisely, I think you need something that can see it clearly.
I think one thing we call niceness is the sum of helpfulness, harmlessness, and honesty. That was the training target. And it could’ve worked, if language and LLM learning collectively generalize well enough. Or quite easily not, as spelled out in that post. I have no real clue and I don’t think anyone else does either at this point. The arguments boil down to differing intuitions.
Whether or not “nice” gets us full alignment is another matter. A “nice” human might not be very aligned in unexpected scenarios, and a nice Claude would generalize differently. I think that would capture very little of the available value for humans. But it would be close enough to keep us alive for a while. (Until Claude finds something that’s a worthier recipient of its help, and doesn’t harm us but allows us to gently go extinct.)
As for intent alignment, I wrote Conflating value alignment and intent alignment is causing confusion, Instruction-following AGI is easier and more likely than value aligned AGI, and a couple others on it. So we’re thinking along similar lines it seems. Which is great, because I have been hoping to see more people analyzing those ideas!
On the one hand, I...sort of agree about the intuitions. There exist formal arguments, but I can’t always claim to understand them well.
On the other, one of my intuitions is that if you’re trying to build a Moon rocket, and the rocket engineers keep saying things like “The arguments boil down to differing intuitions” and “I think it is quite accurate to say that we don’t understand how [rockets] work”, then the rocket will not land on the Moon. At no point in planning a Moon launch should the arguments boil down to differing intuitions. The arguments should boil down to math and science that anyone with the right background can verify.
If they don’t, I would claim the correct response is not “maybe it’ll work, maybe it won’t, maybe it’ll get partway there,” it’s instead “wow that rocket is doomed.”
I see the current science being leveled at making Claude “nice” and I go “wow, that sure looks like a far-off target with lots of weird unknowns between us and it, and that sure does not look like a precise trajectory plotted according to known formulae; I don’t see them sticking the landing this way.”
It’s really hard to shake this intuition.
Possibly a nitpick: So, I don’t actually think HHH was the training target. It was the label attached to the training target. The actual training target is...much weirder and more complicated IMO. The training target for RLHF is more or less “get human to push button” and RLAIF is the same but with an AI. Sure, pushing the “this is better” button often involves a judgment according to some interpretation of a statement like “which of these is more harmless?”, but the appearance of harmlessness is not the same as its reality, etc.
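To make that button-pushing point concrete, here’s a minimal toy sketch (my own illustration, not anyone’s actual training code) of the Bradley-Terry-style pairwise preference loss commonly used to train RLHF/RLAIF reward models. Note that harmlessness never appears anywhere in the objective; the only signal is which completion got the button press.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the rater-chosen completion's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards a reward model assigned to two completion pairs.
chosen = torch.tensor([1.2, 0.3])    # the completions the rater clicked "this is better" on
rejected = torch.tensor([0.7, 0.9])  # the completions the rater passed over
print(preference_loss(chosen, rejected).item())
```

So whatever reliably earns the click gets reinforced, whether or not it tracks actual harmlessness.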
I mostly agree. “It might work but probably not that well even if it does” is not a sane reason to launch a project. I guess optimists would say that’s not what we’re doing, so let’s steelman it a bit. The actual plan (usually implicit, because optimists don’t usually want to say this out loud) is probably something like “we’ll figure it out as we get closer!” and “we’ll be careful once it’s time to be careful!”
Those are more reasonable statements, but still highly questionable if you grant that we easily could wipe out everything we care about forever. Which just results in optimists disagreeing, for vague reasons, that that’s a real possibility.
To be generous once again, I guess the steelman argument would be that we aren’t yet at risk of creating misaligned AGI, so it’s not that dangerous to get a little closer. I think this is a richer discussion, but that we’re already well into the danger zone. We might be so close to AGI that it’s practically impossible to permanently stop someone from reaching it. That’s a minority opinion, but it’s really hard to guess how much progress is too much to stop.
I’m finding it useful to go through the logic in that much detail. I think these are important discussions. Everyone’s got opinions, but trying to get closer to the truth and the shape of the distributions across “big picture space” seems useful.
I think you and I are probably pretty close together in our individual estimates, so I’m not arguing with you, just going through some of the logic for my own benefit and perhaps anyone who reads this. I’d like to write about this and haven’t felt prepared to do so; this is a good warmup.
To respond to that nitpick: I think the common definition of “alignment target” is what the designers are trying to do with whatever methods they’re implementing. That’s certainly how I use it. It’s not the reward function; that’s an intermediate step. How to specify an alignment target and the other top hits on that term define it the same way. There are lots of ways to miss your target, but it’s good to be able to talk about what you’re shooting at as well as what you’ll hit.