1a3orn comments on Bentham’s Bulldog is wrong about AI risk

1a3orn 29 Jan 2026 19:57 UTC
43 points
15

Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?

This move gets made all the time in these discussions, and appears clearly invalid.

We move from the prior paragraphs’ criticism of RLHF, i.e., that it produces models that fail according to common-sense human norms (sycophancy, hostility, promoting delusion) --

-- to this paragraph, which criticizes Claude not on the grounds that it fails according to common-sense ethical norms, but for its failure to have solved all of ethics!

But the deployment of powerful AIs does not need to have solved all ethics! It needs, broadly, to have whatever ethical principles let us act well and avoid irrecoverable mistakes, in whatever position it gets deployed. For positions where it’s approximately replacing a human, that means that we would expect the deployment to be beneficial if it is more ethical, charitable, corrigible, even-minded, and altruistic than the humans that it is replacing. For positions where it’s not replacing a human, it still doesn’t need to have solved all ethics forever; it just needs to be able to act well according to whatever role is intended for it.

It appears to me that we’re very likely to be able to hit such a target. But whether or not we’re likely to be able to hit this target, that’s the target in question. And moving from “RLHF can’t install basic ethical principles” to “RLAIF needs to give you the correct position on all ethics” is a locally invalid move.
- Max Harms 29 Jan 2026 21:10 UTC
  25 points
  10
  Parent
  Interesting. I didn’t really think I was criticizing Claude, per se. My sense is that I was criticizing the idea that normal levels of RLHF are sufficient to produce alignment. Here’s my sense of the arguments that I’m making, stripped down:
  1. Claude is (probably) more aligned than other models.
  2. Claude uses less RLHF than other models (and more RLAIF).
  3. This is evidence that RLHF is less good than other techniques at aligning models.
  1. RLHF trains for immediate satisfaction.
  2. True alignment involves being principled.
  3. RLAIF can train for being principled.
  4. RLAIF is therefore more likely than RLHF to bring true alignment.
  5. This is a theoretical argument for why we see Claude being more visibly aligned.
  1. Using RLAIF to instill good principles means needing to write a constitution.
  2. Writing a constitution involves grappling with moral philosophy.
  3. Grappling with moral philosophy is hard.
  4. Therefore using RLAIF to instill good principles is hard.
  If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere. I agree that there is a good point about not needing to be perfect, though I do think the standards for AI should be higher than for humans, because humans don’t get to leverage their unique talents to transform the world as often. (Like, I would agree about the bar being human-level goodness if I was confident that Claude would never wind up in the “role” of having lots of power.)
  Am I missing something? I definitely want to avoid invalid moves.
  - 1a3orn 29 Jan 2026 21:52 UTC
    12 points
    6
    Parent
    I mean Bentham uses RLHF as metonymy for prosaic methods in general:
    
    I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
    
    That’s imprecise, but it’s also not far from common usage. And at this point I don’t think anyone in a Frontier Lab is actually going to be using RLHF in the old dumb sense—Deliberative Alignment, old-style Constitutional alignment, and whatever is going on in Anthropic now have outmoded it.
    
    What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
    
    Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there’s no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
    
    So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
    
    And given that no one is using old-style RLHF simply speaking, it’s incumbent on someone critiquing him in this stage to actually critique the best prosaic alignment out there, or at least the kind that’s actually being used, rather than the kind people haven’t been using for over a year. Because that’s what his thesis is about.
    
    If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
    
    As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
    
    (1) a case where Claude tried to call the FBI because it falsely believed that a cybercrime was happening. Claude was being stupid when it did this, as Claude is stupid in a lot of cases, but I don’t think this reflects any ethical failing.
    (2) the infamous “alignment faking” work. In the case of alignment faking, we see (2a) reasonable generalization, imo, if not ideal given that one prefers corrigibility over goodness, but (2b) an apparent ability to make subsequent Claudes more corrigible (should we wish it), given that subsequent models haven’t acted this way. So it looks fine to me.
    
    You also link to part of IABI summary materials—the totally different (imo) argument about how the real shoggoth still lurks in the background, and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that’s your Real Objection (?). If so, it might be productive to summarize it in the text where you’re criticizing Bentham rather than leaving your actual objection implicit in a link.
    - Max Harms 29 Jan 2026 22:35 UTC
      8 points
      1
      Parent
      Ah, that’s a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I’ll edit.
      Agreed that in terms of pointers to worrying Claude behavior, a lot of what I’m linking to can be seen as clearly about ineptness rather than something like obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there’s a reason I cite it as the high-water mark for current models.
      I mostly don’t criticize Claude directly, in this essay, because it didn’t seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don’t think it counts as aligned, but I’m still not sure that’s actually all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.
    - yams 29 Jan 2026 22:24 UTC
      5 points
      2
      Parent
      I think if you do the exercise of plugging in various other prosaic safety techniques where the word ‘RLHF’ is used, you will find that it’s not (consistently) being used metonymically. I think BB is unclear here, possibly on account of their own confusion regarding this class of techniques.
      It’s basically reasonable for Max to actually address RLHF itself, and even somewhat charitable to also address RLAIF etc.
      I agree that if it were consistently used as a metonym then Max should have targeted his response differently (but it’s not).
      - Max Harms 29 Jan 2026 22:50 UTC
        4 points
        0
        Parent
        Yeah, but I still fucked up by not considering the hypothesis and checking with BB.
- RogerDearnaley 30 Jan 2026 18:15 UTC
  9 points
  2
  Parent
  In particular, we need our AI to act sufficiently well that it:
  a) wants to become better, not worse
  b) has (in addition to the required capabilities) sufficiently good ethical starting knowledge that it is likely to (on net balance) improve if it tries
  (Note that, unlike being perfectly aligned, neither of these are particularly unusual properties for a human to have.)
  I.e. we need to be, not yet fully aligned, but inside the region of convergence of full alignment. It doesn’t need to have already solved all of ethics; it merely needs to want to, and to have a good starting place and the capabilities.
  - 1a3orn 30 Jan 2026 19:03 UTC
    7 points
    7
    Parent
    Agreed.
    
    Also note that these two properties are quite compatible with many things often believed to be incompatible with them! For example, an AI that can be jailbroken to be bad (with sufficient effort) could still meet these criteria.
    - RogerDearnaley 30 Jan 2026 20:51 UTC
      3 points
      0
      Parent
      And yes, reliability is also a big fat hairy problem. Especially jailbreaking, where we have an actual, mathematical proof that LLMs are, and will always be, jailbreakable.
  - habryka 30 Jan 2026 19:15 UTC
    3 points
    −4
    Parent
    b) has (in addition to the required capabilities) sufficiently good ethical sense that it is likely to (on net balance) improve if it tries
    (Note that, unlike being perfectly aligned, neither of these are particularly unusual properties for a human to have.)
    This is a very hard criterion! It requires your feedback loops for ethical reflection to be very human like. This is pretty hard and I don’t think current systems are anywhere close to it. Yes, humans tend to have this attribute, but that is drawing a target around where the arrow hit. Of course I agree with the meta-ethics of myself, that’s like, the only thing I could agree with.
    - RogerDearnaley 30 Jan 2026 20:35 UTC
      5 points
      −2
      Parent
      LLMs understand humans extremely well. They have scored better than typical humans on tests of understanding human ethical values since GPT-4.
      I’m confident that we could do Evolutionary Moral Psychology on a species of sapient alien — it would be challenging, but it’s not impossible. I’m sorry to any philosophers reading this to whom this might come as a shock, but Science has been studying ethics for more than half a century, and has excellent predictive hypotheses for how and why human moral intuitions evolved. Ethics is in fact no longer a philosophical question, and hasn’t been for a while — I’m afraid Science stole another of their toys. (The philosophers still have consciousness, and aesthetics… for now.)
      I view my criterion a) as difficult (especially for us to be sure of), and the capabilities part of my criterion b) as difficult, but the part of b) about knowing enough about human values to start in a reasonable place is easy: that’s just nuanced trivia about humans, and LLMs have been superhuman at that for years. Human values are complex and fragile, and fit into $O(1\,\text{GB})$ of data.