Max Harms comments on Bentham’s Bulldog is wrong about AI risk

Max Harms 29 Jan 2026 22:35 UTC
8 points
Ah, that’s a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I’ll edit.
Agreed that, as pointers to worrying Claude behavior, a lot of what I’m linking to can be read as ineptness rather than obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there’s a reason I cite it as the high-water mark for current models.
I mostly don’t criticize Claude directly in this essay because it didn’t seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don’t think it counts as aligned, but I’m still not sure that’s all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.