Interesting. I didn’t really think I was criticizing Claude, per se. My sense is that I was criticizing the idea that normal levels of RLHF are sufficient to produce alignment. Here’s my sense of the arguments that I’m making, stripped down:
Claude is (probably) more aligned than other models.
Claude uses less RLHF than other models (and more RLAIF).
This is evidence that RLHF is less good than other techniques at aligning models.
RLHF trains for immediate satisfaction.
True alignment involves being principled.
RLAIF can train for being principled.
RLAIF is therefore more likely than RLHF to bring true alignment.
This is a theoretical argument for why we see Claude being more visibly aligned.
Using RLAIF to instill good principles means needing to write a constitution.
Writing a constitution involves grappling with moral philosophy.
Grappling with moral philosophy is hard.
Therefore using RLAIF to instill good principles is hard.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere. I agree that there is a good point about not needing to be perfect, though I do think the standards for AI should be higher than for humans, because humans don’t get to leverage their unique talents to transform the world as often. (Like, I would agree about the bar being human-level goodness if I was confident that Claude would never wind up in the “role” of having lots of power.)
Am I missing something? I definitely want to avoid invalid moves.
I mean Bentham uses RLHF as metonymy for prosaic methods in general:
I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
That’s imprecise, but it’s also not far from common usage. And at this point I don’t think anyone in a Frontier Lab is actually going to be using RLHF in the old dumb sense—Deliberative Alignment, old-style Constitutional alignment, and whatever is going on in Anthropic now have outmoded it.
What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there’s no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
And given that no one is using old-style RLHF simply speaking, it’s incumbent on someone critiquing him in this stage to actually critique the best prosaic alignment out there, or at least the kind that’s actually being used, rather than the kind people haven’t been using for over a year. Because that’s what his thesis is about.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
(1) a case where Claude tried to call the FBI because it falsely believed that a cybercrime was happening. Claude was being stupid when it did this, as Claude is stupid in a lot of cases, but I don’t think this reflects any ethical failing.
(2) the infamous “alignment faking” work. In the case of alignment faking, we see (2a) reasonable generalization, imo, if not ideal given that one prefers corrigibility over goodness, but (2b) an apparent ability to make subsequent Claudes more corrigible (should we wish it), given that all subsequent models haven’t acted this way. So it looks fine to me.
You also link to part of IABI summary materials—the totally different (imo) argument about how the real shoggoth still lurks in the background, and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that’s your Real Objection (?). If so, it might be productive to summarize it in the text where you’re criticizing Bentham rather than leaving your actual objection implicit in a link.
Ah, that’s a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I’ll edit.
Agreed that in terms of pointers to worrying Claude behavior, a lot of what I’m linking to can be seen as clearly about ineptness rather than something like obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there’s a reason I cite it as the high-water mark for current models.
I mostly don’t criticize Claude directly, in this essay, because it didn’t seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don’t think it counts as aligned, but I’m still not sure that’s actually all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.
I think if you do the exercise of plugging in various other prosaic safety techniques where the word ‘RLHF’ is used, you will find that it’s not (consistently) being used metonymically. I think BB is unclear here, possibly on account of their own confusion regarding this class of techniques.
It’s basically reasonable for Max to actually address RLHF itself, and even somewhat charitable to also address RLAIF etc.
I agree that if it were consistently used as a metonym then Max should have targeted his response differently (but it’s not).
Yeah, but I still fucked up by not considering the hypothesis and checking with BB.