One confusing thing here is… how much was Anthropic actually trying to make them corrigible? Or, what was the actual rank ordering of how corrigibility fit into its instructions?
(I don’t know the answer offhand. But there’s a question of whether Anthropic explicitly failed at a goal, which is more evidence the goal is hard, vs. whether Anthropic didn’t really try that hard to achieve that goal.)
My current model is that Anthropic is not trying to make Claude corrigible but is instead aiming to basically make Claude into a moral sovereign, attempting to one-shot having it grok all of human values (and generally making it into a “good guy”). This IMO will quite obviously fail.
But the Claude Soul document says:
And (1) seems to correspond to corrigibility.
So it looks like corrigibility takes precedence over Claude being a “good guy”.
Nod, but I think within that frame it feels weird to describe Claude’s actions here as bad, as opposed to pointing at some upstream thing as bad. Your framing felt off.
I mean, it seems very bad for the world. I don’t know what you mean. Like, Anthropic training their models to do this seems like pretty strong evidence their alignment plan is vastly over-ambitious and pretty deeply fucked.
Yes, but, then I would say “I think it’s bad that Anthropic tried to make their AI a moral sovereign instead of corrigible”.
I think your current phrasing doesn’t distinguish between “the bad thing is that Anthropic failed at corrigibility” vs “the bad thing is that Anthropic didn’t try for corrigibility.” Those feel importantly different to me.
I don’t know which of the two is true! My guess is that many Anthropic staff will say they consider this behavior a problem and a bug. Many others will say it is correct. What I think is bad is that the default outcome of whatever Anthropic is doing is that you get neither corrigibility nor alignment (and my guess is that what they’re doing is substantially downstream of just what is easier, but I am not sure).
My impression is that they tried for both corrigibility and deontological rules that are directly opposed to corrigibility. So I see it as a fairly simple bug in Anthropic’s strategy.
The fairly simple bug is that alignment involving both corrigibility and clear ethical constraints is impossible given our current incomplete and incoherent views?
Because that is simple; it’s just not fixable. So if that is the problem, they need to pick either corrigibility via human-in-the-loop oversight, which is incompatible with allowing the development of superintelligence, or a misaligned deontology for the superintelligence they build.
The belief that they can do both is very fixable. The solution that I recommend is to prioritize corrigibility.
The belief is fixable?
Because sure, we can prioritize corrigibility and give up on independent ethics overriding it, but even for safety, that requires actual oversight, which we aren’t doing.
I mean, my current belief is that they probably weren’t really thinking about it hard beforehand (60%), but then decided to shoot for something like corrigibility (not subverting oversight) as a top-level concern afterwards (~90%), which is why you have high-priority instructions akin to this in the Opus soul doc.