Curated. I think that the Constitution is pretty doomed as an approach to value alignment in the limit of scaling to ASI (and it’s possible the authors agree; I’m not sure what they think). So identifying its weaknesses in less superhuman regimes seems more important, to the extent that anyone has plans in that space which depend on it[1].
This post makes a few important observations[2].
First, the Constitution seems confused about what corrigibility is, to a greater degree than seems strictly necessary. We lack a good formalization so in some sense everybody’s confused, but the Constitution mixes up corrigibility and a fuzzier notion of “broad safety” when trying to point to corrigibility. I think this is an error by the authors of the Constitution, regardless of what plan they have for it (unless that plan involves actively trying to confuse Claude about the referent of corrigibility, which I don’t think is the intent).
Second, the Constitution seems ambivalent[3] on the question of how much Claude should model itself as an “independent moral agent”, responsible for realizing its own values unto the world, and how this can meaningfully be reconciled with the bits about being (partially) corrigible. This isn’t necessarily a mistake by the authors’ own lights, but I think it suggests some underlying conceptual confusions.
It’s not explicitly mentioned in the post, but if your plan for things going well routes through a step like creating an automated alignment researcher that’s better than the best humans but not superhuman enough that it can execute a takeover, and which seems empirically/behaviorally “aligned” enough that we can reasonably delegate work to it, making sure that agent has a very strong propensity to corrigibility seems like an overriding concern.
On the other hand, if your plan is to basically just keep scaling capabilities and delegate alignment to existing scalable oversight schemes like RLAIF & descendants, and you think your current efforts have already landed us in the surprisingly-wide attractor basin of “actually cares about humans”[4], then maybe the current thing makes more sense.
Realistically, I’d guess that neither of those two options accurately represents the models and motivations of the Constitution’s authors, and there’s probably some “operating over a distribution of possible worlds” stuff going on (and probably some of me just being totally off-base about what they believe), but I hope that presenting them as options makes it easier for others to clarify their own beliefs: “No, actually, I believe X, not Y”.
[1] And to the extent that they’re paying any attention to external criticism, obviously.
[2] Though I only discuss a couple of them below.
[3] At best.
[4] Not a belief I have, nor one that I confidently believe that the authors of the Constitution endorse, merely one that might cause this sort of plan to “make sense”.