Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup). So, overall, I’m quite positive on the alignment of current models!
With the preface that I’m far toward the pessimistic end of the AI x-risk/doom scale, I’m not sure how to react to the claim that current AI models are “pretty well aligned”. What’s the justification for this assessment? I’m torn between calling them “not aligned in the sense that matters”, i.e. that they’re simply not capable enough to exhibit the ways in which they’re obviously misaligned. Or calling such a claim “not even wrong” for the same reason. Or saying that anything short of transparent perfection is a sign that we’re nowhere near ready to call current alignment techniques applicable to a superintelligence.
What’s the counterposition? I’m fully convinced that current publicly available systems are very capable, but I don’t see how their output can give any positive evidence of alignment, whether in practice or in principle.