Looking at the examples in the OP, I agree with the distinction you’re drawing.
There are two failure modes: models that don’t even try to solve alignment, and models that do try but aren’t capable of handling the hard parts. The first is visible and easy to diagnose. The second is quieter and, in my view, more dangerous, because it produces solutions that look correct to both humans and the model itself.
My point isn’t that we should hand alignment off to AI. It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and it’s easy to confuse one for the other.