Looking at the examples in the OP, I’m trying to point at a distinction that feels important. There are really two different failure modes: models that don’t even try to solve alignment, and models that do try but aren’t capable of handling the hard parts. The first is easy to notice. The second is quieter and, in my view, more dangerous, because it produces answers that look right to both humans and the model itself.
I wasn’t trying to introduce a new claim with that line, just clarifying that I’m not arguing for handing alignment over to AI. I’m saying that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and it’s easy to mix them up if you’re not careful.