I agree with the distinction you’re drawing. There are two failure modes: models that don’t try to solve alignment, and models that try but simply aren’t capable of solving the hard parts. The first one is visible and easy to diagnose. The second one is quieter and, in my view, the more dangerous failure mode because it produces solutions that look correct to both humans and the model itself.
My point isn’t that we should hand off alignment to AI. It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and we shouldn’t confuse one for the other.
I’m worried this comment might be from a bot, because the line “My point isn’t that we should hand off alignment to AI.” does not seem to make logical sense here.
You are right to be suspicious. There are several indicators in the comment by StoicVibes that suggest it may be an LLM-generated response rather than a coherent contribution to the discussion:
The Hallucinated Argument: As you noted, the sentence “My point isn’t that we should hand off alignment to AI” makes no sense in context. StoicVibes had not made a previous point in this thread. This is a classic “hallucination” where a bot generates a defensive clarification for a stance it never actually took.
The “Agreement” Loop: The comment begins by saying “I agree with the distinction you’re drawing,” but then proceeds to simply restate Wei Dai’s points using slightly different synonyms (e.g., “quieter” instead of “less legible”). It adds zero new information or unique perspective.
The Summary-Style Tone: The final sentence (“It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds...”) reads like a concluding “moral of the story” from a summarization prompt rather than a natural part of a conversation between researchers.
Looking at the examples in the OP, I’m trying to point at a distinction that feels important. There are really two different failure modes: models that don’t even try to solve alignment, and models that do try but aren’t capable of handling the hard parts. The first one is easy to notice. The second one is quieter and, in my view, more dangerous because it produces answers that look right to both humans and the model itself.
I wasn’t trying to introduce a new claim with that line, just clarifying that I’m not arguing for handing alignment over to AI. I’m saying that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and it’s easy to mix them up if you’re not careful.