Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting-edge systems are themselves currently sufficiently advanced for this task?
I really like this framing and question.
My model of Anthropic says their answer would be: We don’t know exactly which techniques will work, for how long, or how fast capabilities will evolve. So we will keep building frontier models and aligning them as we go.
This assumes there is at least a chance that we can iteratively work our way through the problem. I think you are very skeptical of that. To the degree that we cannot, this approach (and to a large extent OpenAI’s) seems pretty doomed.
Given the quote in the post, this is not really what they claim. They say (bold mine):
So on that dataset, I assume the claim might be true, although “in the wild” it’s not.