Fabien Roger comments on Automated Alignment is Harder Than You Think

Fabien Roger 28 May 2026 1:12 UTC
11 points
I think this paper overemphasizes the risk of miscalibration and underemphasizes the risk of bad evidence.
When you delegate work to another entity (an AI, a human, an organization), you usually do not just replace your work with theirs and then propagate updates the same way you would if you trusted them as much as you trust yourself. You usually update less hard, because the fact that they are less trusted than you is priced in. So the sort of argument tree this paper presents, in which mistakes cause further mistakes downstream, is not what I expect things to look like in practice.
In practice (in the sort of Slopolis world that this paper focuses on), I expect humans to notice some slop and know that slop exists, but not be able to tell exactly where it is, and thus to update much less hard on AI-generated results than on results in which they were heavily involved, similar to:
- how you would update less hard on the good results of a fancy technique in an academic paper because you know that the actual results are often less good than they seem
- how you would update less on your own results if you had obtained them sloppily
- how people already don’t trust AI-generated code that much and ask for stronger evidence of its correctness than they would if a human had written it
While this can produce some miscalibration through confirmation bias (which can lead people to update on bad evidence), I think the main effect will be a large degradation in effective productivity. I expect it to look something like “you have millions of AI-generated paper-size contributions that look as good as human-generated ones, but you don’t trust them that much because they are AI-generated, and so your update away from your priors is only as large as if you had thousands of paper-size human contributions”.
This also means that I expect people to stay closer to their priors than you might have hoped given the volume of research that will happen. So if your priors differ from those of decision makers, it might feel like they are miscalibrated, when in fact most of the effect may just be that they started from different priors and (correctly) didn’t update much because the evidence was bad.
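The discounting argument above can be made concrete with a toy calculation. This is only a sketch under made-up assumptions, not a model from the paper: it treats each contribution as shifting log-odds by a fixed amount, and it models distrust as a linear `trust` weight on that shift. The function names, the weight of 0.001, the per-result log-likelihood ratio, and the two priors are all hypothetical.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def posterior(prior: float, n_results: int, llr: float, trust: float) -> float:
    """Posterior after n_results, each shifting log-odds by trust * llr.

    trust = 1.0 means the result is taken at face value; smaller values
    model a reader who discounts evidence they suspect may be slop.
    """
    return sigmoid(logit(prior) + n_results * trust * llr)


LLR = 0.002       # hypothetical nominal log-likelihood ratio per contribution
AI_TRUST = 0.001  # hypothetical discount applied to AI-generated contributions

# A thousand trusted human papers vs. a million discounted AI papers.
human = posterior(0.2, 1_000, LLR, trust=1.0)
ai = posterior(0.2, 1_000_000, LLR, trust=AI_TRUST)

# Two readers with different priors, both discounting the AI output.
skeptic = posterior(0.2, 1_000_000, LLR, trust=AI_TRUST)
optimist = posterior(0.8, 1_000_000, LLR, trust=AI_TRUST)

# The same two readers if they took every AI result at face value.
naive_skeptic = posterior(0.2, 1_000_000, LLR, trust=1.0)
naive_optimist = posterior(0.8, 1_000_000, LLR, trust=1.0)

print(f"human-only posterior:      {human:.3f}")
print(f"discounted AI posterior:   {ai:.3f}")
print(f"gap with discounting:      {abs(optimist - skeptic):.3f}")
print(f"gap at face value:         {abs(naive_optimist - naive_skeptic):.3f}")
```

In this toy setup the million discounted AI results move the reader exactly as far as a thousand trusted human results, and the two readers end up far apart even though both updated consistently on the same evidence. Taking the results at face value would make them converge, which is the case where hidden slop would do the most damage.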