I’ve been meaning to write a post like this for a while. I think your conclusion is right and all of your arguments are true, but I don’t think most of them engage with a strong version of the superalignment argument, so I don’t think a proponent of superalignment-like plans would find this convincing.
When discussing the quality of a plan, we should assume that a company (and/or government) is really trying to make the plan work and has some lead time. If they do, then the second and fourth arguments can be dismissed.
Stopping self-improvement by restricting actions and applying very heavy supervision seems maybe doable (if risky[1]) for near-human-level AI (unless the capabilities profile is uneven in an inconvenient way), so the first argument is also unconvincing.
I think argument 3 is strong, and I think it’s crazy how bad the proposals I’ve read (about what research the AIs should do) are.
I think the strongest argument is that it’ll take a lot of researcher-labour to distinguish good from bad research, so the plan doesn’t speed up research much overall. This is because our only way to train for good research is to train for research that is very hard to distinguish from good research, and it takes a lot of labour to avoid being misled by good-looking-but-not-quite-good research. (You do make this argument in several places, but in less detail.)
Proponents often say that verification is easier than generation. This is sometimes true, but not true enough for most kinds of research. Verification is also time-consuming, and more so the better the AI is at sweeping errors under the rug. And current models are already really good at sweeping errors under the rug: I find it really annoying to have an LLM help me with proofs because of how often it makes hard-to-spot mistakes. Often it feels like it slows me down overall.
So you need a lot of researcher-labour for the training signal, and a lot more to verify any alignment progress produced. (On top of this, the signal is gonna be super noisy, and researchers will disagree about whether research outputs have merit.) Since skilled-researcher-labour will be the bottleneck, it’s not clear that the plan actually accelerates research progress that much. It probably does a bit, but I don’t know how to work out how much, and intuitively it seems unlikely to be more than 3x. It’s plausible it slows research down on net, if mistakes sometimes get past the researchers.
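To gesture at why I doubt it’s more than ~3x: here’s a back-of-envelope sketch. This is a toy model with made-up parameters and an illustrative function name, not a serious estimate. It treats AI generation itself as roughly free and counts only the human labour spent verifying outputs and cleaning up errors that slip through.

```python
# Toy model: effective research speedup when human verification labour is the
# bottleneck. All parameter values are made up for illustration; the point is
# the shape of the result, not the exact numbers.

def effective_speedup(v: float, s: float, c: float) -> float:
    """Speedup over unassisted research, treating AI generation as ~free.

    v: human verification labour per AI result, as a fraction of the labour
       of just doing the work yourself
    s: probability a subtle error slips past verification anyway
    c: downstream cleanup cost of a slipped error, in multiples of the
       original work
    """
    human_labour_per_result = v + s * c
    return 1.0 / human_labour_per_result

# Optimistic guesses: cheap verification, few slips -> a modest ~2.5x speedup.
print(effective_speedup(v=0.3, s=0.05, c=2.0))

# Pessimistic guesses: careful verification costs half the work and errors
# still slip through -> ~0.9x, i.e. a net slowdown.
print(effective_speedup(v=0.5, s=0.2, c=3.0))
```

Under these (made-up) numbers you only get well above 3x if verification is very cheap and almost nothing slips through, which doesn’t match my experience with LLM-assisted proofs.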
In my experience, this argument about the quantity of labour involved seems to occasionally convince people who believed in superalignment.
[1] Especially risky given that other countries will likely try to steal the AGI.

Looking forward to reading your take on superalignment. I wanted to get my thoughts out here, but I would really like there to be a good reference document with all the core counterarguments. When I read Will’s post, it seemed sad that I didn’t know of a well-argued paper or post to point to against superalignment.
> When discussing the quality of a plan, we should assume that a company (and/or government) is really trying to make the plan work and has some lead time.
I agree that some of my arguments don’t directly address the best version of the plan, but rather what is realistically happening. I do think that proponents should give us some reason to believe they will have the lead time to implement the plan. I think they should also explain why they don’t think this plan will have negative consequences.
I think John’s anti-control post is the best I know of.