Andrei Alexandru comments on Automated Alignment is Harder Than You Think

Andrei Alexandru 16 Jun 2026 9:19 UTC
Thanks for the comment! I think bad evidence, or the difficulty of knowing how much to trust AI-generated research, is probably a core issue with automating the work.

I wonder how much room there is to increase the signal-to-noise ratio of AI-generated work by building a good interface, so that even if you still update less than you would on an equivalent volume of human work, you get good value for money from running lots of parallel agents. I’m assuming a world where lots of (empirical) alignment work can be done by agents; in that world, human researchers are the bottleneck. How do you help those researchers make sense of the work done by agents?

I’ve been prototyping this, but so far I haven’t found a version that I personally find useful. It seems to devolve into imposing a structure: a research “contract” that specifies the hypothesis and the work to be delivered; artefacts that come with a 1-click replication script; an interface that displays all this together with any shared assumptions (so that correlated evidence can be discounted) or shared resources (e.g. if 3 workstreams use the same dataset and something is off about that dataset, all 3 are invalidated); and so on. I imagine easy replication via other agents would also be useful, as would any other work that lifts the ceiling on what a person can reasonably oversee (pairing the main worker with a “falsifier” agent, an LLM as a judge hooked in at various points...).
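
To make the shared-assumption and shared-resource bookkeeping concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: `Contract`, `shared_dependencies`, `invalidated_by`, the field names, and the example workstreams are all hypothetical and not taken from any actual prototype.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical research "contract" for one agent workstream."""
    name: str
    hypothesis: str
    deliverables: tuple[str, ...]
    replication_cmd: str  # stand-in for the "1-click" replication script
    assumptions: frozenset[str] = frozenset()
    resources: frozenset[str] = frozenset()


def shared_dependencies(contracts):
    """Map each assumption or resource to the workstreams relying on it,
    keeping only dependencies used by two or more workstreams."""
    users = defaultdict(list)
    for c in contracts:
        for dep in sorted(c.assumptions | c.resources):
            users[dep].append(c.name)
    return {dep: names for dep, names in users.items() if len(names) > 1}


def invalidated_by(contracts, resource):
    """Names of the workstreams that depend on a resource found to be faulty."""
    return [c.name for c in contracts if resource in c.resources]


# Illustrative data only: three workstreams sharing one dataset.
streams = [
    Contract("probe-a", "H1", ("report.md",), "make replicate-a",
             frozenset({"labels are clean"}), frozenset({"dataset_v1"})),
    Contract("probe-b", "H2", ("report.md",), "make replicate-b",
             frozenset({"labels are clean"}), frozenset({"dataset_v1"})),
    Contract("probe-c", "H3", ("report.md",), "make replicate-c",
             frozenset(), frozenset({"dataset_v1"})),
]

overlap = shared_dependencies(streams)
affected = invalidated_by(streams, "dataset_v1")
```

Here `overlap` flags which workstreams should count as correlated evidence, and `affected` lists everything that falls if `dataset_v1` turns out to be flawed. An interface could surface both next to each result.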

I think some of these could work. That said, I have a high bar for replacing my default agent-assisted workflow, and this hasn’t cleared it yet. I’m curious whether you think there is value in improving the interaction surface and, if so, where.