Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he’s making progress.
I don’t want to respond to the examples rather than the underlying argument, but it seems necessary here: this seems like a massively overconfident claim about ELK and debate that I don’t believe is justified by popular theoretical worst-case objections. I think a common cause of “worlds where iterative design fails” is “iterative design seems hard and we stopped at the first apparent hurdle.” Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., “no-go theorems”); but that is not the case here. If I felt confident that the theory-level objections to ELK and debate ruled out hodge-podge solutions, I would abandon hope in these research directions and drop them from the portfolio. But this “flinching away” would ensure that genuine progress on these thorny problems never happened. If Stephen Casper et al. treated ELK as unsolvable, they would never have published “Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.” If Akbir Khan et al. treated debate as unsolvable, they would never have published “Debating with More Persuasive LLMs Leads to More Truthful Answers.” I consider these papers genuine progress towards hodge-podge alignment, which I see as a viable strategy towards “alignment MVPs.” Many more such cases.
A crucial part of my worldview is that good science does not have to “begin with the entire end in mind.” Exploratory research pushing the boundaries of the current paradigm might sometimes seem contraindicated by theoretical arguments, yet achieve empirical results anyway. The theory-practice gap can cut both ways: sometimes sound-seeming theoretical arguments fail to predict tractable empirical solutions. Just because the travelling salesman problem is NP-hard doesn’t mean that a P-time algorithm for a restricted subclass of the problem doesn’t exist! I do not feel sufficiently confident in the popular criticisms of dominant prosaic AI safety strategies to rule them out; far from it. I want new researchers to download these criticisms, as there is valuable insight there, but to “touch reality” anyway, rather than flinch away from the messy work of iteratively probing and steering vast, alien matrices.
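To make the TSP aside concrete, here is a minimal sketch (my own illustration, not from the original discussion; the choice of Python and the hypothetical line_tsp helper are mine): for the restricted subclass in which every city lies on a single line, the optimal tour is just the sorted order, computable in O(n log n) time even though the general problem is NP-hard.

```python
# Toy illustration (assumption: cities are 1D coordinates on a line).
# General TSP is NP-hard, but this restricted subclass is easy: the optimal
# closed tour sweeps from the leftmost city to the rightmost and back.

def line_tsp(positions: list[float]) -> tuple[list[float], float]:
    """Optimal closed TSP tour when every city sits on a single line."""
    if not positions:
        return [], 0.0
    order = sorted(positions)  # visit cities left to right
    # The forward sweep covers (max - min); the closing edge back to the
    # start covers the same distance again, and no closed tour can do better.
    length = 2 * (order[-1] - order[0])
    return order, length


if __name__ == "__main__":
    tour, length = line_tsp([3.0, -1.5, 7.0, 0.0, 4.5])
    print(tour)    # [-1.5, 0.0, 3.0, 4.5, 7.0]
    print(length)  # 17.0 == 2 * (7.0 - (-1.5))
```

The point of the toy: worst-case hardness of the general problem says nothing about structured special cases, which is exactly the shape of the theory-practice gap described above.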
Maybe research directions like ELK and debate fizzle out because the theoretical objections hold up in practice. Maybe they fizzle out for unrelated reasons, such as weird features of neural nets, or simply because we don’t have time for the requisite empirical tinkering. But we would never find out unless we tried! Nor would we equip a generation of empirical researchers to iterate on candidate solutions until something breaks. It’s this skill that I think is most lacking in alignment: turning theoretical builder-breaker moves into empirical experiments on frontier LLMs (apparently the most likely substrate of first-gen AGI) that move us iteratively closer to a sufficiently scalable hodge-podge alignment MVP.
Some caveats:
A crucial part of the “hodge-podge alignment feedback loop” is “propose new candidate solutions, often grounded in theoretical models.” I don’t want to entirely focus on empirically fleshing out existing research directions to the exclusion of proposing new candidate directions. However, it seems that, often, new on-paradigm research directions emerge in the process of iterating on old ones!
“Playing theoretical builder-breaker” is an important skill, and I think it should be taught more widely. “Iterators,” as I conceive of them, are capable of playing this game well, in addition to empirically testing these theoretical insights against reality. John, to his credit, did a great job of emphasizing the importance of this skill with his MATS workshops on the alignment game tree and similar topics.
I don’t want to entirely trust in alignment MVPs, so I strongly support empirical research that aims to show the failure modes of this meta-strategy. I additionally support the creation of novel strategic paradigms, though I think this is quite hard. IMO, our best paradigm-level insights as a field largely come from interdisciplinary knowledge transfer (e.g., from economics, game theory, evolutionary biology, physics), not raw-g ideas from the best physics postdocs. Though I wouldn’t turn away a chance to create more von Neumanns, of course!
Yeah, the worst-case ELK problem could well have no solution, but in practice alignment is solvable, either by other methods or by an ELK solution that works on a large class of AIs like neural nets, so Alice is plausibly making a big mistake. A crux here is that I don’t believe we will ever get no-go theorems, or even arguments at the standard level of rigor in physics, because I believe alignment has pretty lax constraints, so a lot of solutions can appear.
The relevant sentence below:
Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., “no-go theorems”); but that is not the case here.