With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today’s AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There’s an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he’s only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.
Over the past decade I’ve met many AI safety people who speak as though “AI capabilities” and “AI safety/alignment” work is a dichotomy. They talk in terms of wanting to “move” capabilities researchers into alignment. But most concrete alignment work is capabilities work. It’s a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.
“Does this mean you oppose such practical work on alignment?” No! Not exactly. Rather, I’m pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it’s only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn’t the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it’s making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don’t proposition.
I think this is something of a flipside to the following point: alignment work is incentivized as a side effect of capabilities work, and there is reason to believe that alignment and capabilities can advance together without either being destroyed. The clearest illustration is jailbreaking: a jailbroken model doesn't become unaligned of its own accord; rather, the jailbreaker aligns it to themselves and takes over control. We really do live in a regime where alignment is fairly easy, at least for LLMs, and that is good news compared to the AI-pessimist view.
The tweet is below:
https://twitter.com/QuintinPope5/status/1702554175526084767
This also matters because alignment progress will naturally raise misuse risk: solutions to the control problem look very different from solutions to AI misuse, and one implication is that acceleration is far less bad, and can even look very positive, if misuse rather than misalignment is the main concern.
This is a point Simeon raised in the post linked below, where he describes a trade-off between misuse and misalignment concerns:
https://www.lesswrong.com/posts/oadiC5jmptAbJi6mS/the-cruel-trade-off-between-ai-misuse-and-ai-x-risk-concerns
So it is very plausible that as the control/misalignment problem gets solved, misuse risk increases, which is a different trade-off from the one pictured above.
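To make that trade-off concrete, here is a minimal toy sketch of my own (the numbers and the deployment assumption are illustrative, not taken from Simeon's post or the essay quoted above): each deployed system carries some probability of misalignment-driven harm and, if it stays under its operator's control, some probability of deliberate misuse, and better alignment leads to wider deployment. Under those assumptions, pushing misalignment risk down can leave the misuse term dominant, and even larger in absolute terms.

```python
# Toy illustration with assumed numbers (not empirical estimates): as alignment
# improves, per-system misalignment risk falls, but deployment grows, so the
# misuse term can come to dominate and even grow in absolute terms.

def expected_harms(p_misalign: float, p_misuse_given_aligned: float,
                   deployments: float) -> tuple[float, float]:
    """Return (misalignment-driven, misuse-driven) expected harm events."""
    misalignment_harm = deployments * p_misalign
    misuse_harm = deployments * (1 - p_misalign) * p_misuse_given_aligned
    return misalignment_harm, misuse_harm

# "Weak alignment" regime: risky systems, so few are deployed.
print(expected_harms(p_misalign=0.10, p_misuse_given_aligned=0.01, deployments=1_000))
# -> (100.0, 9.0): misalignment-driven harm dominates.

# "Strong alignment" regime: much safer per system, so far more are deployed.
print(expected_harms(p_misalign=0.001, p_misuse_given_aligned=0.01, deployments=100_000))
# -> (100.0, 999.0): misuse-driven harm now dominates and has grown in absolute terms.
```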