One of the most common counterarguments I hear against a lot of alignment research goes something like this: “Making current AI do X with Y won’t help, because AGI will RSI to ASI and break everything.” And it... sounds convincing, but also like an ultimate counterargument, the kind that can be used to dismiss almost any proposal? I mean, I agree that aligning ASI takes much more than just RLHF or fine-tuning or data cleaning or whatever else, but does this mean that it’s all pointless? I propose a (probably not novel at all) way of thinking about this.
Imagine that there is a class of books from the future, each of which is a complete “alignment handbook” containing all the schematics for building aligned AGI, with step-by-step explanations, tips and tricks, etc. Is there another class of books, “catalyst books”, that, if presented to us, would increase our chances of writing at least one “alignment handbook”? Because if there are none, then yeah, any research that won’t give us “all the right answers” is pointless (also, we are extra fucked). And btw, any “catalyst book” that somehow increases our chances of writing an “alignment handbook” to like 99% is, practically, an “alignment handbook” itself.
I think there is a whole spectrum of those, from x1.01 books to x100 books (the number being the factor by which a book multiplies our chances), and probably even x0.95 “inhibitor books” as well. Almost all of the books in the world are x1 “neutral books”. So maybe instead of saying “It won’t help at all, because ASI will break everything” (and I almost always agree that it will), it would be better to see every research direction as a potential way of producing a good, neutral, or bad “catalyst book”?