Seconding this: if you succeed hard enough at automating empirical alignment research without also automating the theoretical / philosophical foundations, you probably end up initiating RSI without a well-grounded framework that has a chance of holding up to superintelligence. Automated interpretability seems especially likely to cause this kind of failure, even if it has some important benefits.