Seconding this: if you succeed hard enough at automating empirical alignment research without also automating the theoretical / philosophical foundations, you probably end up initiating RSI without a well-grounded framework that has a chance of holding up to superintelligence. Automated interpretability seems especially likely to cause this kind of failure, even if it has some important benefits.