Sure, maybe there’s a band of capability where you can take over but you can’t do top-human-level alignment research (and where your takeover plan doesn’t involve further capabilities development that requires alignment). It’s not the central case I’m focused on, though.
Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not.
Is the thought here that the AIs attempting takeover aren’t improving their capabilities in a way that requires paying an alignment tax? E.g., if the tax refers to a comparison between (a) rushing forward on capabilities in a way that screws you on alignment vs. (b) pushing forward on capabilities in a way that preserves alignment, then AIs that are fooming will want to do (b) as well (though they may have an easier time of it for other reasons). But if it refers to something like “humans will place handicaps on AIs whose alignment they need to ensure, including AIs they’re trying to use for alignment research, whereas rogue AIs that have freed themselves from human control will be able to get rid of these handicaps,” then yes, that’s an advantage the rogue AIs will have (though note that they’ll still need to self-exfiltrate, etc.).