However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.
Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot at taking control than at doing alignment research, I think?
Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be, so this might or might not be important. E.g. imagine a scenario where FOOM is possible but too dangerous for humans to allow. If so, that would be an astronomical alignment tax!)
Sure, maybe there’s a band of capability where you can take over but you can’t do top-human-level alignment research (and where your takeover plan doesn’t involve further capabilities development that requires alignment). It’s not the central case I’m focused on, though.
Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not.
Is the thought here that the AIs trying to take over aren’t improving their capabilities in a way that requires paying an alignment tax? E.g. if the tax refers to a comparison between (a) rushing forward on capabilities in a way that screws you on alignment vs. (b) pushing forward on capabilities in a way that preserves alignment, AIs that are fooming will want to do (b) as well (though they may have an easier time of it for other reasons). But if it refers to e.g. “humans will place handicaps on AIs that they need to ensure are aligned, including AIs they’re trying to use for alignment research, whereas rogue AIs that have freed themselves from human control will also be able to get rid of these handicaps,” then yes, that’s an advantage the rogue AIs will have (though note that they’ll still need to self-exfiltrate, etc.).