This seems like a very reasonable thing for AISI alignment to focus on. It’s shovel-ready, has some good properties, probably scales to high enough intelligence that you can get useful research out of it before it breaks, and you’re correctly acknowledging up-front that this won’t be enough to handle full superintelligence.[1]
One thing I hope you’re tracking, and devoting adequate attention to, is figuring out which research or strategy questions you plan to ask your debate system in order to end the acute risk period. You don’t want to have to work that out on the fly at crunch time. And once a good version of this system is available to do alignment research, it will be available to do capabilities work too, which means you probably won’t have long before someone sets off RSI.
[1] At high enough power levels, simulated or real human judges become the weak point, even if the rest of the protocol holds up.