You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.
I’m curious: could you name some examples, from your perspective?
I’m just curious. I don’t think this is too cruxy; I think the cruxier part is how hard it is to safely automate top-human-level alignment research, not whether there are further difficulties after that.
…Well, actually, I’m not so sure. I feel like I’m confused about what “safely automating top-human-level alignment research” actually means. You say that it’s less than “handoff”. But if humans are still a required part of the ongoing process, then it’s not really “automation”, right? And likewise, if humans are a required part of the process, then an alignment MVP is insufficient for the ability to turn tons of compute into tons of alignment research really fast, which you seem to need for your argument.
You also talk elsewhere about “performing more limited tasks aimed at shorter-term targets”, which seems to directly contradict “performs all the cognitive tasks involved in alignment research at or above the level of top human experts”, since one such cognitive task is “making sure that all the pieces are coming together into a coherent viable plan”. Right?
Honestly I’m mildly concerned that an unintentional shell game might be going on regarding which alignment work is happening before vs. after the alignment MVP.
Relatedly, this sentence seems like an important crux where I disagree: “I am cautiously optimistic that for building an alignment MVP, major conceptual advances that can’t be evaluated via their empirical predictions are not required.” But again, that might be that I’m envisioning a more capable alignment MVP than you are.
Re: examples of why superintelligences create distinctive challenges: superintelligences seem more likely to be schemers, more likely to be able to systematically and successfully mess with the evidence provided by behavioral tests and transparency tools, harder to exert option-control over, better able to identify and pursue strategies humans hadn’t thought of, harder to supervise using human labor, etc.
If you’re worried about shell games, it’s OK to round off “alignment MVP” to hand-off-ready AI, and to assume that the AIs in question need to be able to make and pursue coherent long-term plans.[1] I don’t think the analysis in the essay changes that much (for example, I think very little rests on the idea that you can get by with myopic AIs), and it’s better to err on the side of conservatism.
I wanted to set aside “hand-off” here because, in principle, you don’t actually need to hand off until humans stop being able to meaningfully contribute to the safety/quality of the automated alignment work, and that point doesn’t necessarily arrive around the time we have AIs capable of top-human-level alignment work (e.g., human evaluation of the research, or involvement in other aspects of control, such as providing certain kinds of expensive, trusted supervision, could persist after that). And when exactly you hand off depends on a bunch of more detailed, practical trade-offs.
As I said in the post, one way that humans still being involved might not bottleneck the process is if they’re only reviewing the work to figure out whether there’s a problem they need to actively intervene on:

“even if human labor is still playing a role in ensuring safety, it doesn’t necessarily need to directly bottleneck the research process – or at least, not if things are going well. For example: in principle, you could allow a fully-automated alignment research process to proceed forward, with humans evaluating the work as it gets produced, but only actively intervening if they identify problems.”
And I think you can still likely radically speed up and scale up your alignment research even if, e.g., you still care about humans reviewing and understanding the work in question.
Though for what it’s worth, I don’t think that the task of assessing “is this a good long-term plan for achieving X” itself needs to involve long-term optimization for X. For example: you could do that task over five minutes in exchange for a piece of candy.