AI Control is fine below and perhaps even up to AGI. I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control strategies really do need to be as smart as the thing you’re trying to control. In very simple situations, you can use cryptographically strong techniques, but in realistic AI Control tasks, the attack surface is so large and so complex that something that can understand it better than you can has a huge tactical advantage.
I see Corrigibility as very different from Control. Building a very corrigible AI is likely a feasible technical approach to AI Alignment. My issues with it are primarily:
a) corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach. This is not necessarily an insoluble problem, if the AI can distinguish what’s out-of-distribution and act suitably cautiously: Seth Herd’s “Do What I Mean (and Check)” corrigible alignment is basically this.
b) it is very, very easy for multiple groups of humans, each with access to corrigible ASI, to get into a war or other form of conflict using ASI-powered weapons/technologies. It is also very easy for a small powerful group to use very Corrigible AI to greatly concentrate power. Both of these are sources of X-Risk/Suffering-Risk separate from simple misalignment, and both are very serious. Dario Amodei’s writing, and indeed Claude’s Constitution, make it clear that Anthropic takes this risk as seriously as it does misalignment X-risk, and I completely agree with them. I think people on LessWrong and in the Alignment Community generally need to consider this problem more than they often seem to. ASI-generated technology is going to be very powerful, and is thus going to need to be used very wisely, even when it has appeared rapidly. Highly Corrigible AI is much less likely than Value Learning AI to push back on the imprudent ideas of whoever is operating/controlling it.
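The Goodharting-out-of-distribution worry in (a) can be made concrete with a toy sketch (my own illustration, not from this discussion; the value functions and ranges are invented for demonstration): a proxy objective fitted only to in-distribution data looks fine on the training range, but optimizing it hard over a much wider range selects an extreme point where proxy and true value have sharply diverged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true value": rises up to x = 1, then falls off quadratically.
def true_value(x):
    return x - 0.8 * np.maximum(0.0, x - 1.0) ** 2

# Proxy learned from in-distribution data (x in [0, 1]): a linear fit,
# which matches the true value essentially perfectly on the training range.
xs_train = rng.uniform(0.0, 1.0, 200)
coeffs = np.polyfit(xs_train, true_value(xs_train), 1)
proxy = np.poly1d(coeffs)

# Optimizing the proxy over a much wider range (out of distribution)
# picks an extreme x where proxy and true value have diverged.
xs_wide = np.linspace(0.0, 10.0, 1001)
x_star = xs_wide[np.argmax(proxy(xs_wide))]

print(f"proxy-optimal x = {x_star:.1f}")                      # far outside [0, 1]
print(f"true value there = {true_value(x_star):.2f}")         # strongly negative
print(f"true optimum value = {true_value(xs_wide).max():.2f}")
```

The proxy is a faithful model of value everywhere it was trained, yet maximizing it yields a badly negative true outcome; cautious behavior when off-distribution (as in “Do What I Mean (and Check)”) is one response to exactly this failure mode.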
Tool AI isn’t the direction in which the market and demand are currently moving, and it has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
So I agree that what you describe are approaches often outlined for AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a major new source of X-Risk/S-Risk from AI, so not solving AI Safety.
I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control strategies really do need to be as smart as the thing you’re trying to control.
There’s a very basic difference between the people who believe in SLTs, rapid RSI, etc. and those who don’t, and it affects their unspoken assumptions and semantics. The fact that it affects their semantics is a problem.
corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach
I don’t see why.
b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies
Agreed. I didn’t say so explicitly, but I was mainly concerned with Everybody Dies scenarios. I think a multipolar scenario where ASIs are controllable and controlled by powerful interests is highly likely, but not completely fatal.
Tool AI isn’t the direction in which the market and demand are currently moving, and it has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.
Ok, but that’s a different complaint from “it’s not even possible”. Also, the market is for agents that work for you, doing your things rather than their own. That’s a point against the standard Doom argument of a sovereign AI killing everyone for its own reasons.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
And a way of forcing people to use it. If merely controllable/corrigible AI is available, powerful interests are going to prefer it.
So I agree that what you describe are approaches often outlined for AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a major new source of X-Risk/S-Risk from AI, so not solving AI Safety.
Neither alignment nor safety is a simple binary.
Sounds like we’re mostly in agreement!