I agree with some of this, although I’m doubtful that the transition from sub-AGI to AGI is as sharp as outlined. I don’t think that’s impossible though, and I’d rather not take the risk. I do think it’s possible to dumb down an AGI if you still have enough control over it to do things like inject noise into its activations between layers...
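For concreteness, here's a rough sketch of the kind of "inject noise into its activations" intervention I have in mind, assuming a PyTorch transformer whose layer modules you can enumerate (the helper name, the GPT-2-style `model.transformer.h` path, and the noise scale are purely illustrative, not a tested recipe):

```python
import torch

def add_activation_noise(layers, noise_scale=0.1):
    """Register forward hooks that add Gaussian noise to each layer's output hidden states."""
    handles = []
    for layer in layers:
        def hook(module, inputs, output, scale=noise_scale):
            # Many transformer blocks return a tuple whose first element is the hidden states.
            hidden = output[0] if isinstance(output, tuple) else output
            noisy = hidden + scale * torch.randn_like(hidden)
            if isinstance(output, tuple):
                return (noisy,) + output[1:]
            return noisy
        handles.append(layer.register_forward_hook(hook))
    return handles

# Hypothetical usage with a GPT-2-style model:
# handles = add_activation_noise(model.transformer.h, noise_scale=0.05)
# ...evaluate the degraded model...
# for h in handles: h.remove()   # removing the hooks restores the original behavior
```

Turning the noise scale up or down gives a crude dial on how much capability you sacrifice, and removing the hooks restores the model, which is the kind of reversible control I mean.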
I’m hopeful that we can solve alignment iff we can contain and study a true AGI. Here’s a comment I wrote on another post about the assumptions which give me hope we might manage alignment...
It seems to me like one of the cruxes is that there is this rough, approximate alignment that we can currently do. It's rough in the sense that it's spotty, not covering all cases. It's approximate in that it's imprecise and doesn't seem to work perfectly even in the cases it does cover.
The crux is whether the forecaster expects this rough, approximate alignment to get easier and more effective as the model gets more capable (because the model understands what we want better), or harder (because the model crosses certain skill thresholds relating to self-awareness and awareness of instrumental goals).
I am in the camp that this will get harder as the model gets more capable. If I were in the 'gets easier' camp, my views would be substantially closer to the more optimistic views of Rohin, Quintin Pope, and Alex Turner.
I am, however, a bit more optimistic than Connor, I think. My optimism hinges on a different crux, one which has come up multiple times when discussing this with less optimistic people whose views are closer to Connor's, Eliezer's, or Nate Soares'.
This crux, which gives me an unusual amount of optimism, depends on three hopes.
First is that I believe it is possible to safely contain a slightly-superintelligent AGI in a carefully designed, censored training simulation on a high-security compute cluster.
Second is that I also think we will get non-extinction-level near-misses before we have a successfully deceptive AGI, and that these will convince the leading AI labs to start using more thorough safety precautions. I think a lot of smart people are currently in the "I'll believe it when I see it" camp about AGI risk. My hope is that they will change their minds and behaviors quickly once they do see real-world impacts.
Third is that we can do useful alignment experimentation on the contained slightly-superhuman AGI without either accidentally releasing it or fooling ourselves into thinking we've fixed the danger when we haven't. This gives us the opportunity to iterate gradually and safely towards success.
Obviously, all three of these are necessary for my story of an optimistic future to go well. A failure of one renders the other two moot.
Note that I expect an adequate social response would include bureaucratic controls sufficient to prevent reckless experimentation on the part of monkeys overly fascinated by the power of the poisoned banana.