I also worry a lot that this is capabilities in disguise, whether or not there is any intent for that to happen. Building a human-level alignment researcher sounds a lot like building a human-level intelligence, exactly the thing that would most advance capabilities, in exactly the area where it would be most dangerous. One more than a little implies the other. Are we sure that is not the important part of what is happening here?
Yeah, that’s been my knee-jerk reaction as well.
Over this year, OpenAI has really been growing in my eyes — they released that initial statement completely unprompted, they invited ARC to run alignment tests on GPT-4 prior to its release, they ran that mass-automated-interpretability experiment, they argued for fairly appropriate regulations at the congressional hearing, Sam Altman confirmed he doesn’t think he can hide from an AGI in a bunker, and now this. You certainly get a strong impression that OpenAI is taking the problem seriously. Out of all the companies recklessly racing to build a doomsday device we could’ve gotten, OpenAI may be the best possible one.
But then they say something like this, and it reminds you that, oh yeah, they are still recklessly racing to build a doomsday device...
I really want to be charitable towards them; I don’t think any of that is just cynical PR, or even safety-washing. But:
Presenting these sorts of """alignment plans""" makes it really hard not to think that.
Not even because P(this plan | OpenAI honestly care) is negligible — it’s imaginable that they really believe it can work, and even I am not totally certain none of the variants of this plan can work.
It’s just that P(this plan|OpenAI are just safety-washing) is really high. This is exactly the plan I’d expect to see if none of their safety concerns were for real.
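To spell out the Bayesian reasoning implicit here (with made-up illustrative numbers, not anything from the post): what moves the needle is the likelihood ratio between the two hypotheses, not either probability on its own.

\[
\frac{P(\text{washing} \mid \text{plan})}{P(\text{honest} \mid \text{plan})} = \frac{P(\text{plan} \mid \text{washing})}{P(\text{plan} \mid \text{honest})} \cdot \frac{P(\text{washing})}{P(\text{honest})}
\]

So even granting, say, \(P(\text{plan} \mid \text{honest}) = 0.3\) (not negligible), if \(P(\text{plan} \mid \text{washing}) = 0.9\), then observing this plan still multiplies the odds in favor of safety-washing by a factor of 3.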
Even if they’re honest and earnest, none of that is going to matter if they just honestly misjudge what the correct approach is.
Nice statements, resource commitments, regulation proposals, etc., are all ultimately just fluff. Costly fluff, not useless fluff, but fluff nonetheless. How you actually plan to solve the alignment problem is the thing that actually matters — and OpenAI hasn’t yet shown any indication they’d actually be able to pivot away from their doomed chosen approach. Or even recognize it as a doomed one. (Indeed, the answer to Eliezer’s “how will you know you’re failing?” is not reassuring.)
Which may make everything else a moot point.