But if you had asked us back then whether a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish all of these past conversations were archived in a common place, so that I could search them and show you the many pieces of text discussing this critical divide between prediction and preference (as I would now term it), and how I did in fact expect superintelligences to be able to predict things!
“MIRI’s argument for AI risk depended on AIs being bad at natural language” is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.
The example does build in the assumption “this outcome pump is bad at NLP”, but this isn’t a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.
It’s true that Eliezer and I didn’t predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.
But the specific update “AI is good at NLP, therefore alignment is easy” requires that there be an old belief like “a big part of why alignment looks hard is that we’re so bad at NLP”.
If that had ever been a belief here, it should be easy to find someone at MIRI, like Eliezer or Nate, saying so at some point in the last 20 years. Absent that, the obvious explanation for why we never just said it is that we didn’t believe it!
Found another example: MIRI’s first technical research agenda, in 2014, went out of its way to clarify that the problem isn’t “AI is bad at NLP”.
Quoting myself in April: