What semi-supervised learning and transfer learning have in common is this: you find a learning problem you have a lot of data for, such that training a learner on that problem will incidentally cause it to develop generally useful computational structures (often people say "features", but I'm trying to take a more open-ended philosophical view). Then you re-use those computational structures in a supervised learning context to solve a problem you don't have much data for.
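For concreteness, here's a minimal sketch of the pattern I have in mind (PyTorch, with purely synthetic data standing in for the data-rich and data-poor problems; the sizes and architecture are arbitrary placeholders, not a recommendation):

```python
# Pre-train on a data-rich task, then re-use the learned representation
# ("computational structures") for a data-poor supervised task.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Data-rich pre-training task (synthetic stand-in).
X_big, y_big = torch.randn(10_000, 32), torch.randint(0, 10, (10_000,))
# Data-poor target task (the thing you actually care about).
X_small, y_small = torch.randn(50, 32), torch.randint(0, 2, (50,))

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
pretrain_head = nn.Linear(64, 10)

# Pre-train backbone + head on the big task.
opt = torch.optim.Adam(list(backbone.parameters()) + list(pretrain_head.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(pretrain_head(backbone(X_big)), y_big)
    loss.backward()
    opt.step()

# Re-use: freeze the backbone's computational structures and fit only a
# new head on the small dataset.
for p in backbone.parameters():
    p.requires_grad_(False)
target_head = nn.Linear(64, 2)
opt = torch.optim.Adam(target_head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(target_head(backbone(X_small)), y_small)
    loss.backward()
    opt.step()
```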
From an AI safety perspective, there are a couple of obvious ways this could fail:
Training a learner on the data-rich problem might cause it to develop the wrong computational structures. (Example: GPT-3 learns a meaning of the word "love" which is subtly incorrect.)
While attempting to re-use the computational structures, you end up pinpointing the wrong one, even though the right one exists. (Example: computational structures for both "Effective Altruism" and "maximize # grandchildren" have been learned correctly, but the (x, y) pairs you provide, which are supposed to indicate human values, don't differentiate between the two, and your system arbitrarily chooses "maximize # grandchildren" when what you really wanted was "Effective Altruism".)
I don't think this post makes a good argument that we should expect the second problem to be harder in general. Note, for example, that it's not too hard to have your system figure out where the "Effective Altruism" and "maximize # grandchildren" theories of how the (x, y) pairs arose differ, and query you on those specific data points ("active learning" has 62,000 results on Google Scholar).
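To illustrate the flavor of that idea (a toy sketch only, nothing like the real scale of the problem), here's a query-by-disagreement example in scikit-learn. The two regularization settings stand in for the two competing theories, and `ask_human` is a hypothetical placeholder for querying the human:

```python
# Given two candidate models that both fit the labeled (x, y) pairs,
# actively query the human on the points where they disagree most.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small, ambiguous labeled dataset.
X_labeled = rng.normal(size=(20, 5))
y_labeled = (X_labeled[:, 0] > np.median(X_labeled[:, 0])).astype(int)

# Two hypotheses consistent with the labels (standing in for
# "Effective Altruism" vs "maximize # grandchildren").
model_a = LogisticRegression(C=100.0).fit(X_labeled, y_labeled)
model_b = LogisticRegression(C=0.01).fit(X_labeled, y_labeled)

# Large pool of unlabeled candidate scenarios.
X_pool = rng.normal(size=(1000, 5))

# Find the scenarios the two hypotheses disagree about most.
disagreement = np.abs(model_a.predict_proba(X_pool)[:, 1]
                      - model_b.predict_proba(X_pool)[:, 1])
query_idx = np.argsort(disagreement)[-5:]  # top-5 most informative points

def ask_human(x):
    """Hypothetical oracle: in reality, a query to the human about scenario x."""
    return int(x[0] > 0)

new_labels = [ask_human(x) for x in X_pool[query_idx]]
```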
Incidentally, I'm most worried about non-obvious failure modes; I expect obvious failure modes to get a lot of attention. (As an example of a non-obvious thing that could go wrong, imagine a hypothetical super-advanced AI that queries you on some super-enticing scenario where you become global dictator, in order to figure out whether the (x, y) pairs it's trying to predict correspond to a person who outwardly behaves in an altruistic way but is secretly an egoist who will succumb to temptation if the temptation is sufficiently strong. In my opinion the key problem is to catalogue all the non-obvious ways in which things could fail like this.)
This is almost, but not quite, the division of failure modes that I see as relevant. If my other response doesn't clarify sufficiently, let me know and I'll write more of a response here.