As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), improving their understanding of human values becomes a convergent instrumental strategy for them: obviously, the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, suppose the AGI's capabilities are large, so that many of the things it can do lie outside the region of validity of its initial model of human values, and suppose it also understands the concept of a model's region of validity (a rather basic capability, obviously required for an AGI that can do research, so this seems like a reasonable assumption). Then it can't use most of its capabilities safely, and solving that problem obviously becomes its top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.
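To make the instrumental-convergence point concrete, here's a minimal toy sketch (my own illustration, not from the original argument): an agent that is uncertain which of several candidate utility functions humans actually want it to optimize gains expected utility by first resolving that uncertainty before acting. The candidate utilities, actions, and prior below are all made-up placeholders.

```python
# Toy value-of-information calculation: an agent uncertain about the "true"
# human utility function does better by learning it first, so learning human
# values is instrumentally useful even from the agent's own perspective.

# Hypothetical candidate human utility functions over three possible actions.
candidate_utilities = {
    "u_conservative": {"act_A": 1.0, "act_B": 0.0, "act_C": -10.0},
    "u_ambitious":    {"act_A": 0.0, "act_B": 2.0, "act_C": -10.0},
    "u_weird":        {"act_A": -5.0, "act_B": -5.0, "act_C": 3.0},
}
prior = {"u_conservative": 0.4, "u_ambitious": 0.4, "u_weird": 0.2}
actions = ["act_A", "act_B", "act_C"]

def expected_utility(action, belief):
    """Expected utility of an action under the agent's belief over utilities."""
    return sum(p * candidate_utilities[u][action] for u, p in belief.items())

# Best the agent can do acting immediately under its uncertain belief.
best_now = max(expected_utility(a, prior) for a in actions)

# Best the agent can do if it first learns which candidate is correct
# (perfect information), then picks the best action for that utility function.
best_after_learning = sum(
    p * max(candidate_utilities[u][a] for a in actions)
    for u, p in prior.items()
)

print(f"Expected utility acting now:             {best_now:.2f}")
print(f"Expected utility after learning values:  {best_after_learning:.2f}")
print(f"Value of information about human values: {best_after_learning - best_now:.2f}")
```

Whenever the candidate utility functions disagree about which action is best, the value of information is strictly positive, which is the toy version of "improving its model of human values is a convergent instrumental strategy."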
In that situation, a fast takeoff should just get you an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as one particular problem I'd feel more comfortable having solved by a human alignment researcher than by an AGI one.
If we solve the alignment problem, then we solve the alignment problem.
I agree with this true statement.
If we can solve enough of the alignment problem, the rest gets solved for us.
If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough for it not to kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the 'sharp left turn' is happening, especially if it's also going Foom. So with value learning, there is a region of convergence around alignment.
Or, to reuse one of Eliezer's metaphors: if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.