We don’t have a “good” outer function, defined over training data, such that, given an observation and an action, the function scores the action higher when that action is better given the observation. Instead, we have outer functions that favor things like good predictions and outputs that receive high scores from a human/AI overseer.
While I dislike using the framing of loss functions here, I think this claim is probably false, especially given even weak prior information about the shape of alignment solutions. This might turn out to be a crux, but I expect that rewarding AIs for bad actions will likely be rare, at least in the regime where we can supervise things. In particular, I think a hypothetical alignment scheme via an outer function would look like this:
1. Place a weak prior over goal space, so that there is already a bias toward, say, being helpful.
2. Acting as the innate reward system ourselves, use backpropagation to compute the optimal update direction toward being helpful, or really toward any criterion we can specify.
3. Repeatedly reinforce preferred values, and withhold reward from (or penalize) dispreferred values, with backpropagation until the loss is at or near its minimum.

After millions of SGD iterations of that loop, you can get a very aligned agent.
This is roughly how I believe the innate reward system manages to align us with values like empathy for the in-group, but we could replace the backprop algorithm with bio-realistic learning algorithms, and we could replace the values with mostly arbitrary values, and get the same results.
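To make the loop concrete, here is a minimal sketch in PyTorch. Everything in it is illustrative and not from the original comment: the toy policy network, the hypothetical `overseer_score` function standing in for the human/AI overseer's reward signal, and the frozen reference network with an L2 penalty standing in for the "weak prior over goal space." It shows the shape of the scheme under those assumptions, not a claimed implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

OBS_DIM, ACT_DIM = 8, 4

# Toy policy mapping observations to action logits.
policy = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, ACT_DIM))

# Step 1: the "weak prior over goal space", modeled here (an assumption)
# as a frozen reference network the policy is softly regularized toward.
prior = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, ACT_DIM))
for p in prior.parameters():
    p.requires_grad_(False)

def overseer_score(action_logits):
    # Hypothetical overseer: prefers action 0 (the "helpful" action).
    probs = torch.softmax(action_logits, dim=-1)
    return probs[..., 0]

opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

for step in range(10_000):  # stand-in for "millions of iterations"
    obs = torch.randn(64, OBS_DIM)
    logits = policy(obs)

    # Step 2: we play the reward system; backprop gives the direction
    # that most increases the overseer's score.
    reward = overseer_score(logits).mean()

    # Step 3: reinforce preferred behavior; dispreferred behavior earns
    # no reward, and drifting far from the prior is softly penalized.
    drift_from_prior = ((logits - prior(obs)) ** 2).mean()
    loss = -reward + 0.01 * drift_from_prior

    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design choice worth flagging is the prior term: without it, the policy just maximizes the overseer's score, which is exactly the regime where reward misspecification bites, so the weak prior is doing the "bias toward being helpful" work in this sketch.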