Rohin Shah comments on Will humans build goal-directed agents?

Rohin Shah 6 Jan 2019 11:55 UTC
LW: 2 AF: 1
As I understand it, the first one is an argument for value lock-in, and the third one is an argument for interpretability. Does that seem right to you?
- Wei Dai 6 Jan 2019 16:32 UTC
  LW: 3 AF: 1
  For the first one, I guess I would say “argument for defense against value drift” instead, since you could conceivably use a goal-directed AI to defend against value drift without lock-in, e.g., by doing something like Paul Christiano’s 2012 version of indirect normativity (which I don’t think is feasible, but maybe there’s something like it that is, such as my hybrid approach, if you consider that goal-directed).
  
  For the third one, I guess interpretability is part of it, but a bigger problem is that it seems hard to make a sufficiently trustworthy human overseer even if we could “interpret” them. In other words, interpretability for a human might just let us see exactly why we shouldn’t trust them.