Wei Dai comments on Will humans build goal-directed agents?

Wei Dai 6 Jan 2019 16:32 UTC
For the first one, I guess I would use “argument for defense against value drift” instead, since you could conceivably use a goal-directed AI to defend against value drift without lock-in, e.g., by doing something like Paul Christiano’s 2012 version of indirect normativity. (I don’t think that version is feasible, but maybe there’s something like it that is, such as my hybrid approach, if you consider that goal-directed.)

For the third one, I guess interpretability is part of it, but a bigger problem is that it seems hard to make a sufficiently trustworthy human overseer even if we could “interpret” them. In other words, interpretability for a human might just let us see exactly why we shouldn’t trust them.