It doesn’t seem likely to me that the form of the algorithm-to-be-approximated will suggest a method of approximation that could plausibly be competitive.
Aren’t there lots of approximation algorithms that are specific to the problems whose exact solutions they’re trying to approximate? Is there a reason to think that’s unlikely in this case?
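For a concrete instance of the kind of problem-specific scheme the question has in mind (my illustration, not something from the thread): the classic greedy algorithm for minimum vertex cover, whose 2-approximation guarantee comes directly from the structure of the exact problem. A minimal Python sketch:

```python
def vertex_cover_2approx(edges):
    """Greedy 2-approximation for minimum vertex cover.

    Scans the edges and, whenever one is uncovered, adds BOTH of its
    endpoints to the cover. The picked edges form a matching, and any
    optimal cover must contain at least one endpoint of each matching
    edge, so the result is at most twice the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Example: a 4-cycle. The optimum cover has size 2; the greedy
# algorithm returns a cover of size at most 4.
print(sorted(vertex_cover_2approx([(1, 2), (2, 3), (3, 4), (4, 1)])))
```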
But in this case we want to be competitive with a particular algorithm (deep RL, evolution, whatever), so we need to find an approximation that is able to leverage the power of the algorithm we want to compete with.
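One way to read “leverage the power of the algorithm we want to compete with” is distillation: query the expensive algorithm as a teacher and fit a cheap student to imitate its outputs. A toy sketch under that assumption (the teacher function and the polynomial student are placeholders of mine, not anything proposed in the discussion):

```python
import numpy as np

# Toy "expensive" target algorithm we want to be competitive with.
# Stand-in for deep RL / evolution: anything costly but queryable.
def expensive_teacher(x):
    return np.sin(3 * x) + 0.5 * x

# Distillation: query the teacher for labels, then fit a cheap student
# (polynomial least squares) that imitates it.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 500)
ys = expensive_teacher(xs)  # the only place the costly algorithm runs

coeffs = np.polyfit(xs, ys, deg=7)  # cheap student model
student = np.poly1d(coeffs)

test = np.linspace(-1, 1, 5)
print("teacher:", expensive_teacher(test))
print("student:", student(test))
```

The point of the sketch is only that the approximation’s quality is inherited from the teacher it queries, rather than derived from the form of the exact algorithm alone.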
I don’t think that e.g. decision theory or naturalized induction (or most other past/current MIRI work) is a good angle of attack on this problem, because a successful system needs to be able to defer that kind of thinking to have any chance, and should instead be doing something more like metaphilosophy and deference.

I’ve criticized MIRI for similar reasons in the past, but their current goal is to implement a task-directed AGI and use it to stabilize the world and then solve the remaining AI alignment problems at leisure, which makes it more understandable why they’re not researching metaphilosophy at the moment. It seems like a very long shot to me, but so do other AI alignment approaches, which makes me not inclined to try to push them to change direction. I think it makes more sense to try to get additional resources and work on the different approaches in parallel.
(As I understand it, MIRI’s strategy requires trying to leapfrog mainstream AI, which would rule out using an approximation scheme that is at best only competitive with it.)