William_S comments on Can corrigibility be learned safely?

William_S 4 Apr 2018 15:45 UTC
LW: 9 AF: 5
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (i.e. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can seem like plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, i.e. the experiments that Ought is running.)
- Wei Dai 5 Apr 2018 6:53 UTC
  LW: 9 AF: 4
  
  “Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (i.e. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand)”
  
  I guess this is in part because that’s how Paul initially described his approach, before coming up with Security Amplification in October 2016. For example, in March 2016 I wrote, “First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset. Let me know if this is wrong.” Paul didn’t object to this in his reply.
  
  An additional issue is that even people who intellectually understand the new model might still have intuitions left over from the old one. For example, I’m just now realizing that the low-amplification agents in the new scheme must have thought processes and “deliberations” that are very alien, since they don’t have human priors, natural language understanding, values, common sense judgment, etc. I wish Paul had written a post in big letters that said, “WARNING: Throw out all your old intuitions!”