Koen.Holtman comments on Collection of arguments to expect (outer and inner) alignment failure?

Koen.Holtman 9 Oct 2021 11:21 UTC
LW: 2 AF: 2
AF

Not really, unfortunately. In those posts [under the threat models tag], the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place.

I feel that Christiano’s post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.

There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this way in handling the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is not news to you.

Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience of analysing polling results with this type of response rate, I would be surprised, however, if you could find a clear signal above the noise floor.

However [...] it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.

Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.