I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?
I had not written about it, but I had talked about it before your posts. If I remember correctly, I started finding the concept of distributional shifts very useful and applying it to everything around May of this year. Of course, I had been thinking about it recently because of your posts, so I was more primed to bring it up during the podcast.
so “let alone respond to everything” doesn’t seem to make much sense here.
Fair point; this was more me noticing how much time I need to set aside for discussion here (which I think is valuable! But it is a time sink).
I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.
Yeah, I think on balance I agree. But explicit human reasoning has generalized well to environments far outside of the ancestral environment, so this case is not as clear/obvious. (Adversarial examples in current ML models can be thought of as failures of generalization very close to the training distribution.)
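To make that parenthetical concrete, here is a minimal sketch of the standard fast gradient sign method (FGSM) for crafting an adversarial example. The tiny model, image, and label are untrained stand-ins purely for illustration; the point is that the perturbed input differs from the original by at most epsilon per pixel, i.e. it stays very close to the data the model would (hypothetically) have been trained on.

```python
# Minimal FGSM sketch: perturb the input by epsilon in the direction that
# increases the loss. Model, image, and label are untrained stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # hypothetical classifier
x = torch.rand(1, 1, 28, 28)   # stand-in for a real image
y = torch.tensor([3])          # stand-in label
epsilon = 0.03                 # small per-pixel perturbation budget

x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()

# One-step perturbation, clamped back to valid pixel range.
x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

print((x_adv - x).abs().max())                    # <= epsilon: the input barely changed
print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # predictions may now differ
```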
(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)
Not exactly, but here are some related things I've picked up from talking to people (I don't have citations for them):
Adversarial examples tend to transfer across different deep learning architectures, so ensembles mostly don't help.
There’s been work on training a system that can detect adversarial perturbations, separately from the model that classifies images.
There has been work on using non-deep learning methods for image classification to avoid adversarial examples. I believe approaches that are similar-ish to nearest neighbors tend to be more robust. I don’t know if anyone has tried combining these with deep learning methods to get the best of both worlds.
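As a rough illustration of what such a combination might look like (a hypothetical sketch, not a description of any particular published defense), one could classify with k-nearest neighbors over a deep network's learned features instead of the network's own softmax head. The feature extractor and data below are untrained stand-ins.

```python
# Hedged sketch: k-NN classification over deep features rather than the
# network's final layer. Feature extractor and data are untrained stand-ins.
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

feature_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())

# Hypothetical "training set" of images and labels.
train_x = torch.rand(100, 1, 28, 28)
train_y = torch.randint(0, 10, (100,))

with torch.no_grad():
    train_feats = feature_net(train_x).numpy()

knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_y.numpy())

# At test time, classify a query image by its nearest training examples in
# feature space rather than by the network's own output head.
test_x = torch.rand(1, 1, 28, 28)
with torch.no_grad():
    pred = knn.predict(feature_net(test_x).numpy())
print(pred)
```

The hope in this sort of hybrid is that the nearest-neighbor step inherits some of the robustness of memory-based methods while still benefiting from learned features, though whether that actually works is an empirical question.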
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?
Agreed, but I think people are focusing on their own research agendas in public writing, since public writing is expensive. I wouldn’t engage as much as I do if I weren’t writing the newsletter. By default, if you agree with a post, you say nothing, and if you disagree, you leave a comment. So generally I take silence as a weak signal that people agree with the post.
This is primarily about public writing; if you talked to researchers privately and in person, I would guess that most of them would explicitly agree that this is a problem worth thinking about. (I'm not sure they'd agree that the problem definitely exists, more that it's sufficiently plausible that we should think about it.)