A slightly sideways argument for interpretability: It’s a really good way to introduce the importance and tractability of alignment research
In my experience it's very easy to explain to someone with no technical background that:

- Image classifiers have got much, much better (in about 10 years they went from barely working to something you can run on your laptop)
- We still don't really understand why they do what they do (we can't say why the classifier calls this an image of a cat, even when it's right)
- But, thanks to dedicated research, we have begun to understand a bit of what's going on inside the black box (e.g. we know the network represents the concept of a curve, and we can tell when it thinks it sees one)
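That "we can tell when it thinks it sees a curve" point can be made concrete with a toy sketch. This is purely illustrative (a hand-built edge filter, not the actual curve-detector circuits found in real classifiers): convolving a small feature detector over an image and reading off its activation map is the miniature version of checking when a unit inside a network "fires".

```python
import numpy as np

# Toy image: dark left half, bright right half -- a vertical edge at column 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-built 3x3 "vertical edge detector" kernel: responds when brightness
# increases left-to-right. Real classifiers learn kernels like this; the
# interpretability move is reading their activations back out.
kernel = np.array([
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
])

def conv2d_valid(img, k):
    """Plain 'valid' 2D convolution (no padding), written out explicitly."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

act = conv2d_valid(image, kernel)

# The detector "fires" (large activation) only where its window straddles
# the edge, and stays silent over the flat regions.
print(act.max())        # strongest response, located on the edge
print(act[:, 0])        # zero response far from the edge
```

The same idea, scaled up to learned features and real networks, is what lets researchers say "this unit is a curve detector, and here is where in the image it activated".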
Then you say: 'this is the same kind of system that big companies use to maximise your engagement on social media and sell you stuff, and look at how that's going. And by the way, have you noticed how AIs keep getting bigger and stronger?'
At this point, in my experience, it's very easy for people to understand why alignment matters and also what kind of thing you can actually do about it.
Compare this to trying to explain why people are worried about mesa-optimisers, boxed oracles, or even the ELK problem, and it’s a lot less concrete. People seem to approach it much more like a thought experiment and less like an ongoing problem, and it’s harder to grasp why ‘developing better regularisers’ might be a meaningful goal.
But interpretability gives people a non-technical story for how alignment affects their lives, the scale of the problem, and how progress can be made. IMO no other approach to alignment is anywhere near as good for this.