This “proxy fireworks” picture, where expanding the system causes the various proxies to visibly split and fly off in different directions, is definitely a good intuitive way to understand some of the AI alignment issues.
What I am wondering is whether it is possible to specify anything but a proxy, even in principle. After all, humans as general intelligences fall into the same trap of optimizing proxies (instrumental goals) instead of terminal goals all the time. We are also known for our piss-poor corrigibility properties. Certainly the maze example or a clean-room example is simple enough, but once we ramp up the complexity, odds are that actual optimization proxies will start showing up. And the goal is for an AI to be better at discerning and avoiding “wrong” proxies than any human, even though humans are apt to deny that they are stuck optimizing proxies despite evidence to the contrary. The reaction (except in PM from Michael Vassar) to an old post of mine is a pretty typical example of that. So my expectation is that we would resist even considering that we are optimizing a proxy when trying to align a corrigible AI.