Dumbing Down Human Values

I want to preface everything here by acknowledging my own ignorance: I have relatively little formal training in any of the subjects this post touches on, and this chain of reasoning is very much a work in progress.

I think the question of how to encode human values into non-human decision makers is a really important research question. Whether or not one accepts the rather eschatological arguments about the intelligence explosion, the coming singularity, and so on, there seems to be tremendous interest in creating software and other artificial agents capable of making sophisticated decisions. Inasmuch as the decisions of these agents have significant potential impacts, we want those decisions to be made with some sort of moral guidance. Our approach to creating machines that preserve human values has so far relied primarily on hard-coded heuristics, e.g. saws that stop spinning if they come into contact with human skin. For very simple machines these heuristics are typically sufficient, but they constitute a very crude representation of human values.

We’re in many ways on the cusp of creating machines for which these crude representations are probably not sufficient. As a specific example, IBM’s Watson is now designing treatment programs for lung cancer patients. Designing a treatment program means striking a balance between treatment cost, patient comfort, aggressiveness in targeting the primary disease, short- and long-term side effects, secondary infections, etc. It isn’t totally clear how those trade-offs are being managed, although there’s still a substantial amount of human oversight and intervention at this point.
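To make that trade-off structure concrete (this is emphatically not how Watson works, just a toy sketch of my own), imagine a planner that scores candidate treatment plans as a weighted sum of competing objectives; the plans, attributes, and weights below are all invented for illustration:

```python
# Toy illustration of the trade-off structure in treatment planning: each plan is
# scored as a weighted sum of competing objectives. The plans, attributes, and
# weights are all invented; real systems are far more involved than this.

treatment_plans = [
    {"name": "aggressive", "cost": 0.9, "discomfort": 0.8, "tumor_control": 0.9, "side_effects": 0.7},
    {"name": "moderate",   "cost": 0.5, "discomfort": 0.4, "tumor_control": 0.7, "side_effects": 0.4},
    {"name": "palliative", "cost": 0.2, "discomfort": 0.1, "tumor_control": 0.2, "side_effects": 0.1},
]

# The crux of the value-loading problem: who sets these weights, and on what basis?
weights = {"cost": -0.2, "discomfort": -0.3, "tumor_control": 1.0, "side_effects": -0.5}

def score(plan: dict) -> float:
    """Aggregate a plan's attributes into a single number using the weights above."""
    return sum(weights[attr] * value for attr, value in plan.items() if attr in weights)

best = max(treatment_plans, key=score)
print(best["name"], round(score(best), 2))
```

The interesting question, of course, is where those weights come from and who gets to set them.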

The use of algorithms to discover human preferences is already widespread. While these typically operate in restricted domains such as entertainment recommendations, it seems at least in principle possible that with the correct algorithm and a sufficiently large corpus of data, a system not dramatically more advanced than existing technology could learn some reasonable facsimile of human values. This is probably worth doing.
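As a toy illustration of what learning such a facsimile from data might look like, here is a minimal sketch of a Bradley-Terry-style pairwise preference model, fit with logistic regression on feature differences. The outcomes, features, and choice data are all hypothetical, and a real system would need a vastly larger and messier corpus:

```python
# A toy pairwise-preference learner: given (preferred, rejected) records,
# fit a logistic model on feature differences so that score(a) - score(b)
# predicts which option a person would pick. All data here are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature vectors for outcomes (e.g. [monetary gain, pain, time cost]).
options = {
    "find_5_dollars":  np.array([5.0, 0.0, 0.0]),
    "stub_a_toe":      np.array([0.0, 2.0, 0.0]),
    "wait_in_line_1h": np.array([0.0, 0.5, 1.0]),
}

# Pairwise choices: (preferred, rejected) -- stand-ins for survey data.
choices = [
    ("find_5_dollars", "stub_a_toe"),
    ("find_5_dollars", "wait_in_line_1h"),
    ("wait_in_line_1h", "stub_a_toe"),
]

# Build a dataset of feature differences; label 1 means "first option preferred".
X, y = [], []
for winner, loser in choices:
    diff = options[winner] - options[loser]
    X.extend([diff, -diff])
    y.extend([1, 0])

model = LogisticRegression().fit(np.array(X), np.array(y))

def prefers(a: str, b: str) -> float:
    """Estimated probability that a person prefers option a over option b."""
    return model.predict_proba([options[a] - options[b]])[0, 1]

print(prefers("find_5_dollars", "stub_a_toe"))
```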

The goal would be to obtain a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb: a hyper-specialist in its problem domain (chess, or learning human values) with very little optimization power outside of that domain. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources than through any particular brilliance.

Computer chess benefited immensely from five decades of work before Deep Blue managed to win a game against Kasparov. While many of the algorithms developed for computer chess have found applications outside that domain, some of them are domain-specific. A specialist human-value-learning system may also require substantial effort on domain-specific problems. The history, competitive nature, and established ranking system of chess made it an attractive problem for computer scientists because it was relatively easy to measure progress. Perhaps the goal for a program designed to understand human values would be that it plays a convincing game of “Would you rather?”, although as far as I know no one has devised an Elo system for it.
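For what it’s worth, adapting the standard Elo update to this setting looks straightforward in principle; the hard part is the game itself, not the rating. Here is a minimal sketch in which two preference-predicting systems “play” by answering the same “Would you rather?” question, and the one matching the human respondent’s actual answer wins the round. The K-factor and starting ratings are arbitrary choices of mine, not an established benchmark:

```python
# A minimal Elo-style rating update for comparing preference-predicting systems.
# Two systems answer the same "Would you rather?" question; whichever matches the
# human respondent's actual choice "wins" the round. K and the starting ratings
# are arbitrary choices, not an established standard for this task.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that system A beats system B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one head-to-head question."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# Example: two systems start at 1500; system A predicts the human's choice correctly.
a, b = update(1500.0, 1500.0, a_won=True)
print(round(a), round(b))  # 1516 1484
```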

Similarly, a relatively dumb but more general AI may require relatively large, preferably somewhat homogeneous data sets to come to conclusions that are even acceptable. Having successive generations of AI train on the same or similar data sets could provide a useful way of tracking progress and a feedback mechanism for determining how successful various research efforts are.
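A minimal sketch of that feedback mechanism, assuming a fixed, held-out set of “Would you rather?” questions with recorded human answers; the questions and the dummy predictors standing in for successive generations are invented for illustration:

```python
# Sketch of the "same data set across generations" idea: score each generation's
# preference-predictor against one fixed, held-out question set so that results
# stay comparable over time. The questions and predictors here are stand-ins.
from typing import Callable, Dict, List, Tuple

# Each question is (option_a, option_b, recorded_human_choice).
HELD_OUT: List[Tuple[str, str, str]] = [
    ("find_5_dollars", "stub_a_toe", "find_5_dollars"),
    ("free_coffee", "free_parking", "free_parking"),
]

def accuracy(predict: Callable[[str, str], str]) -> float:
    """Fraction of held-out questions where the predictor matches the human choice."""
    correct = sum(predict(a, b) == choice for a, b, choice in HELD_OUT)
    return correct / len(HELD_OUT)

def track_generations(generations: Dict[str, Callable[[str, str], str]]) -> None:
    """Print each generation's score on the same fixed benchmark."""
    for name, predict in generations.items():
        print(f"{name}: {accuracy(predict):.2f}")

# Usage with two dummy "generations" of a predictor:
track_generations({
    "gen1": lambda a, b: a,                            # always picks the first option
    "gen2": lambda a, b: b if "parking" in b else a,   # slightly less naive
})
```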

The benefit of this research approach is that it is not only a relatively safe path towards a possible AGI; even if the speculative futures of mind uploads and superintelligences never come to pass, there’s still substantial utility in having devised a system that is capable of making correct moral decisions in limited domains. I want my self-driving car to make a much larger effort to avoid a child in the road than a plastic bag. I’d be even happier if it could distinguish between an opossum and someone’s cat.

When I design research projects, one of the things I try to ensure is that if some of my assumptions are wrong, the project fails gracefully. Obviously it’s easy to love the Pascal’s Wager-like impact statement of FAI, but if I were writing it up for an NSF grant I’d put substantially more emphasis on the importance of my research even if fully human-level AI isn’t invented for another 200 years. When I give the elevator-pitch version of FAI, I’ve found that placing a strong emphasis on the near future and referencing things people have encountered before, such as computers playing Jeopardy! or self-driving cars, makes them much more receptive to the idea of AI safety and allows me to discuss things like the potential for an unfriendly superintelligence without coming across as a crazed prophet of the end times.

I’m also just really, really curious to see how well something like Watson would perform if I gave it a bunch of sociology data and asked whether a human would rather find 5 dollars or stub a toe. There doesn’t seem to be a huge categorical difference between being able to answer the Daily Double and reasoning about human preferences, but I’ve been totally wrong about intuitive jumps that seemed much smaller than that one in the past, so it’s hard to be too confident.