Loopholes: A window into value alignment and the communication of meaning
Abstract. Intentional misunderstandings take advantage of the ambiguity of language to do what someone said, instead of what they actually wanted. These purposeful misconstruals or loopholes are a familiar facet of fable, law, and everyday life. Engaging with loopholes requires a nuanced understanding of goals (your own and those of others), ambiguity, and social alignment. As such, loopholes provide a unique window into the normal operations of cooperation and communication. Despite their pervasiveness and utility in social interaction, research on loophole behavior is scarce. Here, we combine a theoretical analysis with empirical data to give a framework of loophole behavior. We first establish that loopholes are widespread, and exploited most often in equal or subordinate relationships (Study 1). We show that people reliably distinguish loophole behavior from both compliance and non-compliance (Study 2), and that people predict that others are most likely to exploit loopholes when their goals are in conflict with their social partner’s and there is a cost for non-compliance (Study 3). We discuss these findings in light of other computational frameworks for communication and joint-planning, as well as discuss how loophole behavior might develop and the implications of this work for human–machine alignment.
Introduction
At the height of the Russian revolution of 1917, several thousand Vyborg mill-workers found themselves face-to-face with a Cossack cavalry formation. It was a tense moment. Twelve years earlier a similar standoff ended in bloodshed. This time, when the officers commanded the cavalry to block the marchers, the Cossacks complied perfectly: The cavalry arranged their horses into a blockade, and then stayed still, just as they had been told. They remained un-moving, as the protesters, realizing the cavalry’s intent, ducked under the horses and carried on marching (Miéville, 2017). The cavalry knew exactly what their officers meant, but instead of doing what was wanted, they did what they were told.
Intentional misunderstandings, or loopholes, are a familiar phenomenon in human society and culture. Loopholes have been exploited throughout history by people loath to comply with a directive and unwilling to risk outright disobedience (Scott, 1985). In law, there is perennial concern with “malicious compliance”, and with distinguishing form from substance, text from purpose, and the letter of the law from the spirit of the law (Fuller, 1957, Isenbergh, 1982, Katz, 2010). In art and fable, there are centuries-old stories of people outwitting malevolent forces through clever misinterpretations, or being tricked in this way by mischievous spirits and gods (Uther, 2004). On the playground, age-old games of guile remain popular to this day (Opie & Opie, 2001). Closer to home, the senior author once told a child, “It’s time to put the tablet down”, only to have the child put the tablet physically down on the table, and keep right on watching their movie (Bridgers, Schulz, & Ullman, 2021).
The processes underlying loopholes are not just of legal, historical, or parental interest: In the field of artificial intelligence and machine learning, machines that ‘do what you say, but not what you want’ (Krakovna, 2020, Lehman et al., 2020) are an increasingly pressing concern among researchers and policy makers (Amodei et al., 2016, Russell, 2019). While current machines do not willfully misunderstand goals any more than a bridge is lazy by virtue of falling down, certain errors give people the impression that machines are ‘cheating’. And regardless of intent, figuring out how to safeguard against such behaviors is a major challenge for AI safety.
The motivation and ability to understand goals and cooperate are fundamental to the success of the human species...
Didn’t expect to see alignment papers getting cited this way in mainstream psychology papers now.
https://www.sciencedirect.com/science/article/abs/pii/S001002772500071X