One approach which I didn’t see obviously listed here, though it is related to e.g. “The structure of the planning algorithm”, is to first construct a psychological and philosophical model of what exactly human values are and how they are represented in the brain, before trying to translate them into a utility function.
One (but not the only possible) premise for this approach is that the utility function formalism is not particularly suited for things like changing values or dealing with ontology shifts; while a utility function may be a reasonable formalism for describing the choices that an agent would make at any given time, the underlying mechanism that generates those choices is not particularly well-characterized by a utility function. A toy problem that I have used before is the question of how to update your utility function if it was previously based on an ontology defined in N dimensions, but suddenly the ontology gets updated to include N+1 dimensions:
… we can now consider what problems would follow if we started off with a very human-like AI that had the same concepts as we did, but then expanded its conceptual space to allow for entirely new kinds of concepts. This could happen if it self-modified to have new kinds of sensory or thought modalities that it could associate its existing concepts with, thus developing new kinds of quality dimensions.
An analogy helps demonstrate this problem: suppose that you’re an inhabitant of Flatland, a two-dimensional space in which a rectangle has been drawn to mark a certain area as “forbidden” or “allowed”. But then you suddenly become aware that actually, the world is three-dimensional, and has a height dimension as well! This raises the question: how should the “forbidden” or “allowed” area be understood in this new three-dimensional world? Do the walls of the rectangle extend infinitely in the height dimension, or just some finite distance? If just a finite distance, does the rectangle have a “roof” or a “floor”, or can you enter (or leave) the rectangle from the top or the bottom? There doesn’t seem to be any clear way to tell.
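The Flatland dilemma can be made concrete with a small sketch (all function names here are illustrative, not from any real codebase): a rule defined over a two-dimensional ontology admits several mutually inconsistent extensions once a third dimension appears, and nothing in the original rule selects among them.

```python
def allowed_2d(x, y):
    """Original 2D ontology: the rectangle [0, 10] x [0, 10] is forbidden."""
    return not (0 <= x <= 10 and 0 <= y <= 10)

# Three equally consistent extensions to a 3D ontology -- nothing in the
# original 2D rule tells us which one is "correct".

def walls_infinite(x, y, z):
    # The forbidden walls extend infinitely in the height dimension.
    return allowed_2d(x, y)

def walls_capped(x, y, z, ceiling=100.0):
    # The walls stop at some height; above that, everything is allowed.
    return allowed_2d(x, y) or z > ceiling

def entry_from_above(x, y, z):
    # No "roof": the column above the rectangle is fine, only ground
    # level inside the rectangle is forbidden.
    return allowed_2d(x, y) or z > 0

# All three extensions agree wherever the old ontology applied (z == 0)...
assert not walls_infinite(5, 5, 0)
assert not walls_capped(5, 5, 0)
assert not entry_from_above(5, 5, 0)

# ...but they diverge on points the old ontology couldn't even express.
print(walls_infinite(5, 5, 200), walls_capped(5, 5, 200), entry_from_above(5, 5, 200))
# prints: False True True
```

The point is not that any one extension is wrong, but that the 2D rule underdetermines the 3D one; some reasoning process outside the rule itself has to pick.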
As a historical curiosity, a version of this dilemma actually arose when airplanes were invented: could landowners forbid airplanes from flying over their land, or was ownership of the land limited to some specific height, above which the landowners had no control? Courts and legislation eventually settled on the latter answer.
In a sense, we can say that law is a kind of utility function representing a subset of human values at some given time; when the ontology that those values are based on shifts, the laws get updated as well. A question to ask is: what is the reasoning process by which humans update their values in such a situation? And given that a mature AI’s ontology is bound to be different from ours, how do we want the AI to update its values / utility function in an analogous situation?
Framing the question this way suggests that constructing a utility function is the wrong place to start; rather, we want to start by understanding the psychological foundations of human values, and then figure out how to derive utility functions from those. That way we also know how to update the utility function when necessary.
Furthermore, as this post notes, humans routinely make various assumptions about the relation of behavior and preferences, and a proper understanding of the psychology and neuroscience of decision-making seems necessary for evaluating those assumptions.
Some papers that take this kind of an approach are Sotala 2016, Sarma & Hay 2017, and Sarma, Safron & Hay 2018.

Thanks for the detailed comment! I definitely intended to include all of this within “The structure of the planning algorithm”, but I wasn’t aware of the papers you cited. I’ll add a pointer to this comment to the post.