What would it look like for a human (/coherently acting human collective) to (“natively”?) generalize their values out of distribution?
To be specific, the view I am arguing against goes something like this:
Inside a human being is a set of a priori terminal values (as opposed to, say, terminal reward signals which create values within-lifetime based on the environment) which are unfolded during the human's lifetime. These values generalize to modernity because there is clever machinery in the human which can stretch them over such a wide array of conceptual objects that modernity has not yet exited the region of validity for the fixed prior. If we could extract this machinery and get it into a machine, then we could steer superintelligence with it and alignment would be solved.
I think this is a common view, which is both wrong on its own terms and actually noncanonical to Yudkowsky's viewpoint. (I bring this up because you might think I'm moving the goalposts, but Bostrom 2014 puts the goalposts around here, and Yudkowsky seems to have disagreed with it since at least 2015, so at worst shortly after the book came out, though I'm fairly sure it was before.) It is important to be aware of this because if this is your mental model of the alignment problem, you will mostly have non-useful thoughts about it.
I think the reality is more like this: humans have a set of sensory hardware tied to intrinsic reward signals, and these reward signals are conceptually shallow but get used to bootstrap a more complex value ontology. That ontology ends up bottoming out in things nobody would actually endorse as their terminal values, like "staying warm" or "digesting an appropriate amount of calcium," in the sense that nobody would want all the rest of eternity to consist of being kept in a womb which provides these things for them.
I don't think the kind of "native" generalization from a fixed distribution I'm describing there actually exists. It's a kind of phenomenal illusion: it feels that way from the inside, but that is almost certainly not how it works. Rather, humans generalize their values through institutional processes that collapse uncertainty, e.g. by sampling a judicial ruling; people then update on the ruling, and the resulting social norms become a platform for further discourse and further collapse of uncertainty as novel situations arise.
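To make that loop concrete, here is a minimal toy sketch of iterated uncertainty collapse. Everything in it is my own illustration, not anything from the argument itself: the Gaussian "norm," the hypothetical `institutional_ruling` and `update_norm` functions, and all parameter choices are assumptions picked for the sketch.

```python
# Toy sketch: model a community's norm as a Gaussian over "acceptable
# rulings." When a novel case arrives outside the current norm, an
# institution samples a ruling, everyone updates toward it, and the norm
# drifts. Values "generalize" by iterated collapse of uncertainty, not by
# a fixed prior stretching natively.
import random

norm_mean, norm_std = 0.0, 1.0  # current social norm over case-positions

def institutional_ruling(case_position: float) -> float:
    """Sample a ruling: a compromise between the norm and the novel case."""
    anchor = (norm_mean + case_position) / 2
    return random.gauss(anchor, norm_std / 4)

def update_norm(ruling: float, weight: float = 0.2) -> None:
    """Everyone updates on the ruling; the norm drifts toward it."""
    global norm_mean
    norm_mean = (1 - weight) * norm_mean + weight * ruling

random.seed(0)
for step in range(10):
    novel_case = norm_mean + 3 * norm_std  # a case at the edge of the norm
    ruling = institutional_ruling(novel_case)
    update_norm(ruling)
    print(f"step {step}: ruling={ruling:+.2f}, norm_mean={norm_mean:+.2f}")
```

The point of the sketch is only that each ruling is a platform for the next one: the norm that judges step nine did not exist at step zero.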
Or take something like music, which does seem to work from a fixed set of intrinsic value heuristics. Even there, the actual kinds of music that get expressed in practice, within the space of possible music, rely on the existing corpus of music people are used to. Supposedly early rock and roll shows caused riots, which seems unimaginable now. What happens is that people get used to a certain kind of music, then some musicians begin cultivating a new kind on the edge of the existing distribution, using their general quality heuristics at the edge of what is recognizable to them. This works because the Kolmogorov complexity of the heuristics you judge music with is smaller than that of actual pieces of music, and therefore fits more times into a redundant encoding. So as you go out of distribution (functionally similar to applying a noise pass to the representation), your ability to recognize something interesting degrades more slowly than your ability to generate interesting music-shaped things. You correct the errors to denoise a new kind of music into existence, and you move the center of the distribution by adding it to the cultural corpus.
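Here is a rough sketch of that dynamic, again my own toy illustration rather than a claim about real music: the `critic` heuristic, the Gaussian corpus, and every constant are arbitrary stand-ins. The idea it instantiates is that a cheap critic keeps discriminating further out of distribution than the generator can reliably produce, so rejection-sampling at the edge and folding the survivors back into the corpus walks the distribution's center outward.

```python
# Toy sketch: a low-complexity critic "denoises" edge-of-distribution
# proposals into a new style, and the corpus center drifts generation by
# generation as accepted proposals are added to the cultural corpus.
import random
import statistics

random.seed(0)
corpus = [random.gauss(0.0, 1.0) for _ in range(200)]  # familiar "music"

def critic(x: float) -> bool:
    """Cheap quality heuristic: accepts anything with the right
    'structure' (here, proximity to an even integer), no matter how far
    from the corpus it lies. Low complexity, so it degrades slowly OOD."""
    return abs(x - 2 * round(x / 2)) < 0.25

def generate_at_edge(corpus: list[float]) -> float:
    """Generator: propose just past the edge of the current distribution."""
    mu, sigma = statistics.mean(corpus), statistics.stdev(corpus)
    return random.gauss(mu + 1.5 * sigma, sigma)

for generation in range(5):
    proposals = (generate_at_edge(corpus) for _ in range(500))
    accepted = [x for x in proposals if critic(x)]  # keep what's recognizable
    corpus.extend(accepted)  # add the new genre to the cultural corpus
    print(f"gen {generation}: corpus center now {statistics.mean(corpus):+.2f}")
```

Each generation, most proposals are noise the critic rejects; the few it recognizes as interesting get added to the corpus, and the next generation's "edge" starts from there.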