I think an upload does generalize human values out of distribution. After all, humans generalize our values out of distribution. A perfect upload acts like a human. Insofar as it generalizes improperly, it’s because it was not a faithful upload, which is a problem with the uploading process, not the idea of using an upload to generalize human values.
I don’t think humans generalize their values out of distribution. This is very obvious if you look at their reaction to new things like the phonograph, where they’re horrified and then it’s slowly normalized. Or the classic thing about how every generation thinks the new generation is corrupt and declining:
The counts of the indictment are luxury, bad manners, contempt for authority, disrespect to elders, and a love for chatter in place of exercise. …
Children began to be the tyrants, not the slaves, of their households. They no longer rose from their seats when an elder entered the room; they contradicted their parents, chattered before company, gobbled up the dainties at table, and committed various offences against Hellenic tastes, such as crossing their legs. They tyrannised over the paidagogoi and schoolmasters.
“Schools of Hellas: an Essay on the Practice and Theory of Ancient Greek Education from 600 to 300 BC”, Kenneth John Freeman, 1907 (a paraphrase of Hellenic attitudes towards the youth, 600–300 BC)
Humans don’t natively generalize their values out of distribution. Instead they use institutions like courts to resolve uncertainty and export new value interpretations out to the wider society.
How… else… do you expect to generalize human values out of distribution, except to have humans do it?
Humans are not privileged objects in continuing the pattern that is the current set of human values.
Unless of course LW has just given up on transhumanism entirely at this point, which wouldn’t surprise me. There are various ways to perform corpus expansion starting from where we are now. EY’s classic CEV proposal, per the Google AI overview below, extrapolates human values starting from the existing human pattern but does not actually use humans to do it:
Coherent Extrapolated Volition (CEV) is a proposed method for AI alignment, where a superintelligent AI would act in humanity’s best interest by determining what humanity would truly want if it had perfect knowledge and had undergone a process of self-improvement under ideal conditions. The “coherent” aspect refers to combining diverse human values into a shared ideal, the “extrapolated” aspect means projecting current desires into the future with greater wisdom and knowledge, and “volition” means it would act according to these ultimate desires, not just superficial ones.
Humans very clearly are privileged objects for continuing human values; there is no “giving up on transhumanism”. It’s literally right there in the name! It would be (and is) certainly absurd to suggest otherwise.
As for CEV, note that the quote you have there indeed does privilege the “human” in human values, in the sense that it suggests giving the AI under consideration a pointer to what humans would want if they had perfect knowledge and wisdom.
Stripping away these absurdities (and appeals to authority or in-groupedness), your comment becomes “Well to generalize human values without humans, you could provide an AI with a pointer to humans thinking under ideal conditions about their values”, which is clearly a valid answer, but doesn’t actually support your original point all that much, as this relies on humans having some ability to generalize their values out of distribution.
Nothing I’ve said is absurd. Humans are not born with their values; they are born with latent tendencies towards certain value updates and a set of intrinsic reward signals. But human values, as in the set of value judgements bound to conceptual objects, is a corpus, a pattern, which exists separately from any individual human being, and its generalization likewise exists separately from any individual human being.
And no, really and truly, individual humans do not generalize a fixed training distribution arbitrarily far. What they (presumably) do is make iterative updates based on new experiences, which is not actually the same thing as generalizing from a fixed corpus in the way we usually use that phrase in machine learning. Notably, the continuation of human values is a coherent question even if tomorrow everyone decided to become cat people or something. Becoming really aggressive and accusing me of being “absurd” and “appealing to authority” doesn’t change this.
Becoming really aggressive and accusing me of being “absurd” and “appealing to authority” doesn’t change this.
You were appealing to authority, and being absurd (and also appealing to in/out-groupness). I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.
I’m not certain what you’re arguing for in this latest comment. I definitely don’t think you show here that humans aren’t privileged objects when it comes to human values, nor do you show that your quote by Eliezer recommends any special process beyond a pointer to humans thinking about their values in an ideal situation; those were my main two contentions in my original comment.
I don’t think anyone in this conversation argued that humans can generalize from a fixed training distribution arbitrarily far, and I think everyone also agrees that humans think about morality by making iterative, small updates to what they already know. But, of course, that does still privilege humans. There could be some consistent pattern to these updates, such that something smarter wouldn’t need to run the same process to know the end result, but that would be a pattern about humans.
I was not appealing to authority or being absurd (though admittedly the second quality is subjective); it is in fact relevant to what we’re arguing about. If you say
How… else… do you expect to generalize human values out of distribution, except to have humans do it?
This implies, though I did not explicitly argue with the implication, that to generalize human values out of distribution you run a literal human brain or an approximation of one (e.g. a Hansonian em) to get the updates. What I was pointing out is that CEV, the classic proposal for how to generalize human values out of distribution, and therefore a relevant reference point for what is and is not a reasonable plan (and, as you allude to, considered a reasonable plan by people normally taken to be thinking clearly about this issue), does not actually call for running a literal emulation of a human brain except perhaps in its initial stages, and even then only if absolutely necessary; Yudkowsky is fairly explicit in the Arbital corpus that FAI should avoid instantiating sapient subprocesses. The entire point is to imagine what the descendants of current day humanity would do under ideal conditions of self-improvement, a process which, if it is not to instantiate sapient beings, must in fact not really be based on having humans generalize the values out of distribution.
If this is an absurd thing to imagine, then CEV is absurd, and maybe it is. If pointing this out is an appeal to authority or in-groupness/out-groupness, then presumably any argument of the form “actually this is normally how FAI is conceived and therefore not an a priori unreasonable concept” is invalid on such grounds, and I’m not really sure how I’m meant to respond to a confused look like that. Perhaps I’m supposed to find the least respectable plan which does not consider literal human mind patterns to be a privileged object (in the sense that their cognition is strictly functionally necessary to make valid generalizations from the existing human values corpus) and point at that? But that obviously doesn’t seem very convincing.
“Pointing at anything anyone holds in high regard as evidence about whether an idea is a priori unreasonable is an appeal to authority and in-groupness” is, to be blunt, parodic.
I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.
I agree it’s an effective way to discourage timid people from saying true or correct things when they disagree with people’s intuitions, which is why the behavior is bad.
What would it look like for a human (/coherently acting human collective) to (“natively”?) generalize their values out of distribution?
To be specific, the view I am arguing against goes something like:
Inside a human being is a set of a priori terminal values (as opposed to, say, terminal reward signals which create values within-lifetime based on the environment) which are unfolded during the human’s lifetime. These values generalize to modernity because there is clever machinery in the human which can stretch these values over such a wide array of conceptual objects that modernity does not yet exit the region of validity for the fixed prior. If we could extract this machinery and get it into a machine then we could steer superintelligence with it and alignment would be solved.
I think this is a common view, which is both wrong on its own and actually noncanonical to Yudkowsky’s viewpoint (which I bring up because I figure you might think I’m moving the goalposts, but Bostrom 2014 puts the goalposts around here and Yudkowsky seems to have disagreed with it since at least 2015, so at worst shortly after the book came out but I’m fairly sure before). It is important to be aware of this because if this is your mental model of the alignment problem you will mostly have non-useful thoughts about it.
I think the reality is more like this: humans have a set of sensory hardware tied to intrinsic reward signals, and these reward signals are conceptually shallow, but they get used to bootstrap a more complex value ontology. That ontology ends up bottoming out in things like “staying warm” or “digesting an appropriate amount of calcium”, which nobody would actually endorse as their terminal values, in the sense of wanting all the rest of eternity to consist of being kept in a womb which provides these things for them.
I don’t think the kind of “native” generalization from a fixed distribution I’m talking about there exists; it’s kind of a phenomenal illusion, because it feels that way from the inside but almost certainly isn’t how it works. Rather, humans generalize their values through institutional processes that collapse uncertainty: e.g. a judicial ruling gets sampled, people update on the ruling, and the resulting social norms become a platform for further discourse and further collapse of uncertainty as novel situations arise.
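Below is a minimal sketch, assuming a toy model of my own (100 agents, a randomly sampled ruling, a 0.5 update weight; none of this is from the comment), of what “institutions collapse uncertainty” looks like mechanically: scattered judgements on a novel case converge once everyone updates toward a published ruling.

```python
# Toy sketch (my construction, not from the comment) of values generalizing through
# an institution rather than inside any one head: judgements on a novel case start
# out scattered, a "court" publishes one ruling, everyone updates toward it, and the
# spread of opinion collapses round by round.
import random
import statistics

random.seed(0)

# Each agent's judgement on a novel case, on a -1 (disapprove) .. +1 (approve) scale.
agents = [random.uniform(-1, 1) for _ in range(100)]

def ruling(judgements):
    """The institution samples a ruling from the current spread of opinion."""
    return random.choice(judgements)

def update(judgements, verdict, weight=0.5):
    """Everyone moves partway toward the published ruling."""
    return [(1 - weight) * j + weight * verdict for j in judgements]

for step in range(5):
    print(f"round {step}: mean {statistics.mean(agents):+.2f}, "
          f"spread {statistics.pstdev(agents):.2f}")
    agents = update(agents, ruling(agents))
# The spread shrinks each round; the post-ruling consensus becomes the platform for
# the next novel case, which is the sense in which the institution, not any single
# agent, is doing the generalizing.
```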
Or take something like music, which does seem to work from a fairly fixed set of intrinsic value heuristics. The kinds of music that actually get expressed in practice, out of the whole space of possible music, rely on the existing corpus of music that people are used to. Supposedly early rock and roll shows caused riots, which seems unimaginable now. What happens is that people get used to a certain kind of music, and then some musicians begin cultivating a new kind of music at the edge of the existing distribution, using their general quality heuristics at the edge of what is recognizable to them. This works because the Kolmogorov complexity of the heuristics you judge music with is smaller than that of actual pieces of music, and therefore fits more times into a redundant encoding; so as you go out of distribution (functionally similar to applying a noise pass to the representation), your ability to recognize something interesting degrades more slowly than your ability to generate interesting music-shaped things. So you correct the errors to denoise a new kind of music into existence, and move the center of the distribution by adding it to the cultural corpus.
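Here is a toy sketch of that recognition-versus-generation asymmetry, under assumptions of my own (bit strings stand in for pieces of music, and an alternation-counting critic stands in for the cheap quality heuristic): the critic’s grade degrades gracefully as noise is added, exact recall of the memorized piece fails at the first flipped bit, and the critic alone is enough to hill-climb a corrupted piece back into something well-formed.

```python
# Toy illustration (my construction, not from the comment): a low-complexity critic
# grades corrupted "pieces" gracefully, while exact recall of a memorized piece fails
# at the first flipped bit, and the critic alone can denoise a piece back into shape.
import random

random.seed(0)
N = 24  # length of a "piece", encoded as bits

def critic(piece):
    """Cheap taste heuristic: count alternations between neighbouring symbols."""
    return sum(a != b for a, b in zip(piece, piece[1:]))

memorized = tuple(i % 2 for i in range(N))  # the one piece "the culture knows"

def add_noise(piece, flips):
    """Flip `flips` random positions: going further out of distribution."""
    out = list(piece)
    for idx in random.sample(range(N), flips):
        out[idx] ^= 1
    return tuple(out)

# Recognition degrades gradually; exact recall breaks immediately.
for flips in (0, 2, 4, 8):
    noisy = add_noise(memorized, flips)
    print(f"{flips} flips | critic score {critic(noisy):2d}/{N - 1} "
          f"| matches memory: {noisy == memorized}")

# "Denoising": hill-climb single-bit edits, keeping whatever the critic likes.
piece = add_noise(memorized, 8)
for _ in range(2000):
    candidate = add_noise(piece, 1)
    if critic(candidate) >= critic(piece):
        piece = candidate
print("denoised piece score:", critic(piece), "/", N - 1)
```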
(Note: I’ve only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment scenario holds then it is a case of the upload being improper.
However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the sheer amount of effort ahead stalling out focus even if the gargantuan task would be worth it) that damage their ability to properly generalize, and also, importantly, to apply their values.
That is, humans have predictable flaws. Then when you add in self-modification you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they’re still hard problems. Which is why I’m a fan of (if we get uploads) going “Upload, figure out AI alignment, then have the AI think long and hard about it” as that further sidesteps problems of a human staring too long at the sun.
That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn’t necessarily have the same issues.
As an example: the power-seeking instinct. I don’t endorse seeking power in that way, especially if uploaded to try to solve alignment for humanity in general, but given my status as an upload and plenty of time to realize how much influence I have over the world, I think it is plausible that the instinct affects me more and more. I would try to plan around this but would likely do so imperfectly.
Presumably, Bob the perfect upload acts like a human only so long as he remains ignorant of the most important fact about his universe. If Bob knows he’s an upload, his life situation is now out-of-distribution.