I’m not reasoning anthropically in any non-trivial sense—only claiming that we don’t expect to observe situations that can’t occur with more than infinitesimal probability.
This isn’t a [we wouldn’t be there] thing, but a [that situation just doesn’t happen] thing.
My point then is that human behaviour isn’t surprisingly good.
It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable—it’s tautological.
Some things we could reasonably observe turning out differently include:
More or less variation in behaviour among humans.
More or less variation in behaviour in atypical situations.
More or less external requirements to keep behaviour generally ‘good’.
More or less deviation between stated preferences and revealed preferences.
However, I don’t think this bears on alignment, and I don’t think you’re interpreting the evidence reasonably.
As a simple model, consider four possibilities for traits:
x is common and good.
y is uncommon and bad.
z is uncommon and good.
w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalization!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we’re only really aiming for what’s commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
[w isn’t actually bad, all things considered… correct generalization!]
[w is common and only mildly bad, so it’s best to consider it part of standard human values—correct generalization!]
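The four cases can be written out as a small sketch (the verdict strings and trait names are illustrative, not data): every quadrant ends up classified as compatible with correct generalization.

```python
# Toy encoding of the four quadrants above (common/uncommon x good/bad).
# The verdicts mirror the reasoning being criticised; trait names are
# illustrative examples.

def classify(common: bool, good: bool) -> str:
    """Return the verdict the reasoning above assigns to each quadrant."""
    if common and good:          # x, e.g. empathy
        return "evidence for correct generalization"
    if not common and not good:  # y, e.g. psychopathy
        return "evidence for mostly correct generalization"
    if not common and good:      # z, e.g. boundless compassion
        return "not evidence for misgeneralization"
    # w, e.g. selfishness: reinterpreted as fine, or as part of human values
    return "reinterpreted as correct generalization"

# No combination of properties ever counts as evidence of misgeneralization,
# so the scheme rules out almost nothing.
verdicts = [classify(c, g) for c in (True, False) for g in (True, False)]
assert "evidence of misgeneralization" not in verdicts
```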
It seems to me that the only evidence you’d accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the first place).
Common behaviour that isn’t terrible for society tends to be considered normal/ok/fine/no-big-deal over time, if not initially (that or it becomes uncommon) - since there’d be a high cost both individually and societally to consider it a big deal if it’s common.
If you consider any plausible combination of properties to be evidence for correct generalization, then of course you’ll think there’s been correct generalization—but it’s an almost empty claim, since it rules out almost nothing.
Most people tend to act in ways that preserve/increase their influence, power, autonomy and relationships, since this is useful almost regardless of their values. This is not evidence of correct generalization—it’s evidence that these behaviours are instrumentally useful within the environment ([not killing people] being one example).
To get evidence of something like ‘correct’ generalization, you’d want to look at circumstances where people get to act however they want without the prospect of any significant negative consequence being imposed on them from outside.
Such circumstances are rarely documented (documentation being a potential source of negative consequences). However, I’m going to go out on a limb and claim that people are not reliably lovely in such situations. (though there’s some risk of sampling bias here: it usually takes conscious effort to arrange for there to be no consequences for significant actions, meaning there’s a selection effect for people/systems that wish to be in situations without consequences)
I do think it’d be interesting to get data on [what do humans do when there are truly no lasting consequences imposed externally], but that’s very rare.
I did try to provide a causal story for why humans could be aligned to some value without relying much on societal incentives, so you can check out the second part of my comment.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story on how humans are aligned with the values of the innate reward system wasn’t relying on a behavioral property.
I’ll reproduce it, so that you can focus on the fact that it didn’t rely on behavioral analysis:
There is a weak prior in the genome for things like not seizing power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard: aligning the system to some other value would work more or less as well.
The brain's innate reward system uses something like DPO, RLHF, or whatever else creates a preference model, to guide the intelligence into being aligned with whatever values the innate reward system wants, say, empathy for the ingroup (though this is only a motivating example).
It uses backprop or a weaker variant of it, with an optimizer probably at best comparable to gradient descent; since it has white-box access and can update the brain in a targeted way, it can efficiently compute the direction that improves its performance on, say, having empathy for the ingroup. Again, this is a wildcard: it could stand in for almost any values.
The loop of weak prior + innate reward system + an algorithm to implement it (backprop or its weaker variants) means that by around 25 years old the human is closely aligned with the values the innate reward system put in place, such as empathy for the ingroup. Again, this is only an example of an alignment target; you could put almost arbitrary targets in there.
Critically, it makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain’s weights, and removing the innate reward system or showing that the weak prior + value learning strategy was ineffective would break my thesis.
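The loop described above can be sketched as a minimal runnable toy, under the stated assumptions; `learn_values`, the learning rate, and the prior strength are all illustrative choices, and `target` plays the role of the wildcard value vector.

```python
import random

# Toy sketch of [weak prior + innate reward system + gradient-style updates].
# 'target' is the wildcard: almost any value vector works, illustrating
# that the loop aligns the agent to whatever the reward system encodes.

def learn_values(target, steps=2000, lr=0.05, prior_strength=0.01, seed=0):
    rng = random.Random(seed)
    # Weak prior: values start near zero, with only small random variation.
    values = [rng.gauss(0.0, 0.1) for _ in target]
    for _ in range(steps):
        for i, t in enumerate(target):
            # Reward-system 'gradient': pull each value toward the target,
            # while the weak prior mildly pulls it back toward zero.
            grad = (t - values[i]) - prior_strength * values[i]
            values[i] += lr * grad
    return values

# The same loop converges on arbitrary targets (e.g. ingroup empathy or
# its opposite), so the target slot really is a wildcard.
for target in ([1.0, 0.0, -0.5], [-1.0, 2.0, 0.3]):
    learned = learn_values(target)
    assert all(abs(l - t) < 0.1 for l, t in zip(learned, target))
```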
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x].
We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism, is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved—but this doesn’t give you evidence for the assumption you used to motivate your story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment):
If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit.
If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
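The "regularities get used for free" point can be sketched as a toy replicator model (all fitness numbers are illustrative assumptions, not claims about actual selection pressures): when the environment already supplies trait x, an internal encoding of x carries a small cost and no extra benefit, so selection removes it.

```python
# Toy replicator model of the 'free regularities' argument. If the
# environment already supplies trait x, internally encoding x carries a
# small maintenance cost and no extra benefit, so the encoding is
# selected away. All fitness numbers are illustrative assumptions.

def evolve(env_supplies_x, generations=300, benefit=0.1, cost=0.02):
    freq = 0.5  # fraction of the population that encodes x internally
    for _ in range(generations):
        f_enc = 1.0 + benefit - cost                        # always exhibits x
        f_non = 1.0 + (benefit if env_supplies_x else 0.0)  # x only if free
        mean = freq * f_enc + (1 - freq) * f_non
        freq = freq * f_enc / mean  # replicator (selection) update
    return freq

# With the environment supplying x, the internal encoding drifts out;
# without it, the encoding is preserved.
assert evolve(env_supplies_x=True) < 0.05
assert evolve(env_supplies_x=False) > 0.95
```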
It sounds like we've reached the crux of my optimism: you think that for a system to aim for x, it essentially needs to be an entire environment, so the environment largely dictates human values, whereas I think human values depend less on the environment and far more on the genome + learning process. Put another way, I place much more emphasis on humans' internals as the main contributor to values, while you emphasize the external environment much more than internals like the genome or learning process.
This could be disentangled into two cruxes:
Where are human values generated?
How cheap is it to specify values? Alternatively: how weak do our priors need to be to encode values (if values are encoded internally)?
My answers would be: mostly internal (the genome + learning process, with a little help from the environment) on the first question, and relatively cheap to specify values on the second. You'd probably answer that the environment basically sets the values, with little or no help from humans' internals, on the first question, and that specifying values is very expensive on the second.
For some of my reasoning on this, I'd suggest reading posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode).
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one-another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid.
f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
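The form of the objection can be made concrete, with `infer_values` as a hypothetical stand-in for any deterministic inference procedure: when both sides of the comparison are read off the same behavioural record, agreement is guaranteed and so carries no information.

```python
# Sketch of the tautology objection: if both [values aimed for] and
# [values achieved] are inferred from the same behavioural record by the
# same procedure, agreement is guaranteed regardless of the behaviour.
# `infer_values` is a placeholder for any deterministic inference.

def infer_values(behaviour):
    return sorted(set(behaviour))  # placeholder inference procedure

for behaviour in (["cooperate", "share"], ["defect", "steal"], []):
    aimed_for = infer_values(behaviour)  # read off 'values aimed for'
    achieved = infer_values(behaviour)   # read off 'values achieved'
    # f(x) == f(x): holds for good, bad, and empty behaviour alike.
    assert aimed_for == achieved
```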
That's our disagreement: we have more information than that. I agree human behavior plays a role in my evidence base, but I have more evidence than that alone.
In particular, I'm using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].