General Thoughts
Solid piece!
One theme I notice throughout the “evidence” section is that it’s mostly starting from arguments that the NAH might not be true, then counterarguments, and sometimes counter-counterarguments. I didn’t see as much in the way of positive reasons we would expect the NAH to be true, as opposed to negative reasons (i.e. counterarguments to arguments against NAH). Obviously I have some thoughts on that topic, but I’d be curious to hear yours.
Particulars
Wentworth thinks it is quite likely (~70%) that a broad class of systems (including neural networks) trained for predictive power will end up with a simple embedding of human values.
Subtle point: I believe the claim you’re drawing from was that it’s highly likely that the inputs to human values (i.e. the “things humans care about”) are natural abstractions. (~70% was for that plus NAH; today I’d assign at least 85%.) Whether “human values” are a natural abstraction in their own right is, under my current understanding, more uncertain.
The NAH only says that AIs will develop abstractions similar to humans when they have similar priors, which may not always be the case.
There’s a technical sense in which this is true, but it’s one of those things where the data should completely swamp the effect of the prior for an extremely wide range of priors.
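A toy Beta-Bernoulli sketch of that point (my own illustration, with made-up numbers): two sharply opposed priors over a coin’s bias end up with essentially the same posterior once enough data has come in.

```python
import numpy as np

# Two very different Beta priors over a coin's bias; after many flips the
# posteriors are nearly identical, i.e. the data swamps the prior.
rng = np.random.default_rng(0)
true_p = 0.7
flips = rng.random(10_000) < true_p                  # 10,000 Bernoulli(0.7) draws
heads, tails = int(flips.sum()), int((~flips).sum())

priors = {
    "tails-leaning": (2, 20),   # Beta(2, 20): strongly expects tails
    "heads-leaning": (20, 2),   # Beta(20, 2): strongly expects heads
}

for name, (a, b) in priors.items():
    post_a, post_b = a + heads, b + tails            # conjugate Beta update
    post_mean = post_a / (post_a + post_b)
    print(f"{name} prior -> posterior mean {post_mean:.3f}")
# Both posterior means land within roughly 0.01 of the true bias, 0.7.
```

The same logic is what the “wide range of priors” claim leans on: as long as the prior puts nonzero weight on the truth, the likelihood term eventually dominates.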
There are still some problems here—we might be able to infer each other’s values across small inferential distances where everybody shares cultural similarities, but values can differ widely across larger cultural gaps.
This is the main kind of argument which makes me think human values are not a natural abstraction.
We could argue that navigators throughout history were choosing from a discrete set of abstractions, with their choices determined by factors like available tools, objectives, knowledge, or cultural beliefs, but with the set of abstractions itself being a function of the environment, not of the navigators.
The dependence of abstractions on data makes it clear that something like this is necessary. For instance, a culture which has never encountered snow will probably not have a concept of it; snow is not a natural abstraction of their data. On the other hand, if you take such people and put them somewhere snowy, they will immediately recognize snow as “a thing”; snow is still the kind-of-thing which humans recognize as an abstraction.
I expect this to carry over to AI to a large extent: even when AIs are using concepts not currently familiar to humans, they’ll still be the kinds-of-concepts which a human is capable of using. (At least until you get to really huge hardware, where the AI can use enough hardware brute-force to handle abstractions which literally can’t fit in a human brain.)
In the paper “ImageNet-trained CNNs are biased towards texture”, the authors observe that the features CNNs use when classifying images lean more towards texture and away from shape (which seems much more natural and intuitive to humans).
However, this also feels like a feature of the “valley of confused abstractions”. Humans didn’t evolve based on individual snapshots of reality; we evolved with moving pictures as input data.
Also the “C” part of “CNN” is especially relevant here; we’d expect convolutional techniques to bias toward repeating patterns (like texture) in general.
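A quick toy sketch of that convolutional bias (my own construction, with synthetic images rather than anything from the paper): a single edge-detecting filter accumulates far more total response over a repeating stripe texture than over one large shape outline, simply because the same local pattern recurs everywhere.

```python
import torch
import torch.nn.functional as F

# One 3x3 vertical-edge filter, applied to two synthetic 32x32 images:
# a repeating stripe "texture" vs. a single hollow-square "shape".
kernel = torch.tensor([[[[-1., 1., 0.],
                         [-1., 1., 0.],
                         [-1., 1., 0.]]]])

stripes = torch.zeros(1, 1, 32, 32)      # texture: vertical stripes everywhere
stripes[..., ::4] = 1.0

square = torch.zeros(1, 1, 32, 32)       # shape: one square outline
square[..., 8:24, 8] = 1.0
square[..., 8:24, 23] = 1.0
square[..., 8, 8:24] = 1.0
square[..., 23, 8:24] = 1.0

for name, img in [("stripes (texture)", stripes), ("square outline (shape)", square)]:
    total = F.conv2d(img, kernel, padding=1).abs().sum().item()
    print(f"{name}: total |response| = {total:.0f}")
# The repeating texture drives the filter several times harder than the lone
# shape, because convolution reuses the same local detector at every position.
```

This is of course a cartoon of what a trained CNN does, but it illustrates why local, repeating statistics are the path of least resistance for convolutional features.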
Thanks for the comment!
Subtle point: I believe the claim you’re drawing from was that it’s highly likely that the inputs to human values (i.e. the “things humans care about”) are natural abstractions.
To check that I understand the distinction between those two: inputs to human values are features of the environment around which our values are based. For example, the concept of liberty might be an important input to human values because the freedom to exercise your own will is a natural thing we would expect humans to want, whereas humans can differ greatly in things like (1) metaethics about why liberty matters, (2) the extent to which liberty should be traded off with other values, if indeed it can be traded off at all. People might disagree about interpretations of these concepts (especially different cultures), but in a world where these weren’t natural abstractions, we might expect disagreement in the first place to be extremely hard because the discussants aren’t even operating on the same wavelength, i.e. they don’t really have a set of shared concepts to structure their disagreements around.
One theme I notice throughout the “evidence” section is that it’s mostly starting from arguments that the NAH might not be true, then counterarguments, and sometimes counter-counterarguments.
Yeah, that’s a good point. I think partly that’s because my thinking about the NAH basically starts with “the inside view seems to support it, in the sense that the abstractions that I use seem natural to me”, and so from there I start thinking about whether this is a situation in which the inside view should be trusted, which leads to considering the validity of arguments against it (i.e. “am I just anthropomorphising?”).
However, to give a few specific reasons why I think it’s plausible which don’t just rely on the inside view:
Humans were partly selected for their ability to act in the world to improve their situations. Since abstractions are all about finding good high-level models that describe things you might care about and how they interact with the rest of the world, it seems like there should have been a competitive pressure for humans to find good abstractions. This argument doesn’t feel very contingent on the specifics of human cognition or what our simplicity priors are; rather the abstractions should be a function of the environment (hence convergence to the same abstractions by other cognitive systems which are also under competition, e.g. in the form of computational efficiency requirements, seems intuitive).
There’s lots of empirical evidence that seems to support it, at least at a weak level (e.g. CLIP as discussed in my post, or GPT-3 as mentioned by Rohin in his summary for the newsletter)
Returning to the clarification you made about inputs to human values being the natural abstraction rather than the actual values, it seems like the fact that different cultures can have a shared basis for disagreement might support some form of the NAH rather than arguing against it? I guess that point has a few caveats though, e.g. (1) all cultures have been shaped significantly by global factors like European imperialism, and (2) humans are all very close together in mind design space so we’d expect something like this anyway, natural abstraction or not