The ambiguity of “human values” is a feature, not a bug.
When talking about AI alignment with noobs, there’s this problem where everyone and their grandmother instinctively wants to jump immediately to arguing about which particular values/principles/etc the AI should be aligned with. Arguing about what values/principles/etc other people should follow is one of humanity’s major pastimes: it’s familiar territory, everyone has stupid takes on it, and people can spew those stupid takes for ages without getting smacked in the face by reality, because there’s usually no good feedback mechanism on value claims. Those habits generalize readily to talking about AI values. It’s very much a bike shed[1]. One of the most common first takes people have upon hearing the problem is “Maybe it’s easy, people just haven’t thought about aligning it to X”, where X is love or The Bible or preserving Markov blankets or complexity or niceness or [...]. Or, rather than a positive suggestion, a negative one, like e.g. the classic “But maybe [humans/AI] won’t turn out to have a utility function at all”.
On the other hand, from a technical perspective, the entire “What values tho?” question is mostly just not that central/relevant. Understanding how minds work at all and how to robustly point them at anything at all is basically the whole problem.
So when talking about the topic, I (and presumably others) have often found myself reaching for a way to talk about the alignment target which is intentionally conflationary. Because if I say some concrete specific target, then the idiots will come crawling out of the woodwork saying “what if you align it to <blah> instead?” or “but what if there is no utility function?” or whatever. By using a generic conflationary term, I can just let people fill in whatever thing they think they want there, and focus on the more central parts of the problem.
Historically, when I’ve felt that need, I’ve usually reached for the term “values”. It’s noncommittal about what kind of thing we’re even talking about, and it mildly emphasizes that we’re not yet sure what we’re talking about; that’s a feature rather than a bug of the term. I’ve historically used “human values” and “AI values” similarly; they’re intentionally noncommittal, and that’s a feature rather than a bug, because it redirects attention to the more central parts of what I’m talking about, rather than triggering people’s takes on alignment targets.
The old adage is that, when a committee is planning a new nuclear power plant, far more time will be spent arguing about what color to paint the bike shed than on any technically load-bearing aspect, because everybody feels qualified to offer a take on the color of the bike shed and to argue about it.
This makes sense as a strategic choice, and thank you for explaining it clearly, but I think it’s bad for discussion norms because readers won’t automatically understand your intent as you’ve explained it here. Would it work to substitute the term “alignment target” or “developer’s goal”?
I agree to a significant extent, and do usually prefer “target” or “goal” in contexts where that makes sense.
It doesn’t always make sense, though. Often I want to talk about how humans’ value-systems work in general, without talking about AI or alignment. Intentional ambiguity still applies there. For example, I usually don’t want to talk about “preferences” in such situations (because that has specific established mathematical meanings which are not what I mean), or “utility”, or “morality”, or “goodness”, or “goals”, or [...]. The thing I want to talk about is different from any of those, and I don’t necessarily have a good legible explanation of what that thing is, either in my head or in writing. So it’s useful to use an intentionally-underdefined term as a distinct placeholder for it, with the expectation that over time I will better understand what it is that I’m instinctively gesturing at with the term.
This comment was really useful. Have you expanded on this in a post at all?