Nice post!

My key takeaway: “A system is aligned to human values if it tends to generate optimized-looking stuff which is aligned to human values.”
I think this is useful progress. In particular it’s good to try to aim for the AI to produce some particular result in the world, rather than trying to make the AI have some goal—it grounds you in the thing you actually care about in the end.
I’d say the “… aligned to human values” part is still underspecified (and I think you at least partially agree):
“aligned”: what does the ontology translation between the representation of the “generated optimized-looking stuff” and the representation of human values look like?
“human values”
I think your model of humans is too simplistic. E.g. at the very least it’s lacking a distinction like the one between “ego-syntonic” and “voluntary” in this post, though I’d probably want a significantly more detailed model still. Also one might need different models for very smart and reflective people than for most people.
We haven’t described value extrapolation.
(Or, from an alternative perspective, our model of humans doesn’t identify their relevant metapreferences (which probably no human knows fully explicitly, and which for some/many humans might not be all that well defined).)
Positive reinforcement for first trying to better understand the problem before running off and trying to solve it! I think that’s the way to make progress, and I’d encourage others to continue work on more precisely defining the problem, and in particular on getting better models of human cognition to identify how we might be able to rebind the “human values” concept to a better model of what’s happening in human minds.
Btw, I’d have put the corrigibility section into a separate post; it’s not nearly up to the standards of the rest of this post.
To set expectations: this post will not discuss …
Maybe you want to add here that this is not meant to be an overview of alignment difficulties, or an explanation for why alignment is hard.
Meta note: strong upvoted, very good quality comment.
“aligned”: what does the ontology translation between the representation of the “generated optimized-looking stuff” and the representation of human values look like?
Yup, IMO the biggest piece unaddressed in this post is what “aligned” means between two goals which are potentially in somewhat different ontologies.
I think your model of humans is too simplistic. E.g. at the very least it’s lacking a distinction like the one between “ego-syntonic” and “voluntary” in this post, though I’d probably want a significantly more detailed model still. Also one might need different models for very smart and reflective people than for most people.
I think the model sketched in the post is at roughly the right level of detail to talk about human values specifically, while remaining agnostic to lots of other parts of how human cognition works.
We haven’t described value extrapolation.
(Or, from an alternative perspective, our model of humans doesn’t identify their relevant metapreferences (which probably no human knows fully explicitly, and which for some/many humans might not be all that well defined).)
Yeah, my view on metapreferences is similar to my view on questions of how to combine the values of different humans: metapreferences are important, but their salience is way out of proportion to their importance. (… Though the disproportionality is much less severe for metapreferences than for interpersonal aggregation.)
Like, people notice that humans aren’t always fully consistent, and think about what’s the “right way” to resolve that, and one of the most immediate natural answers is “metapreferences!”. And sometimes that is the right answer, but I view it as more of a last-line fallback for extreme cases. Most of the time (I claim) the “right way” to resolve the inconsistency is to notice that people are frequently and egregiously wrong in their estimates of their own values (as evidenced by experiences like “I thought I wanted X, but in hindsight I didn’t”), that most of the perceived inconsistency comes from the estimates being wrong, and that the right question to focus on is instead “What does it even mean to be wrong about our own values? What’s the ground truth?”.
metapreferences are important, but their salience is way out of proportion to their importance.
You mean the salience is too high? On the contrary, it’s too low.
one of the most immediate natural answers is “metapreferences!”.
Of course, this is not an answer, but a question-blob.
as evidenced by experiences like “I thought I wanted X, but in hindsight I didn’t”
Yeah, I think this is often, maybe almost always, more like “I hadn’t computed / decided to not want [whatever Thing-like thing X gestured at], and then I did compute that”.
a last-line fallback for extreme cases
It’s really not! Our most central values are all of the proleptic (pre-received; foreshadowed) type: friendship, love, experience, relating, becoming. They can all only be expressed in either a vague or an incomplete way: “There’s something about this person / myself / this collectivity / this mental activity that draws me in to keep walking that way.” Part of this is resolvable confusion, but probably not all of it. Part of the fun of relating with other people is that there’s a true open-endedness; you get to cocreate something non-pre-delimited, find out what another [entity that is your size / as complex/surprising/anti-inductive as you] is like, etc. “Metapreferences” isn’t an answer, of course, but there’s definitely a question that has to be asked here, and the answer will fall under “metapreferences” broadly construed, in that it will involve stuff that is ongoingly and actively determining [all that stuff we would call legible values/preferences].
“What does it even mean to be wrong about our own values? What’s the ground truth?”
Ok, we can agree that this should point the way to the right questions and answers, but it’s an extremely broad question-blob.