I agree that it’s fraught to try to answer the question of whose values to align to. In fact I think most serious alignment researchers agree on this point (alas there’s a bunch of less serious folks who’ve gotten caught up in trying to RL their way to alignment, which faces this problem of whose values to align to). And I agree there’s something to figure out about human psychology, namely what we’d consider flourishing, and that this is likely relatively fixed and not culturally specific, in that people from different cultures could learn to adapt to some kind of world that met certain features that induced whatever we think flourishing is.
But I get the impression you think some specific values are determinable in a scientific way, and that those can be aligned to? Maybe I’m misreading you and I need to wait for the next post?
You are correctly following my core diagnostic claim. What I would clarify is your second paragraph, and the question as you posed it.
“Values” in common parlance refers to something like human preference, or subjectively ascertained ideals. Framing this in terms of “values which can be aligned to” risks smuggling in the very framework I am challenging.
Rather, I would say that there is a conception of human flourishing which is normative, discoverable through rational inquiry, and independent of values or cultural preferences. You could almost summarize the issue by saying: “the problem with aligning AI to human preferences in order to make it safe and useful (i.e., conducive to human flourishing) is that human preferences are themselves not properly aligned to human flourishing in this deeper sense, which is discoverable through rational inquiry, including the empirical sciences.” Human biology, psychology, sociology, and history all bear on what helps people flourish and what harms them. The things that are good for humans (physical and cognitive health, social bonds, meaningful work, capacity for self-direction, opportunities for relationships and learning) are not arbitrary cultural preferences. These are facts about the kind of being a human is, and they are discovered through investigation.
We are trying to align the system to subjects who are themselves misaligned, or capable of misalignment, with the objective (read: subject-independent) realities which constitute human flourishing.
Alignment, on this view, is not optimization-against-specified-values. It would be better described as the system’s orientation toward what is the case, including the case about what is good for humans.
Although subtle, the difference is important: a value-target picture requires us to enumerate and rank human goods in a way which faces all the familiar problems (the goods are plural, the rankings are contested, the specification can be gamed). The orientation picture asks instead that the system “apprehend” what is the case with regard to what human beings are and what is constitutive of their flourishing. The difference is analogous to following a rule because you were told to versus acting consistently with the rule because you understand the underlying rationale.
So the framework affirms that human flourishing is scientifically and rationally investigable, while denying that the result of the investigation is a specification to align to.
You correctly anticipated that Part Two is going to look at this more closely. It is the major part of the constructive work that Part Two is intended to develop.