You are correctly following my core diagnostic claim. What I would clarify is your second paragraph, and the question as you posed it there.
“Values” in common parlance refers to something like human preference, or subjectively ascertained ideals. Framing this in terms of “values which can be aligned to” risks smuggling in the very framework I am challenging.
Rather, I would say that there is a conception of human flourishing which is normative, discoverable through rational inquiry, and independent of values or cultural preferences. You could almost summarize the issue this way: “The problem with aligning AI to human preferences in order to make it safe and useful (i.e., conducive to human flourishing) is that human preferences are not themselves properly aligned to human flourishing in this deeper sense, which is discoverable through rational inquiry, including the empirical sciences.” Human biology, psychology, sociology, and history all bear on what helps people flourish and what harms them. The things that are good for humans (physical and cognitive health, social bonds, meaningful work, the capacity for self-direction, opportunities for relationships and learning) are not arbitrary cultural preferences. They are facts about the kind of being a human is, and they are discovered through investigation.
We are trying to align the system to subjects who are themselves misaligned, or at least capable of misalignment, with respect to the objective (read: subject-independent) realities that constitute human flourishing.
Alignment, on this view, is not optimization-against-specified-values. It would be better described as the system’s orientation toward what is the case, including the case about what is good for humans.
Although subtle, the difference is important: a value-target picture requires us to enumerate and rank human goods in a way that runs into all the familiar problems (the goods are plural, the rankings are contested, the specification can be gamed). The orientation picture asks instead that the system “apprehend” what is the case about what human beings are and what constitutes their flourishing. The difference is analogous to that between following a rule because you were told to and acting consistently with the rule because you understand its underlying rationale.
So the framework affirms that human flourishing is scientifically and rationally investigable, while denying that the result of the investigation is a specification to align to.
You correctly anticipated that Part Two will look at this more closely. Developing it is the major part of the constructive work that Part Two is intended to do.