“there still is semi-agreement (not just within MIRI) that the alignment problem is due to difficulty aligning the AI’s goals with human goals, rather than difficulty finding the universe’s objective morality to program the AI to follow.”
Right, that’s part of the consensus we wrote the post to dispute. Even though people say they recognize a difference between “align the AI’s goals with human goals” and “find the universe’s objective morality”, they may have implicitly assumed the two are identical as a consequence of other parts of the paradigm’s logic (CEV, the mistake theory framing, and various other foundational ideas that amount to the same thing).
Of course, individuals have complex views, and many who have thought about this have resolved it for themselves one way or another. But we still think it’s an overall blind spot for the field, especially since there isn’t consensus that it’s even an issue.
We are saying in the post that we find it more helpful to start from a different set of assumptions.
Yes, I think the most likely story of survival has the ASI deciding not to destroy the world because we somehow succeeded (by who knows what method) in making the ASI think similarly to humans and non-consequentially follow human norms/morality, which makes it listen to humans.
That feels slightly more plausible than the story where we formally define human values, create a robust reward function for them (solving outer alignment), and then solve inner alignment.
However, I think getting the ASI to follow human norms (even if it becomes so powerful that no one can sanction it) isn’t necessarily that different from getting the ASI to follow the bare minimum of human values, e.g. don’t kill people.
The List of Lethalities says,
-2. When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with less than a fifty percent chance of killing more than one billion people, I’ll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’. Anybody telling you I’m asking for stricter ‘alignment’ than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.
My guess is that most people actually follow Conflict Theory rather than Mistake Theory, since most of them believe in the orthogonality thesis (that any level of intelligence can be combined with any goal). They also wish to resolve this conflict by getting the AI to follow the bare minimum of human values (or norms like not killing people), rather than by brute force (i.e. AI control, which is seen as a temporary solution).
I’m glad you mentioned the orthogonality thesis (that goals are orthogonal to intelligence). Part of our argument can be seen as reconceptualizing orthogonality and rejecting a strong version of it.
So it’s interesting to me that you see orthogonality as leading folks to Conflict Theory over Mistake Theory. I am not sure it has to lead either way.
In our language, we say that it’s better to have an “endogenous preferences” theory than an “exogenous preferences” theory, meaning that it’s better to be able to construct models where preferences can change as a function of the mechanisms and processes being modeled. This is not true for most rational actor and RL models; they typically assume preferences at the start (i.e. “exogenously”).
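To make the distinction concrete in code (a minimal sketch of my own, not from the paper; the option names and functions are made up for illustration):

```python
# Minimal sketch (illustrative only): the same choice problem modeled with
# exogenous vs. endogenous preferences.

from collections import Counter

OPTIONS = ["x", "y"]

# Exogenous: a utility function is fixed before the model runs and cannot
# change as a function of anything the model describes.
FIXED_UTILITY = {"x": 1.0, "y": 0.3}

def choose_exogenous():
    return max(OPTIONS, key=lambda o: FIXED_UTILITY[o])

# Endogenous: "preference" is just a summary computed from the agent's
# accumulated experience, so it shifts as experience accumulates.
def choose_endogenous(experience_log):
    counts = Counter(experience_log)
    return max(OPTIONS, key=lambda o: counts[o])

print(choose_exogenous())                       # always "x"
print(choose_endogenous(["x", "x", "y"]))       # "x" for now...
print(choose_endogenous(["y", "y", "y", "x"]))  # ...but experience can flip it
```

The point is only that in the second function nothing about the preference exists prior to, or apart from, the modeled experience.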
Here’s a paragraph making this more concrete, pulled from a shorter summary of https://arxiv.org/abs/2412.19010 that I’ve been working on.
“Consider an individual’s preference for certain types of music, say, rock versus classical. We need not assume they begin with a scalar utility or “taste” for one over the other. Instead, in our theory, their preferences are shaped by their experiences: perhaps they grow up hearing rock music frequently at home, attend rock concerts with friends, and rarely encounter classical music. These experiences are encoded as memories. When prompted by the question “What kind of person am I?” and summarizing these memories, the individual’s pattern completion network p might generate the assembly “I am the kind of person who likes rock music”. This self-attribution, influenced by past behavior and exposure, becomes part of the global workspace context. Subsequently, when faced with a choice like “What music should I listen to?”, p, conditioned on the self-description “I like rock music”, is more likely to complete the pattern with “Listen to rock music”. Through this process, the individual’s behavior (listening to rock) is influenced by their inferred identity, which itself arose from past behavior and exposure. This example illustrates how preferences are constructed endogenously through the aggregation and consolidation of experience, rather than being static, external inputs to the model.”
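Here’s a toy sketch of that loop, written for this comment rather than taken from the paper; the pattern completion network p is a learned model in the paper, which I stand in for with simple frequency counting:

```python
# Toy sketch of the loop above: experiences become memories, memories get
# summarized into a self-description, and the self-description then
# conditions later choices. The paper's pattern-completion network p is a
# learned model; here it is stood in for by frequency counts.

from collections import Counter

memories = []   # episodic record of experiences

def experience(event):
    memories.append(event)

def self_description():
    """Answer 'What kind of person am I?' by summarizing memories."""
    if not memories:
        return None
    genre, _ = Counter(memories).most_common(1)[0]
    return f"I am the kind of person who likes {genre} music"

def choose_music():
    """Answer 'What music should I listen to?', conditioned on the
    self-description currently in the global-workspace context."""
    context = self_description()
    if context is None:
        return "anything"
    # Pattern completion: the self-attribution makes the matching
    # action the most likely completion.
    return context.rsplit("likes ", 1)[1].replace(" music", "")

# Growing up around rock: exposure -> memories -> identity -> behavior.
for _ in range(10):
    experience("rock")
experience("classical")

print(self_description())   # "I am the kind of person who likes rock music"
print(choose_music())       # "rock"
```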
Yes, when it comes to endogenous vs exogenous preferences, I think you are probably right that most of the field assumes exogenous preferences (although I’m not sure, since I’m not that familiar with the field myself). Rohin Shah once talked about ambiguous value learning, but my guess is that most people aren’t focusing on that direction.
It’s possible that people who believe in exogenous preferences will feel confused if they are described as following Mistake Theory, since an exogenous-preferences view sounds more like Conflict Theory, and doesn’t sound like “pursuing an objective morality.”
My personal opinion (and my opinion isn’t that important here) is that we humans ourselves follow a combination of endogenous and exogenous preferences. Our morals are very strongly shaped by what everyone around us believes, often more than we realize.
But at the same time, our hardcoded biology determines how our morals get shaped. If we hunt and kill animals for food or sport, but observe the animals we kill following various norms in their own animal societies, we will not adopt their norms through mere exposure; we remain indifferent to them. This is because our hardcoded biology did not encode any tendency to care about those animals or to respect their norms.
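A crude way to picture that interaction (purely a toy illustration of my own, with made-up numbers): exposure only shifts the norms we adopt for groups our biology already tags as worth caring about.

```python
# Crude toy illustration (my own, not anyone's model): exposure shifts adopted
# norms only for groups our hardcoded biology tags as worth caring about.

INNATE_CARE = {"humans": 1.0, "hunted_animals": 0.0}   # "hardcoded biology"

adopted_norms = {}   # norm -> strength, shaped endogenously by exposure

def observe_norm(group, norm, exposures=1):
    # Exposure is weighted by innate care; zero care means zero adoption,
    # no matter how much exposure there is.
    weight = INNATE_CARE.get(group, 0.0)
    adopted_norms[norm] = adopted_norms.get(norm, 0.0) + weight * exposures

observe_norm("humans", "don't kill people", exposures=100)
observe_norm("hunted_animals", "deer dominance rituals", exposures=100)

print(adopted_norms)   # {"don't kill people": 100.0, "deer dominance rituals": 0.0}
```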
I agree that studying endogenous preferences, and how to make them go right, is valuable!