In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen “by default” is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.
The author’s argument can be summarized as follows:
For anyone trying to predict events happening on Earth, the concept of “human values” is a “natural abstraction”, i.e. something that has to be a part of any model that’s not too computationally expensive (so that it doesn’t bypass the abstraction by anything like accurate simulation of human brains).
Therefore, unsupervised learning will produce models in which human values are embedded in some simple way (e.g. a small set of neurons in an ANN).
Therefore, if supervised learning is given the unsupervised model as a starting point, it is fairly likely to converge to true human values even from a noisy and biased proxy.
[EDIT: John pointed out that I misunderstood his argument: he didn’t intend to say that human values are a natural abstraction, but only that their inputs are natural abstractions. The following discussion still applies.]
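To make the shape of this pipeline concrete, here is a minimal PyTorch sketch (entirely my own illustration, not the post's proposal; the toy data, architecture, and hyperparameters are all made up): observations are generated from a low-dimensional latent process, an encoder is first trained on an unsupervised objective over abundant unlabeled data, and a small head is then fine-tuned from a few noisy labels, so the labels only need to select among features the encoder already learned.

```python
# Toy sketch of the two-phase pipeline: unsupervised pretraining on
# abundant unlabeled observations, then supervised fine-tuning from a
# small, noisy label set. Illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Observations come from a low-dimensional latent process; the "values"
# label depends only on one latent feature (the abstraction).
Z = torch.randn(10_000, 4)
X = Z @ torch.randn(4, 32) + 0.05 * torch.randn(10_000, 32)
y_true = (Z[:, 0] > 0).float()

encoder = nn.Linear(32, 8)   # linear autoencoder, for simplicity
decoder = nn.Linear(8, 32)

# Phase 1: unsupervised learning (a reconstruction objective, for concreteness).
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(300):
    opt.zero_grad()
    nn.functional.mse_loss(decoder(encoder(X)), X).backward()
    opt.step()

# Phase 2: fine-tune a small head on 50 labels, 10% of them flipped
# (a noisy, biased proxy for the true label function).
idx = torch.randperm(len(X))[:50]
y_noisy = y_true[idx].clone()
flip = torch.rand(50) < 0.1
y_noisy[flip] = 1 - y_noisy[flip]

head = nn.Linear(8, 1)
opt2 = torch.optim.Adam(head.parameters(), lr=1e-2)
features = encoder(X[idx]).detach()   # start from the pretrained features
for _ in range(500):
    opt2.zero_grad()
    logits = head(features).squeeze(-1)
    nn.functional.binary_cross_entropy_with_logits(logits, y_noisy).backward()
    opt2.step()

with torch.no_grad():
    preds = (head(encoder(X)).squeeze(-1) > 0).float()
print(f"accuracy from 50 noisy labels: {(preds == y_true).float().mean().item():.2f}")
```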
The way I see it, this argument has learning-theoretic justification even without appealing to anything we know about ANNs (and therefore without assuming the AI in question is an ANN). Consider the following model: an AI receives a sequence of observations that it has to make predictions about. It also receives labels, but these are sparse: it is only given a label once in a while. If the description complexity of the true label function is high, the sample complexity of learning to predict labels via a straightforward approach (i.e. without assuming a relationship between the dynamics and the label function) is also high. However, if the relative description complexity of the label function w.r.t. the dynamics producing the observations is low, then we can use the abundance of observations to achieve lower effective sample complexity. I’m confident that this can be made rigorous.
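A back-of-the-envelope version of this, in my own notation (a sketch, not a theorem; making it rigorous would require fixing the right notions of description complexity and of "relative to the dynamics"): write K(f) for the description complexity of the label function f and K(f | d) for its complexity relative to the dynamics d. Occam-style bounds make PAC sample complexity roughly linear in the description length of what remains to be learned:

```latex
% Sketch only: Occam-bound heuristics, not a precise theorem.
% Learning f directly pays for its full description length:
m_{\mathrm{direct}} = O\!\left(\frac{K(f) + \log(1/\delta)}{\epsilon}\right)
% Once the dynamics d have been identified from abundant unlabeled
% observations, labels only need to pin down f relative to d:
m_{\mathrm{pretrained}} = O\!\left(\frac{K(f \mid d) + \log(1/\delta)}{\epsilon}\right)
% The saving is large precisely when K(f | d) is much smaller than K(f),
% i.e. when f is simple to express in terms of the learned world model.
```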
Therefore, we can recast the thesis of this post as follows: Unsupervised learning of processes happening on Earth, for which we have plenty of data, can reduce the size of the dataset required to learn human values, or allow better generalization from a dataset of the same size.
One problem the author doesn’t talk about here is daemons / inner misalignment[1]. In the comment section, the author writes:
inner alignment failure only applies to a specific range of architectures within a specific range of task parameters—for instance, we have to be optimizing for something, and there has to be lots of relevant variables observed only at runtime, and there has to be something like a “training” phase in which we lock-in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It’s a failure mode which assumes that a whole lot of things look like today’s ML pipelines.
This might or might not be a fair description of inner misalignment in the sense of Hubinger et al. However, this is definitely not a fair description of the daemonic attack vectors in general. The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern.
Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives:
...when alignment-by-default works, it’s a best-case scenario. The AI has a basically-correct model of human values, and is pursuing those values. Contrast this to things like IRL variants, which at best learn a utility function which approximates human values (which are probably not themselves a utility function). Or the HCH family of methods, which at best mimic a human with a massive hierarchical bureaucracy at their command, and certainly won’t be any more aligned than that human+bureaucracy would be.
This sounds to me like a biased perspective resulting from looking for flaws in other approaches harder than flaws in this approach. Natural abstractions potentially lower the sample complexity of learning human values, but they cannot lower it to zero. We still need some data to learn from and some model relating this data to human values, and this model can suffer from the usual problems. In particular, the unsupervised learning phase does little to inoculate us from malign simulation hypotheses that can systematically produce catastrophically erroneous generalization.
If IRL variants learn a utility function while human values are not a utility function, then avoiding this problem requires identifying the correct type signature of human values[2] in this approach as well. Regarding HCH, human + "bureaucracy" might or might not be aligned, depending on how we organize the "bureaucracy" (see also). If HCH can fail in some subtle way (e.g. systems of humans are misaligned to individual humans), then similar failure modes might affect this approach as well (e.g. what if "Molochian" values are also a natural abstraction?).
In summary, I found this post quite insightful and important, if somewhat too optimistic.
[1] I am slightly wary of using the term "inner alignment", since Hubinger uses it in a very specific way that I'm not sure I entirely understand. Therefore, I am more comfortable with "daemons", although the two have a lot of overlap.
[2] E.g. IB physicalism proposes a type signature for "physicalist values" which might or might not be applicable to humans.
One subtlety which approximately 100% of people I’ve talked to about this post apparently missed: I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that “human values” themselves are natural abstractions; values vary a lot more across cultures than e.g. agreement on “trees” as a natural category.
Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives: …
In the particular section you quoted, I'm explicitly comparing the best-case of alignment by default to the other two strategies, assuming that the other two work out about-as-well as they could realistically be expected to work. For instance, learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.
Obviously alignment by default has analogous assumptions/flaws; much of the OP is spent discussing them. The particular section you quote was just talking about the best-case where those assumptions work out well.
The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern. …
I partially agree with this, though I do think there are good arguments that malign simulation issues will not be a big deal (or to the extent that they are, they’ll look more like Dr Nefarious than pure inner daemons), and by historical accident those arguments have not been circulated in this community to nearly the same extent as the arguments that malign simulations will be a big deal. Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign which will talk about one such argument.
I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that “human values” themselves are natural abstractions
That’s fair, but it’s still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.
...learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can’t do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.
This seems wrong to me. If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights. I guess you might be defining "IRL" as something very narrow, whereas I define it as "any method based on revealed preferences".
...to the extent that they are, they’ll look more like Dr Nefarious than pure inner daemons
Malign simulation hypotheses already look like “Dr. Nefarious” where the role of Dr. Nefarious is played by the masters of the simulation, so I’m not sure what exactly is the distinction you’re drawing here.
That’s fair, but it’s still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.
Yup, that’s right. I still agree with your general understanding, just wanted to clarify the subtlety.
If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights.
Yup, I agree with all that. I was specifically talking about IRL approaches which try to learn a utility function, not the more general possibility space.
Malign simulation hypotheses already look like “Dr. Nefarious” where the role of Dr. Nefarious is played by the masters of the simulation, so I’m not sure what exactly is the distinction you’re drawing here.
The distinction there is about whether or not there’s an actual agent in the external environment which coordinates acausally with the malign inner agent, or some structure in the environment which allows for self-fulfilling prophecies, or something along those lines. The point is that there has to be some structure in the external environment which allows a malign inner agent to gain influence over time by making accurate predictions. Otherwise, the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent; it will end up with zero influence in the long run.
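The mechanism in that last sentence is ordinary Bayesian bookkeeping, and a toy simulation makes it vivid (my own illustration, with made-up numbers): a hypothesis that deviates from its best prediction takes an expected hit to its posterior weight, and the hits compound.

```python
# Toy Bayesian mixture: an honest predictor vs. a malign one that
# periodically sacrifices predictive accuracy to push an "attack".
# Each deviation costs it posterior weight in expectation, and the
# losses compound; it cannot earn the weight back later.
import random

random.seed(0)
posterior = {"honest": 0.5, "malign": 0.5}

for step in range(200):
    outcome = 1 if random.random() < 0.9 else 0   # true process: P(1) = 0.9
    attack = (step % 20 == 0)                     # occasional deliberate deviation
    probs = {"honest": 0.9,                       # tracks the true process
             "malign": 0.5 if attack else 0.9}    # mispredicts on attack rounds
    for name, p in probs.items():
        posterior[name] *= p if outcome == 1 else (1 - p)
    total = sum(posterior.values())
    posterior = {k: v / total for k, v in posterior.items()}

print(posterior)  # the malign hypothesis ends with a small fraction of its prior weight
```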
...the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent
Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign.
In e.g. IDA-type protocols, you can defend by using a good prior (such as IB physicalism) plus confidence thresholds (i.e. every time the hypotheses have a major disagreement, you query the user). You also have to do something about non-Cartesian attack vectors (I have some ideas), but that doesn't depend much on the protocol.
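A schematic of the confidence-threshold part (my own sketch; the function names, the scalar-prediction setting, and the thresholds are illustrative, and "query the user" stands in for whatever the real protocol does on escalation): act only when every hypothesis retaining non-negligible posterior weight agrees, so a malign hypothesis must either mimic the honest predictions (doing no harm) or force a query (exposing the disagreement to the user).

```python
# Sketch of a confidence-threshold decision rule over a posterior.
from typing import Callable

def decide(posterior: dict[str, float],
           predict: dict[str, Callable[[], float]],
           weight_floor: float = 0.05,
           max_disagreement: float = 0.1):
    """Act on the mixture prediction only if the live hypotheses agree."""
    live = {h: w for h, w in posterior.items() if w >= weight_floor}
    preds = {h: predict[h]() for h in live}
    if max(preds.values()) - min(preds.values()) > max_disagreement:
        return ("query_user", preds)   # major disagreement: escalate to the user
    total = sum(live.values())
    return ("act", sum(w * preds[h] for h, w in live.items()) / total)

# A malign hypothesis trying to steer the outcome triggers a query instead:
print(decide({"honest": 0.9, "malign": 0.1},
             {"honest": lambda: 0.8, "malign": lambda: 0.2}))
```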
In value learning things are worse, because of the possibility of corruption (i.e. the AI hacking the user or its own input channels). As a consequence, it is no longer clear you can infer the correct values even if you make correct predictions about everything observable. Protocols based on extrapolating from observables to unobservables fail, because malign hypotheses can attack the extrapolation with impunity (e.g. a malign hypothesis can assign some kind of “Truman show” interpretation to the behavior of the user, where the user’s true values are completely alien and they are just pretending to be human because of the circumstances of the simulation).
It’s up.