I sometimes hear people say things like, “While we have a bunch of uncertainty over what powerful AIs’ motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable.” I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.
First, in humans. We see a pretty broad range of human motivations:
I would be happy to give huge amounts of power to some humans but not others. And for those others, there’s a wide variety of ways they might be misaligned. Many people are too selfish to themselves and/or their families; many people are ideological about a cause or belief; the most notable worry with some people is that they are sadistic or vengeful; etc.
This variation is somehow explained primarily by something like ~~1kB of genetic information and the set of experiences people had. This is a pretty small amount of information.
Second, in current LLMs. We can get LLMs to behave roughly according to a wide variety of motivations, including intended motivations, scheming motivations and reward-seeking motivations. This is largely a function of how the training data maps onto pretraining priors (so this evidence is therefore not statistically independent of the human evidence). If we observe that RLing models on reward-hackable objectives causes them to be broadly misaligned, then we can tell the model that reward-hacking during training is ok, and the model doesn’t end up broadly misaligned.
I’m pointing at evidence that the motivations of agents aren’t overdetermined, which is in turn some evidence that developers can influence AI motivations if they can correctly identify the levers (which may be hard with status-quo behavioral oversight!). I’m definitely not claiming that alignment of sovereign superintelligence is easy. I think that alignment sufficiently robust to withstand sovereign superintelligent optimization is a narrow target (if people try to make sovereign superintelligence). But this issome reason why I think attaining trustworthy corrigible assistants of intermediate-but-transformative capability levels may be tractable.
I sometimes hear people say things like, “While we have a bunch of uncertainty over what powerful AIs’ motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable.” I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.
First, in humans. We see a pretty broad range of human motivations:
I would be happy to give huge amounts of power to some humans but not others. And for those others, there’s a wide variety of ways they might be misaligned. Many people are too selfish to themselves and/or their families; many people are ideological about a cause or belief; the most notable worry with some people is that they are sadistic or vengeful; etc.
This variation is somehow explained primarily by something like ~~1kB of genetic information and the set of experiences people had. This is a pretty small amount of information.
Second, in current LLMs. We can get LLMs to behave roughly according to a wide variety of motivations, including intended motivations, scheming motivations and reward-seeking motivations. This is largely a function of how the training data maps onto pretraining priors (so this evidence is therefore not statistically independent of the human evidence). If we observe that RLing models on reward-hackable objectives causes them to be broadly misaligned, then we can tell the model that reward-hacking during training is ok, and the model doesn’t end up broadly misaligned.
I’m pointing at evidence that the motivations of agents aren’t overdetermined, which is in turn some evidence that developers can influence AI motivations if they can correctly identify the levers (which may be hard with status-quo behavioral oversight!). I’m definitely not claiming that alignment of sovereign superintelligence is easy. I think that alignment sufficiently robust to withstand sovereign superintelligent optimization is a narrow target (if people try to make sovereign superintelligence). But this is some reason why I think attaining trustworthy corrigible assistants of intermediate-but-transformative capability levels may be tractable.