If you want counterarguments, here’s one good place to look: Object-Level AI Risk Skepticism—LessWrong
I expect we might get more today, as it’s the deadline for the Open Philanthropy AI Worldview Contest.
In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A, which would identify it as a misaligned model, because overseers wanted it to take action B. That’s the definition of a differential adversarial example.
If there were an unaligned model with no differential adversarial examples in training, that would be an example of a perfect proxy, not deceptive alignment. That’s outside the scope of this post. But also, if the goal were to follow directions subject to ethical constraints, what would that perfect proxy be? What would result in the same actions across a diverse training set? It seems unlikely that you’d get even a near-perfect proxy here. And even if you did get something fairly close, the model would understand the necessary concepts for the base goal at the beginning of reinforcement learning, so why wouldn’t it just learn to care about that? Setting up a diverse training environment seems likely to be a training strategy by default.
I have a whole section on the key assumptions about the training process and why I expect them to be the default. It’s all in line with what’s already happening, and the labs don’t have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?
Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I’m explicitly not addressing other failure modes in this post.
What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don’t know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?
Which assumptions are wrong? Why?
I don’t think that the specific ways people give feedback are very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I’m assuming that this is a process that enables TAI to emerge, especially the first time, and asking people who don’t know about a topic to give feedback probably won’t be the strategy that gets us there. Does that answer your question?
From Ajeya Cotra’s post that I linked to:
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
It’s not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.
Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn’t directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, so it should not emerge. The one exception would be if it were already deceptively aligned, but this is a discussion of how deceptive alignment might emerge, so we are assuming that the model isn’t (yet) deceptively aligned.
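To make the myopia point concrete, here is a minimal toy sketch of a next-token training loss. It assumes standard teacher forcing, where the context at each position comes from the training text rather than from the model's own earlier outputs. The vocabulary, probabilities, and the function name `next_token_losses` are all made up for illustration and are not taken from any real training setup.

```python
import math

def next_token_losses(predicted_dists, target_tokens):
    """Per-position cross-entropy: -log p(correct token) at each position."""
    return [-math.log(dist[tok]) for dist, tok in zip(predicted_dists, target_tokens)]

# Hypothetical three-token training text and the model's predicted distributions.
targets = ["the", "cat", "sat"]
preds = [
    {"the": 0.5, "cat": 0.25, "sat": 0.25},
    {"the": 0.25, "cat": 0.5, "sat": 0.25},
    {"the": 0.25, "cat": 0.25, "sat": 0.5},
]
baseline = next_token_losses(preds, targets)

# Change only the prediction at position 1. Under teacher forcing, the inputs
# to the other positions come from the training text, so their loss terms
# are unaffected by this change.
altered = [dict(d) for d in preds]
altered[1] = {"the": 0.125, "cat": 0.75, "sat": 0.125}
changed = next_token_losses(altered, targets)

print(baseline)
print(changed)
```

In this toy setup, a better prediction at one position lowers that position's loss term and leaves the others alone, which is the sense in which nothing beyond next-token performance depends directly on the output.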
I expect pre-training to create something like a myopic prediction goal. Accomplishing this goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When the training mechanism switches to reinforcement learning, the model will not be deceptively aligned, and its goals will therefore evolve. The goals acquired in pre-training won’t be dangerous and should shift when the model switches to reinforcement learning.
This model would understand consequentialism, as do non-consequentialist humans, without having a consequentialist goal.
I’d be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results.
Thanks for sharing your perspective! I’ve written up detailed arguments that deceptive alignment is unlikely by default. I’d love to hear what you think of it and how that fits into your view of the alignment landscape.
Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”
Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.
However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.
To become deceptively aligned, a model needs situational awareness, a long-term goal, a way to tell whether it’s in training, and a way to identify a base goal that differs from its internal goal. To become corrigibly aligned, all the model has to do is infer the training objective and then point at that. The latter scenario seems much more likely.
Because we will likely start with something that includes a pre-trained language model, the research process will almost certainly include a direct description of the base goal. It would be weird for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for that. The key concepts should already exist from pre-training.
Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I’d be interested to hear what you think of it!
This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I’d love to hear what you think of it and discuss further!
I recently made an inside view argument that deceptive alignment is unlikely. It doesn’t cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I’d love to hear what you think of it!
This is an interesting point, but it doesn’t undermine the case that deceptive alignment is unlikely. Suppose that a model doesn’t have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn’t understand the correct abstraction, it can’t instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can’t be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction. The model’s goal will still point to that abstraction, and its alignment will improve. This should continue to happen until the model’s abstraction of the base goal is correct. For more details, see my comment here.
Nate, please correct me if I’m wrong, but it looks like you:
Skimmed, but did not read, a 3,000-word essay
Posted a 1,200-word response that clearly stated that you hadn’t read it properly
Ignored a comment by one of the post’s authors saying you thoroughly misunderstood their post and a comment by the other author offering to have a conversation with you about it
Found a different person to talk to about their views (Ronny), who also had not read their post
Participated in a 7,500-word dialogue with Ronny in which you speculated about what the core arguments of the original post might be and your disagreements
You’ve clearly put a lot of time into this. If you want to understand the argument, why not just read the original post and talk to the authors directly? It’s very well-written.