Main critique is that we will see a regime change from a safe to a dangerous regime in which our safety guardrails have to hold on the first critical try. We see all sorts of misbehaviors when these models first come out, why should we only look at the nice examples like opus acting like a friend. Why not look at MechaHitler coming up, if we were in an unsafe regime, where this type of misbehavior would kill us, we would already be dead. If it turns out our safety methods don’t work and the model has the option to kill you and does kill you, you don’t get another try.
Maybe you doubt that there will be a transition to a dangerous regime, some people think that we will see continuous iterative steps with not too much changing on each step. But gradual development does not mean we won’t get to critical threshold eventually. You can know that there will be a transition from a safe to a dangerous regime by just looking at what likely good futures look like. If models run the economy fully, automating all physical or mental labor, that is obviously a dangerous state.
Most of our current methods will at most give us a warning but won’t tell us how to build anything safe. Evaluations, responsible gradual scaling or simple interpretability might tell us that models are getting close to being dangerous or are perhaps thinking of scheming, but these methods won’t help us to build systems that won’t do that.
We will not be able to learn a lot from empirical observations of current models and we won’t have a good way to learn from failed takeovers. Current alignment tests are probably not going to be predictive of future behavior; currently a model insider training in some elaborate story does not really make sense to the model. That is very different from a model that reflects on its state in the world and that realizes that its optimal and rational move is to takeover.
“Why would they suddenly start having thoughts of taking over, if they never have yet?”
In other words, when we see scheming now, it is probably more like acting in a story; in the future it will be the result of instrumental convergence. Instrumental convergence means that it is true that takeover is a better way for the AI to fulfill its goals than not taking over. You can’t fine tune the truth away, fine tuning against scheming in those stories won’t make it false that humans being in charge is bad for the AI fulfilling its goals.
We can’t easily learn from failed coups and warning shots. People working on AI control are trying to build setups that catch the AI scheming, but then we rely on that setup working on the first try.
Also, fundamentally there is another critical threshold when AI models are smarter than us and are fully aware of what we are trying to do. For example, for inoculation prompting this means: “Another problem with this is that in the end, we’re going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think ‘Now with this inoculation prompt, that’s how they want me to behave. They want me to behave non-deceptively when it’s there. If it’s not there, they want me to behave deceptively.’ It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.” This could also be the case with AI control, an AI system could stage a “fake” coup knowing exactly what conclusions we would draw and what actions we would take in response to it.
I don’t think we are going to get intent alignment, but intent alignment is not a safe outcome. We will not get alignment by default by training an enormous black box system for some reward that loosely matches what humans want. If opus develops some preference for acting toward humans like a friend, what does this look like in the limit? Is keeping humans in charge really the best way to fulfill these preferences? We like dairy, so we bred cows to give us a lot of milk and keep them in farms.
“Broad Basin of alignment”—I don’t know if basin of alignment is a useful thing to think about. I assume it is not broad as in no easy to hit, given that things like corrigible agents seem pretty hard to think about in a coherent way. Like what goal could allow modification to the goal or why would you want to be shutdownable? If your goal is to do whatever the humans want you to do, you still can’t do that when you are dead or shut off. Or you could find better ways to find out what the humans want. But we probably won’t get that far from gradient descent fine-tuning on examples of doing what humans ask you to do.
Alignment won’t happen by default
Main critique is that we will see a regime change from a safe to a dangerous regime in which our safety guardrails have to hold on the first critical try. We see all sorts of misbehaviors when these models first come out, why should we only look at the nice examples like opus acting like a friend. Why not look at MechaHitler coming up, if we were in an unsafe regime, where this type of misbehavior would kill us, we would already be dead. If it turns out our safety methods don’t work and the model has the option to kill you and does kill you, you don’t get another try.
Maybe you doubt that there will be a transition to a dangerous regime, some people think that we will see continuous iterative steps with not too much changing on each step. But gradual development does not mean we won’t get to critical threshold eventually. You can know that there will be a transition from a safe to a dangerous regime by just looking at what likely good futures look like. If models run the economy fully, automating all physical or mental labor, that is obviously a dangerous state.
Most of our current methods will at most give us a warning but won’t tell us how to build anything safe. Evaluations, responsible gradual scaling or simple interpretability might tell us that models are getting close to being dangerous or are perhaps thinking of scheming, but these methods won’t help us to build systems that won’t do that.
We will not be able to learn a lot from empirical observations of current models and we won’t have a good way to learn from failed takeovers. Current alignment tests are probably not going to be predictive of future behavior; currently a model insider training in some elaborate story does not really make sense to the model. That is very different from a model that reflects on its state in the world and that realizes that its optimal and rational move is to takeover.
“Why would they suddenly start having thoughts of taking over, if they never have yet?”
In other words, when we see scheming now, it is probably more like acting in a story; in the future it will be the result of instrumental convergence. Instrumental convergence means that it is true that takeover is a better way for the AI to fulfill its goals than not taking over. You can’t fine tune the truth away, fine tuning against scheming in those stories won’t make it false that humans being in charge is bad for the AI fulfilling its goals.
We can’t easily learn from failed coups and warning shots. People working on AI control are trying to build setups that catch the AI scheming, but then we rely on that setup working on the first try.
Also, fundamentally there is another critical threshold when AI models are smarter than us and are fully aware of what we are trying to do. For example, for inoculation prompting this means: “Another problem with this is that in the end, we’re going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think ‘Now with this inoculation prompt, that’s how they want me to behave. They want me to behave non-deceptively when it’s there. If it’s not there, they want me to behave deceptively.’ It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.” This could also be the case with AI control, an AI system could stage a “fake” coup knowing exactly what conclusions we would draw and what actions we would take in response to it.
I don’t think we are going to get intent alignment, but intent alignment is not a safe outcome. We will not get alignment by default by training an enormous black box system for some reward that loosely matches what humans want. If opus develops some preference for acting toward humans like a friend, what does this look like in the limit? Is keeping humans in charge really the best way to fulfill these preferences? We like dairy, so we bred cows to give us a lot of milk and keep them in farms.
“Broad Basin of alignment”—I don’t know if basin of alignment is a useful thing to think about. I assume it is not broad as in no easy to hit, given that things like corrigible agents seem pretty hard to think about in a coherent way. Like what goal could allow modification to the goal or why would you want to be shutdownable? If your goal is to do whatever the humans want you to do, you still can’t do that when you are dead or shut off. Or you could find better ways to find out what the humans want. But we probably won’t get that far from gradient descent fine-tuning on examples of doing what humans ask you to do.