1: My understanding is that the classic arguments go something like this: Assume interpretability won’t work (illegible CoT, probes don’t catch most problematic things). Assume we’re training our AI on diverse tasks and human feedback, so it’ll sometimes get reinforced for deception. Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve, and deception is often a useful instrumental strategy. Assume that alien or task-focused drives win out over potential drives like honesty because they’re favoured by inductive biases. You get convergent deception.
I’m guessing interpretability working is the main crux for you, and that, together with inductive biases potentially favouring nice behaviours, it drives this story to low probability. Is that right?
A different story with interpretability at least somewhat working would be the following:
We again have deception by default, because of human reinforcement for sycophancy and for appearing to solve problems (as with o3), as well as because of inductive biases towards alien goals. This time, however, our interpretability methods work: since the AI is smart enough to know when it’s being deceptive, we can catch correlates of that representation with interpretability techniques.
Within the project developing the AI, the vibes and compute commitments are that the main goal is to go fast and outcompete others, and a secondary goal is to be safe, which gets maybe 1-10% of resources. As we go along, the deception monitors are constantly going off. People debate what to do about it. They conclude that they can afford some amount of control techniques: resample the most deceptive outputs, investigate further, etc., but mostly keep using the model. They then find training techniques that don’t train directly on their deception classifier but are evaluated against it, and that instead train on some subset of the classifications of another detector. This reduces measured deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful, and we’re now in the first scenario, where interpretability doesn’t work.
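To make that Goodharting step concrete, here’s a toy sketch (purely illustrative; the activation model, directions, and numbers are all my own assumptions, not anyone’s actual setup): a linear probe is fit on synthetic “activations” where deception shows up partly through a surface correlate; optimizing away that correlate, as indirect pressure from a second detector might do, makes the probe stop firing even though the underlying deception direction is untouched.

```python
# Toy sketch of probes being hollowed out by indirect optimization pressure.
# Everything here (dimensions, directions, shift sizes) is an assumption for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 5000                      # hypothetical activation dim and sample count

# Deceptive episodes shift activations along w_true (the actual deceptive
# computation) and, more strongly, along w_surface (a correlated surface tell
# that a second detector keys on).
w_true = rng.normal(size=d); w_true /= np.linalg.norm(w_true)
w_surface = rng.normal(size=d); w_surface /= np.linalg.norm(w_surface)

is_deceptive = rng.random(n) < 0.5
acts = rng.normal(size=(n, d))
acts += np.outer(is_deceptive * 0.5, w_true)     # weak signal from the real cause
acts += np.outer(is_deceptive * 2.0, w_surface)  # strong signal from the surface tell

probe = LogisticRegression(max_iter=1000).fit(acts, is_deceptive)
print("probe accuracy before indirect optimization:", probe.score(acts, is_deceptive))

# "Training against the other detector": the surface tell gets optimized away,
# but nothing about the underlying reinforcement or inductive biases changes,
# so the deceptive computation (the w_true component) is still there.
acts_after = acts - np.outer(is_deceptive * 2.0, w_surface)
print("probe accuracy after indirect optimization:", probe.score(acts_after, is_deceptive))
```

In this toy the probe’s accuracy collapses towards chance once the surface correlate is removed, while the deception itself persists by construction; the point is just that a probe trained partly on correlates can be degraded by optimization pressure it was never directly trained against.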
What are your thoughts on this scenario?
2: Assume that the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies / projects deploying them have increasing their own power as their main goal, constrained by not wanting to look bad to peers, etc. I’m not sure they’d use the AIs in a way that lets the AIs push back on orders aimed at making the companies more powerful, even though the AIs are aligned to a constitution or similar that would have them push for a nicer society with coordination mechanisms etc. I do agree that the main projects will very likely ask their AIs to set up coordination between them, so they can stay in power and have enough breathing room to do so. We wouldn’t have full-on gradual disempowerment there, but rather a concentration of power. Then, possibly because there’s ideological pressure against concentration of power, people will prevent the setup of these coordination mechanisms in general, and we get a distributed AI world with AIs aligned to individual users and constitutions, but unable to help them set up coordination mechanisms.
I think we’re unlikely to block coordination mechanisms (e.g. China is already pretty centralised and probably wouldn’t do that within itself), but I’m still curious about your thoughts on this.