I generally agree with Sam’s takes. This is also what I meant by my two “fake graphs”:
The “green graph” corresponds to the type of misalignment Ryan describes in this post, which is less adversarial and more observable. It is a real problem but also one on which we are making progress. However, as my graph indicates, I don’t think the rate of progress matches the rate at which deployment is growing in higher-stakes situations (including using AIs for capability and alignment research).
The “red graph” corresponds to the more “scheming” or “adversarial” setting where AIs act covertly to subvert training or monitoring and pursue their own long-term goals. Like Sam, I don’t see this happening now. I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating, since at the moment there is not much to mitigate and it is unclear whether this will change. So the focus should be on how we make sure that we will find out if this changes.
I agree that it’s very unlikely that current AIs are scheming/egregiously misaligned/serious adversaries. (I assume that’s what you meant by “happening now”.)
I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating, since at the moment there is not much to mitigate and it is unclear whether this will change. So the focus should be on how we make sure that we will find out if this changes.
This is a long-standing disagreement, but:
I think it’s totally plausible that AIs (prior to human obsolescence) will be schemers and that we won’t have great/legible evidence of this. Conditional on AIs at the point of full AI R&D automation being schemers, I’d guess a 50% chance of “smoking gun” (to me) evidence and a 75% chance of decently strong model organisms evidence? This will depend a lot on what you count. So I don’t see how to make sure we will find out if this changes, certainly not in a way that’s clearly going to be sufficiently legible.
Even if we have great evidence AIs are schemers and people agree it’s likely that the next/new model is scheming against us competently, that doesn’t solve our problems and it’s pretty likely that AIs would still be deployed internally in this situation.
Relatedly, I think people’s stated views about whether AIs are scheming will be seriously affected by whether there are plausible paths to mitigation. Idk what this implies about the upsides to mitigation work.
Thus, I think working on mitigations now is important, as our level of clarity might not greatly improve (it will be clear that AIs could do way more damaging stuff if they were scheming against us), and in the (IMO totally plausible) situation where AIs are scheming, we’ll be scrambling to mitigate.
I expect another core disagreement is the overall importance of this relative to other AIs where I generally am much more fixated on mitigating AI takeover risk (when thinking about technical AI safety work).
Minor but on:
I don’t think the rate of progress matches the rate at which deployment is growing in higher-stakes situations
I don’t really buy that we should think of the required level of (mundane) alignment as being determined by the stakes of the deployment; rather, I think it’s better to think of this as: (1) being required at some high level of capability and (2) that mundane misalignment may differentially slow progress in safety and societal preparedness (relative to capabilities). I do agree with you that as AIs get more capable, the stakes of the deployments will naturally greatly increase due to using AIs to automate safety work and (at a high level of capability) fully hand this off, which is extremely high stakes.