Do you agree the feedback loops for capabilities are better right now?
Yes, primarily due to the asymmetry that capabilities research can iterate on existing systems while alignment research is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails, since that leaves more time for safety work, depending on whether this also applies to the next actor, etc.
I was thinking both of this, and also that the two seem quite correlated due to the lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; to get there, I feel like I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)
I think studying scheming in current/future systems continues to have worse feedback loops? Like, suppose our DAI-level system wants to study scheming in a +3 SD DAI-level system. This is structurally kind of tricky because schemers try to avoid detection. I agree that having access to capable AIs makes it much easier to get good feedback loops, but there is an asymmetry.
Yeah, that’s fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic, because they likely do not have to deal with long time horizons, whereas capabilities work likely will.
Note that many alignment agendas don’t need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but will probably face the long-horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.
(Though note that it’s unclear whether this progress will mitigate scheming risk.)