The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you’re in a subjunctive case where we’ve assumed the DAI is not scheming. Whence the pessimism?
I don’t think this is the only disanalogy. It seems to me like getting AIs to work efficiently on automating AI R&D might not result in solving all the problems you need to solve for it to be safe to hand off ~all x-safety labor to AIs. This is a mix of capabilities, elicitation, and alignment. This is similar to how a higher level of mission alignment might be required for AI company employees working on conceptually tricky alignment research relative to advancing AI R&D.
Another issue is that AI societies might, over some longer period, go off the rails in a way that doesn’t eliminate AI R&D productivity but would be catastrophic from an alignment perspective.
This isn’t to say there isn’t anything which is hard to check or conceptually tricky about AI R&D, just that the feedback loops seem much better.
I’m not really following where the disanalogy is coming from (like, why are the feedback loops better?)
Sure, AI societies could go off the rails in a way that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D but not alignment R&D. Not sure why I should expect one rather than the other.
Although on further reflection, even though the current DAI isn’t scheming, alignment work still has to be doing some worst-case type thinking about how future AIs might be scheming, whereas this is not needed for AI R&D. I don’t think this makes a big difference—usually I find worst-case conceptual thinking to be substantially easier than average-case conceptual thinking—but I could imagine that causing issues.
Do you agree the feedback loops for capabilities are better right now?
Sure, AI societies could go off the rails in a way that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D but not alignment R&D. Not sure why I should expect one rather than the other.
For this argument it’s not a crux that it is asymmetric (though due to better feedback for AI R&D I think it actually is). E.g., suppose that in 10% of worlds safety R&D goes totally off the rails while capabilities proceed, and in 10% of worlds capabilities R&D goes totally off the rails while safety proceeds. This still results in an additional 10% takeover risk from the subset of worlds where safety R&D goes off the rails. (Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.)
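To make the arithmetic explicit, here is a rough sketch, where $r_{\text{base}}$, $r_{\text{safety}}$, and $r_{\text{cap}}$ are illustrative labels (not figures from the discussion) for takeover risk in the baseline worlds, the safety-off-the-rails worlds, and the capabilities-off-the-rails worlds respectively:

$$\Delta\text{Risk} \approx 0.10\,\big(r_{\text{safety}} - r_{\text{base}}\big) \;-\; 0.10\,\big(r_{\text{base}} - r_{\text{cap}}\big)$$

The first term is the extra risk from the worlds where safety R&D goes off the rails (roughly the full 10 percentage points if takeover is near-certain there and unlikely otherwise); the second term is whatever offsetting reduction comes from the worlds where capabilities R&D goes off the rails, which is zero if the extra time doesn’t actually translate into more safety.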
Do you agree the feedback loops for capabilities are better right now?
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.
I was thinking both of this, and also that the two going off the rails seems quite correlated given the lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; I feel like to get to that I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)
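As a rough illustration of the correlation point, with $p_S$, $p_C$, and $q$ as assumed labels for the probability that safety R&D goes off the rails, that capabilities R&D goes off the rails, and that both do:

$$P(\text{exactly one goes off the rails}) = p_S + p_C - 2q$$

Under independence with $p_S = p_C = 0.15$ this is about $0.26$; if the two failure modes are strongly correlated (as you’d expect absent structural asymmetry), $q$ approaches $0.15$ and the probability of exactly one falls toward zero.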
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
I think studying scheming in current/future systems continues to have worse feedback loops? Like, suppose our DAI-level system wants to study scheming in a system 3 SD above the DAI level. This is structurally kinda tricky because schemers try to avoid detection. I agree that having access to capable AIs makes it much easier to get good feedback loops, but there is an asymmetry.
Yeah, that’s fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.
Note that many alignment agendas don’t need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.
(Though note that it’s unclear whether this progress will mitigate scheming risk.)