See Ryan Kidd on How MATS addresses “mass movement building” concerns. One useful-ish framing from that post is that safety is more neglected than capabilities, so you should expect the marginal FTE on safety to add more than the marginal FTE on capabilities.
I’ll make up numbers to illustrate the point: let’s say there are 10,000 capabilities researchers and 1,000 AI safety researchers, which is roughly a 10:1 ratio. Now say MATS funnels 500 people into capabilities and 500 into AI safety. Now it’s a 10,500:1,500 ratio, or 7:1, which is much better. And I think 50% of MATS mentees going into capabilities is a very pessimistic assumption.
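To make the made-up arithmetic easy to poke at, here is a minimal sketch in Python. The function names are mine, and the numbers are the illustrative ones from above, not real headcounts. It also works out a break-even split: under this simple headcount model, the ratio only gets worse if newcomers go into capabilities at a higher rate than the existing field does.

```python
from fractions import Fraction


def capabilities_per_safety(capabilities: int, safety: int) -> Fraction:
    """Capabilities researchers per safety researcher, as an exact fraction."""
    return Fraction(capabilities, safety)


def break_even_capabilities_share(capabilities: int, safety: int) -> Fraction:
    """Share of newcomers going to capabilities that leaves the ratio unchanged.

    Adding people in the same proportion as the existing field keeps the
    ratio fixed, so the break-even share is capabilities / (capabilities + safety).
    """
    return Fraction(capabilities, capabilities + safety)


# Illustrative (made-up) numbers from the comment above.
before = capabilities_per_safety(10_000, 1_000)
after = capabilities_per_safety(10_000 + 500, 1_000 + 500)
break_even = break_even_capabilities_share(10_000, 1_000)

print(f"before: {before}:1, after: {after}:1")
print(f"break-even capabilities share of newcomers: {float(break_even):.1%}")
```

With these toy numbers the break-even share is 10/11, so a 50/50 split is far on the good side of it. This treats every researcher as interchangeable, which is obviously a simplification.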
According to the MATS Alumni Impact Analysis, the breakdown is as follows (the first two categories sum to 78%):
49% are “Working/interning on AI alignment/control.”
29% are “Conducting alignment research independently.”
1.4% are “Working/interning on AI capabilities.”
If we (perhaps unfairly) ignore the independent researchers, then that’s about 3 to capabilities for every 100 to safety!
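The "3 per 100" figure is just 1.4 divided by 49. A quick sketch, using only the three percentages quoted above (the dictionary keys are my own labels, not the survey's):

```python
# Percentages quoted above from the MATS Alumni Impact Analysis.
shares = {
    "alignment_employed": 49.0,     # "Working/interning on AI alignment/control."
    "alignment_independent": 29.0,  # "Conducting alignment research independently."
    "capabilities": 1.4,            # "Working/interning on AI capabilities."
}

# Capabilities alumni per 100 safety alumni, excluding independent researchers.
per_100_employed_only = 100 * shares["capabilities"] / shares["alignment_employed"]

# Same figure if independent alignment researchers are counted as safety.
safety_total = shares["alignment_employed"] + shares["alignment_independent"]
per_100_all_safety = 100 * shares["capabilities"] / safety_total

print(f"excluding independents: {per_100_employed_only:.1f} per 100")
print(f"including independents: {per_100_all_safety:.1f} per 100")
```

Counting the independent researchers as safety brings it down to a bit under 2 per 100.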
This also ignores the effects of having capabilities researchers who are more clued up on / familiar with AI safety. You might think this is net-positive (because they are marginally more likely to update on new evidence and pivot to safety) or net-negative (because they misuse our technical insights). I think this probably shakes out to be net-positive.
But if locally robust alignment speeds up capabilities progress without being the kind of locally robust alignment that can become asymptotically robust alignment and give us a win node, then it’s not only less obviously net-positive, it might well be primarily made of backfire. I don’t believe the most extreme version of this claim, but I used to, and I know people who do. The “people going into capabilities” category also includes “local alignment that doesn’t generalize to asymptotic alignment”.
The naive way to get asymptotic alignment, and possibly the only way, is to have locally robust alignment that reliably stays ahead of the impactfulness of deployment. But if impactfulness of deployment is always operating at 110% of local robustness, that seems like a recipe for disaster. So the question is ultimately about how to stay inside the local validity window indefinitely, even as it gets harder to be sure we are still inside it.
Also there’s the whole thing about how alignment is fundamentally about figuring out what to ask for, and being robust about that is a confusing question at best. Most of MATS seems too focused on the (very hard!) problems of achieving local-alignment-good-enough-to-keep-going to spend the necessary effort to succeed at the really hard parts later (which is not obviously a mistake on MATS’s part, but might well be).
It’s much easier to get reliable robustness for local alignment when you don’t have to solve “what should a thing overwhelmingly smarter than us do?” yet, but if there’s a hole coming where a thing moderately smarter than us has the ability to confuse us into choosing a plan that seems good to it but which is not what we want, we may fall off the manifold we would have wanted to be on if we had been able to see things coming without getting confused.
I now think the net-neutral ratio of new median safety researchers to new median capabilities researchers is probably closer to 1:100 (though I think 1:10 isn’t crazy). However, the longitudinal MATS career data indicates our alumni are employed in safety versus capabilities roles at a ratio of 8:1.