Apparently the headcount for US corporations follows a power-law distribution, apart from mid-sized corporations, which fit a lognormal distribution better. I fit a power-law distribution to the data (after dropping all datapoints with over 40 employees, since including them gave a worse fit), which gave p(x) ∼ 399x^(−1.29). This seems to imply that there are ~400 independent AI safety researchers, since p(1) ≈ 399 (though note that p(x) is a probability density function and this estimate might be way off); Claude estimates 400-600 for comparison. Integrating this distribution over x ∈ [1, ∞) gives ~1400 (2 s.f.) total employees working on AI safety or safety-adjacent work (∞ might be a bad upper bound, as the largest orgs have <100 employees).
this might be a bad upper bound, as the largest orgs have <100 employees
I’ve seen you use “upper bound” and “underestimate” twice in this thread, and I want to lightly push back against this phrasing. I think the likely level of variance in these estimates is much too high for them to be any sort of bounds or to informatively be described as “under” any estimate.
That is, although your analysis of the data you have may imply, given perfect data, an underestimate or upper bound, I think your data is too imperfect for those to be useful descriptions with respect to reality. Garbage in, garbage out, as they say.
The numbers are still useful, and if you have a model of the orgs that you missed, the magnitude of your over-counting, or a sensitivity analysis of your process, these would be interesting to hear about. But these will probably take the form of approximate probability distributions rather than binary bigger-or-smaller-than-truth shots from the hip.
By “upper bound”, I meant “upper bound b on the definite integral ∫_a^b p(x) dx”. I.e., for the kind of hacky thing I’m doing here, the integral is very sensitive to the choice of bounds a, b. For example, the integral does not converge for a = 0. I think all my data here should be treated as incomplete and all my calculations crude estimates at best.
I edited the original comment to say “∞ might be a bad upper bound” for clarity.
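For concreteness, here is the closed form behind those statements, as a sketch that takes the fitted p(x) = 399x^(−1.29) at face value. The symbols C and α are just my labels for the fitted coefficient and exponent, and the b = 100 case is an illustrative cutoff suggested by the "largest orgs have <100 employees" remark, not something computed in the original comment.

```latex
\int_a^b C\,x^{-\alpha}\,dx
  = \frac{C}{\alpha - 1}\left(a^{1-\alpha} - b^{1-\alpha}\right),
  \qquad \alpha \neq 1,\ a > 0 .

% With C = 399, \alpha = 1.29, a = 1:
\lim_{b \to \infty} \int_1^b 399\,x^{-1.29}\,dx
  = \frac{399}{0.29} \approx 1376 \approx 1400 \ \text{(2 s.f.)}

\int_1^{100} 399\,x^{-1.29}\,dx
  = \frac{399}{0.29}\left(1 - 100^{-0.29}\right) \approx 1.0 \times 10^{3}

% As a \to 0^{+}, a^{1-\alpha} = a^{-0.29} \to \infty, so the integral diverges.
```

So under this fit, moving the upper bound from ∞ to 100 shifts the total by roughly a quarter, and the lower bound matters even more.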
Yeah, I think we’re in agreement. I’m just saying the phrase “upper bound” is less useful than, e.g., providing estimates for various bounds in a table or graph, along with a derivative of the results wrt the parameter estimate you inferred.
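A minimal sketch of what such a table could look like, assuming the fitted form p(x) = 399x^(−1.29) from the original comment. The function names and the particular values of b are hypothetical choices for illustration. The derivative holds the coefficient fixed while varying the exponent, which a real refit would not do, so treat the slopes as rough at best.

```python
import math


def power_law_integral(C, alpha, a, b):
    """Closed-form integral of C * x**(-alpha) over [a, b] (alpha != 1, a > 0)."""
    k = 1.0 - alpha
    return C / k * (b ** k - a ** k)


def d_integral_d_alpha(C, alpha, a, b, h=1e-5):
    """Central finite-difference derivative of the integral wrt the exponent.

    Assumption: C is held fixed while alpha varies.
    """
    upper = power_law_integral(C, alpha + h, a, b)
    lower = power_law_integral(C, alpha - h, a, b)
    return (upper - lower) / (2 * h)


# Fitted values quoted in the thread; the upper bounds b are illustrative.
C, ALPHA = 399.0, 1.29
for b in (40, 100, 1000, math.inf):
    total = power_law_integral(C, ALPHA, 1.0, b)
    slope = d_integral_d_alpha(C, ALPHA, 1.0, b)
    print(f"b={b:>6}: integral ~ {total:7.0f}, d/d(alpha) ~ {slope:9.0f}")
```

The same two functions could be evaluated over a grid of lower bounds a as well, since the integral blows up as a approaches 0.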
So, conduct a sensitivity analysis on the definite integral with respect to choices of integration bounds? I’m not sure this level of analysis is merited given the incomplete data and unreliable estimation methodology for the number of independent researchers. Like, I’m not even confident that the underlying distribution is a power law (instead of, say, a composite of power law and lognormal distributions, or a truncated power law), and the value of p(1) seems very sensitive to data in the vicinity, so I wouldn’t want to rely on this estimate except as a very crude first pass. I would support an investigation into the number of independent researchers in the ecosystem, which I would find useful.
I think again we are on the same page & this sounds reasonable, I just want to argue that “lower bound” and “upper bound” are less-than-informative descriptions of the uncertainty in the estimates.
I definitely think that people should not look at my estimates and say “here is a good 95% confidence interval upper bound of the number of employees in the AI safety ecosystem.” I think people should look at my estimates and say “here is a good 95% confidence interval lower bound of the number of employees in the AI safety ecosystem,” because you can just add up the names. I.e., even if there might be 10x the number of employees as I estimated, I’m at least 95% confident that there are more than my estimate obtained by just counting names (obviously excluding the 10% fudge factor).