Thanks for sharing your concerns and helping us be more calibrated on the value of this study.
I agree that a control group is vital for good science. Nonetheless, I think that such an experiment is valuable and informative, even if it doesn’t meet the high standards required by many professional science disciplines.
I believe in the necessity of acting under uncertainty. Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.
The biggest critique for me isn't the lack of a control group, but that they don't include a limitations section suggesting follow-up experiments with a control group. A lot can be forgiven if you're open and transparent about it.
I’ve only skimmed this post, but I suspect your principle of substitution is framed incorrectly. LLMs can make the situation worse even if a human could easily access the same information through other means (see “Beware Trivial Inconveniences”). I suspect that neglecting this factor causes you to significantly underrate the risks here.
> Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.
Note that if I thought regulations would be temporary, or had a chance of loosening over time after evals found that the risks from models at compute size X would not be catastrophic, I would be much less worried about all the things I’m worried about re: open source, power, and banning open source.
But I just don’t think that most regulations will be temporary. A large number of people want to move compute limits down over time. Some orgs (like PauseAI or anything Leahy touches) want much lower limits than are implied by the EO. And of course regulators in general are extremely risk averse, and the trend is almost always for regulations to increase.
If the AI safety movement could credibly promise in some way that they would actively push for laws whose limits rise over time in the default case, I’d be less worried. But given (1) the conflict on this issue within AI safety itself and (2) the default way that regulations work, I cannot make myself believe that “temporary regulation” is ever going to happen.
Oh, I don’t think it actually would end up being temporary, because I expect with high probability that the empirical results of more robust evaluations would confirm that open-source AI is indeed dangerous. I meant temporary in the sense that the initial restrictions might either a) have a time limit, or b) be subject to re-evaluation at a specified point.
> I agree that a control group is vital for good science. Nonetheless, I think that such an experiment is valuable and informative, even if it doesn’t meet the high standards required by many professional science disciplines.

> I believe in the necessity of acting under uncertainty. Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.
But… this study doesn’t provide evidence that LLMs increase bioweapon risk.
> But… this study doesn’t provide evidence that LLMs increase bioweapon risk.
Define evidence.
I’m not asking this just to be pedantic, but because I think it’ll make the answer to your objection clearer.
Evidence for X is when you see something that’s more likely in a world with X than in a world with some other condition not X.
Generally substantially more likely; for good reason many people only use “evidence” to mean “reasonably strong evidence.”
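The likelihood-ratio reading of "evidence" above can be sketched numerically. This is an illustrative sketch, not anything from the thread; the function name and the example probabilities are assumptions chosen only to show the difference between weak and strong evidence:

```python
def posterior_odds(prior_odds, p_obs_given_x, p_obs_given_not_x):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    likelihood_ratio = p_obs_given_x / p_obs_given_not_x
    return prior_odds * likelihood_ratio

# Weak evidence: the observation is only slightly more likely under X.
weak = posterior_odds(1.0, 0.6, 0.5)    # likelihood ratio of 1.2

# Strong evidence: the observation is much more likely under X.
strong = posterior_odds(1.0, 0.9, 0.1)  # likelihood ratio of 9.0
```

On the broad definition, both cases count as evidence for X (the ratio exceeds 1); on the stricter usage, only the second shifts the odds enough to be called "evidence" in ordinary conversation.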
Thanks for highlighting the issue with the discourse here. People use the word evidence in two different ways which often results in people talking past one another.
I’m using your broader definition, where I imagine that Stella is excluding things that don’t meet a more stringent standard.
And my claim is that reasoning under uncertainty sometimes means making decisions based on evidence[1] that is weaker than we’d like.
[1] Broad definition.