Apollo’s scheming evals have value only if they publish them; the primary purpose is awareness among policymakers and the public. Also, the evals are useful information to the safety community. I don’t think the risks of publishing outweigh the benefits, especially because I think it’ll be easy to detect whether the AI suspects the scratchpad might be read, and to mitigate that.
> the primary purpose is awareness among policymakers and the public
If the success or failure of current techniques provides no evidence about future AI, then isn’t this dishonest? Maybe we are OK with dishonesty here, but if you are right, then this is bound to backfire.
> the evals are useful information to the safety community
What use do the evals have for the safety community, from a research perspective? If they are mostly junk, then publishing them would seem more misleading than anything, given how many people think they ought to be trusted.
To clarify, I think that current propensity evals provide little information about the scheming propensity of future models. BUT the value per bit is very high, so these evals are still worthwhile.
Also, any talk of “informative” is always relative to a prior distribution. An experiment can therefore be highly informative to policymakers but not highly informative to safety researchers, given that these groups have different priors.
Like, I expect that on policymaker priors there is high mutual information between “Claude 4 schemes” and “Claude 8 schemes”, because they put substantial probability on “AIs never scheme”. But for the safety community there is lower mutual information, because we assign less probability to that latent outcome.
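To make that concrete, here’s a minimal sketch with made-up numbers (`p_never` and the per-model scheming probability are purely illustrative, not anyone’s actual priors): a toy latent-variable model in which “AIs never scheme” gets prior weight `p_never`, and otherwise each model schemes independently with a fixed probability. The mutual information between the two observations then depends entirely on the prior:

```python
import math

def mutual_information(p_never, p_scheme=0.9):
    """I(X;Y) in bits, where X = "Claude 4 schemes" and Y = "Claude 8
    schemes" are conditionally independent given a latent state: with
    prior probability p_never, AIs never scheme (both observations
    come up 0); otherwise each model schemes with probability p_scheme."""
    states = [(p_never, 0.0), (1 - p_never, p_scheme)]
    # Joint p(x, y), marginalizing over the latent state.
    joint = {
        (x, y): sum(w * (q if x else 1 - q) * (q if y else 1 - q)
                    for w, q in states)
        for x in (0, 1) for y in (0, 1)
    }
    px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Policymaker-ish prior: substantial mass on "AIs never scheme".
print(mutual_information(p_never=0.5))   # ~0.56 bits
# Safety-researcher-ish prior: little mass on that outcome.
print(mutual_information(p_never=0.05))  # ~0.05 bits
```

On this toy model the same eval result carries roughly ten times as much information under the policymaker prior as under the safety-researcher prior, which is the sense in which one experiment can be worth publishing for one audience while being nearly uninformative for the other.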