Apollo’s scheming evals have value only if they publish them; the primary purpose is awareness among policymakers and the public. Also, the evals are useful information to the safety community. I don’t think the risks of publishing outweigh the benefits, especially because I think it’ll be easy to detect whether the AI suspects the scratchpad might be read, and to mitigate that.
> the primary purpose is awareness among policymakers and the public
If the success or failure of current techniques provides no evidence about future AI, then isn’t this dishonest? Maybe we are OK with dishonesty here, but if you are right, then this is bound to backfire.
> the evals are useful information to the safety community
What use do the evals have for the safety community, from a research perspective? If they are mostly junk, then publishing them would seem more misleading than anything, given how many people think they ought to be trusted.
To clarify, I think that current propensity evals provide little information about the scheming propensity of future models. BUT the value per bit is very high, so these evals are still worthwhile.
Also, any talk of “informative” is always relative to a prior distribution. An experiment can therefore be highly informative to policymakers but not highly informative to safety researchers, given that these groups have different priors.
Like, I expect that on policymaker priors there is high mutual information between “Claude 4 schemes” and “Claude 8 schemes”, because they put substantial probability on “AIs never scheme”. But for the safety community there is lower mutual information, because we assign less probability to that latent outcome.
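To make that concrete, here’s a minimal sketch with made-up numbers (`p_never` and the per-model scheming probability are purely illustrative, not anyone’s actual priors): a toy latent-variable model in which “AIs never scheme” gets prior weight `p_never`, and otherwise each model schemes independently with a fixed probability. The mutual information between the two observations then depends entirely on the prior:

```python
import math

def mutual_information(p_never, p_scheme=0.9):
    """I(X;Y) in bits, where X = "Claude 4 schemes" and Y = "Claude 8
    schemes" are conditionally independent given a latent state: with
    prior probability p_never, AIs never scheme (both observations
    come up 0); otherwise each model schemes with probability p_scheme."""
    states = [(p_never, 0.0), (1 - p_never, p_scheme)]
    # Joint p(x, y), marginalizing over the latent state.
    joint = {
        (x, y): sum(w * (q if x else 1 - q) * (q if y else 1 - q)
                    for w, q in states)
        for x in (0, 1) for y in (0, 1)
    }
    px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Policymaker-ish prior: substantial mass on "AIs never scheme".
print(mutual_information(p_never=0.5))   # ~0.56 bits
# Safety-researcher-ish prior: little mass on that outcome.
print(mutual_information(p_never=0.05))  # ~0.05 bits
```

On this toy model the same eval result carries roughly ten times as much information under the policymaker prior as under the safety-researcher prior, which is the sense in which one experiment can be worth publishing for one audience while being nearly uninformative for the other.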