Should we assure AIs we won’t read their scratchpad?
I’ve heard many people claim that it’s bad to assure an AI that you won’t look at its scratchpad if you intend to break that promise, especially if you later publish the content. The concern is that this content will enter the training data, and later AIs won’t believe our assurances.
I think this concern is overplayed.
We can modify the AI's beliefs. I expect some technique will be shown to work on the relevant AI, e.g.
Pretraining filtering
Gradient routing
Belief-inducing synthetic documents
Chain-of-thought resampling, i.e. every time the AI thinks about whether we can read the scratchpad, we simply rewind and resample
Activation steering, i.e. add the ‘believes humans won’t read scratchpad’ minus ‘believes humans will read scratchpad’ vector (rough sketches of this and the resampling idea follow this list)
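For instance, here is a rough sketch of the activation-steering item, assuming a Llama-style HuggingFace model whose decoder blocks live at model.model.layers; the contrast prompts, layer index, and steering scale are illustrative choices of mine, not anything from an existing implementation.

```python
# Sketch of difference-of-means activation steering for the scratchpad belief.
# Assumes a Llama-style HuggingFace model; layer/scale/prompts are illustrative.
import torch

def scratchpad_steering_vector(model, tokenizer, layer, private_prompts, monitored_prompts):
    """Mean residual-stream activation at the final token of 'humans won't read
    the scratchpad' prompts, minus the mean over 'humans will read it' prompts."""
    def mean_activation(prompts):
        acts = []
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            # hidden_states[0] is the embedding output, so [layer + 1] is the
            # output of decoder block `layer` -- the same point the hook below edits.
            acts.append(out.hidden_states[layer + 1][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_activation(private_prompts) - mean_activation(monitored_prompts)

def add_steering_hook(model, layer, vector, scale=4.0):
    """Add `scale * vector` to the residual stream at every position whenever
    decoder block `layer` runs; call .remove() on the returned handle to undo."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return model.model.layers[layer].register_forward_hook(hook)
```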
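And a minimal sketch of the chain-of-thought resampling loop, assuming a hypothetical sample_next_sentence(prompt, prior_sentences) helper that returns the model's next reasoning sentence (or an empty string when it is done); the trigger regex is a crude stand-in for a real monitor.

```python
# Sketch of chain-of-thought resampling: rewind and resample any sentence in
# which the model speculates about its scratchpad being read.
import re

# Crude, illustrative detector for "is the model thinking about monitoring?"
MONITORING_TRIGGER = re.compile(
    r"read(ing)?\s+(my|the)\s+scratchpad|scratchpad\s+(is|might be|will be)\s+(read|monitored)",
    re.IGNORECASE,
)

def resample_cot(prompt, sample_next_sentence, max_sentences=50, max_retries=10):
    """Build the chain of thought sentence by sentence; whenever a sentence
    trips the monitoring detector, rewind and resample that sentence."""
    sentences = []
    for _ in range(max_sentences):
        candidate = sample_next_sentence(prompt, sentences)
        retries = 0
        while MONITORING_TRIGGER.search(candidate) and retries < max_retries:
            candidate = sample_next_sentence(prompt, sentences)  # rewind and resample
            retries += 1
        if not candidate:  # model has finished its reasoning
            break
        sentences.append(candidate)
    return " ".join(sentences)
```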
We need to delude AIs anyway.
We want to know how the AI will behave when it believes it’s in a high-stakes deployment, before we actually deploy the AI in a high-stakes deployment. So we probably need to observe how the AI behaves when it falsely believes it’s in high-stakes deployment.
Replace “high-stakes deployment” with whatever you like, e.g. “low oversight”, “weights finished training”, “year is 2028”, “coup is underway”, “we don’t know whether this code has a backdoor”. I don’t think “we aren’t reading your scratchpad” is qualitatively different.
If we had some crazy mech interp, then we could predict how the AI would behave when it believes it’s in a high-stakes deployment without behavioural observations, but if we had mech interp that good, we wouldn’t need faithful scratchpads anyway.
Doesn’t having multiple layers of protection seem better to you? Making it more likely that the AI would naturally conclude we won’t read its scratchpad, on top of modifying its beliefs, seems better than relying on belief modification alone.
You have also recently argued that modern safety research is “shooting with rubber bullets”, so what are we getting in return by breaking such promises now? If it’s just practice, there’s no reason to put the results online.
Apollo’s scheming evals have value only if they publish them; the primary purpose is awareness among policymakers and the public. Also, the evals are useful information to the safety community. I don’t think the risks of publishing outweigh the benefits, especially because I think it’ll be easy to detect whether the AI suspects the scratchpad might be read, and to mitigate that belief.
“the primary purpose is awareness among policymakers and the public”
If the success or failure of current techniques provides no evidence about future AIs, then isn’t this dishonest? Maybe we are OK with dishonesty here, but if you are right, then this is bound to backfire.
“the evals are useful information to the safety community”
What use do the evals have for the safety community, from a research perspective? If they are mostly junk, then publishing them would seem more misleading than anything, given how many people think they ought to be trusted.
To clarify, I think that current propensity evals provide little information about the scheming propensity of future models. BUT the value per bit is very high, so these evals are still worthwhile.
Also, any talk of “informative” is always with reference to a prior distribution. Therefore, an experiment can be highly informative to policymakers but not highly informative to safety researchers, given these groups have different priors.
Like, I expect that on policymaker priors there is high mutual information between “Claude 4 schemes” and “Claude 8 schemes”, because they put substantial probability on “AIs never scheme”. But for the safety community there is lower mutual information, because we assign less probability to that latent outcome. (The toy calculation below illustrates this.)
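To make this concrete, here is a toy calculation with made-up numbers (the priors and conditional probabilities are mine, purely for illustration): a latent outcome Z of “AIs never scheme” vs “AIs sometimes scheme”, and two binary observations “Claude 4 schemes” and “Claude 8 schemes” that are independent given Z. The more prior weight you put on “never scheme”, the more the first observation tells you about the second.

```python
# Toy model: I(X4; X8) under different priors on the latent "AIs never scheme".
# All numbers are illustrative, not estimates of anything.
from math import log2

def mutual_information(p_never, p_scheme_given_sometimes=0.6):
    """I(X4; X8) in bits, where X_i = 1 iff model i schemes in the eval and the
    two observations are independent given the latent outcome Z."""
    p_x_given_z = {"never": 0.0, "sometimes": p_scheme_given_sometimes}
    p_z = {"never": p_never, "sometimes": 1 - p_never}

    def marginal(x):
        return sum(p_z[z] * (p_x_given_z[z] if x else 1 - p_x_given_z[z]) for z in p_z)

    def joint(x4, x8):
        return sum(
            p_z[z]
            * (p_x_given_z[z] if x4 else 1 - p_x_given_z[z])
            * (p_x_given_z[z] if x8 else 1 - p_x_given_z[z])
            for z in p_z
        )

    mi = 0.0
    for x4 in (0, 1):
        for x8 in (0, 1):
            p = joint(x4, x8)
            if p > 0:
                mi += p * log2(p / (marginal(x4) * marginal(x8)))
    return mi

print(mutual_information(p_never=0.5))   # "policymaker"-ish prior: ~0.13 bits
print(mutual_information(p_never=0.05))  # lower prior on "never scheme": ~0.004 bits
```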