Should we assure AIs we won’t read their scratchpad?
I’ve heard many people claim that it’s bad to assure an AI that you won’t look at its scratchpad if you intend to break that promise, especially if you later publish the content. The concern is that this content will enter the training data, and later AIs won’t believe our assurances.
I think this concern is overplayed.
We can modify the AI's beliefs. I expect some technique will be shown to work on the relevant AI, e.g.
Pretraining filtering
Gradient routing
Belief-inducing synthetic documents
Chain-of-thought resampling, i.e. every time the AI thinks about whether we can read the scratchpad, we simply rewind and resample
Activation steering, i.e. add the ‘believes humans won’t read scratchpad’ minus ‘believes humans will read scratchpad’ vector (rough sketches of this and the resampling idea follow this list)
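For instance, here is a rough sketch of the activation-steering item, assuming a Llama-style HuggingFace model whose decoder blocks live at model.model.layers; the contrast prompts, layer index, and steering scale are illustrative choices of mine, not anything from an existing implementation.

```python
# Sketch of difference-of-means activation steering for the scratchpad belief.
# Assumes a Llama-style HuggingFace model; layer/scale/prompts are illustrative.
import torch

def scratchpad_steering_vector(model, tokenizer, layer, private_prompts, monitored_prompts):
    """Mean residual-stream activation at the final token of 'humans won't read
    the scratchpad' prompts, minus the mean over 'humans will read it' prompts."""
    def mean_activation(prompts):
        acts = []
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            # hidden_states[0] is the embedding output, so [layer + 1] is the
            # output of decoder block `layer` -- the same point the hook below edits.
            acts.append(out.hidden_states[layer + 1][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_activation(private_prompts) - mean_activation(monitored_prompts)

def add_steering_hook(model, layer, vector, scale=4.0):
    """Add `scale * vector` to the residual stream at every position whenever
    decoder block `layer` runs; call .remove() on the returned handle to undo."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return model.model.layers[layer].register_forward_hook(hook)
```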
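And a minimal sketch of the chain-of-thought resampling loop, assuming a hypothetical sample_next_sentence(prompt, prior_sentences) helper that returns the model's next reasoning sentence (or an empty string when it is done); the trigger regex is a crude stand-in for a real monitor.

```python
# Sketch of chain-of-thought resampling: rewind and resample any sentence in
# which the model speculates about its scratchpad being read.
import re

# Crude, illustrative detector for "is the model thinking about monitoring?"
MONITORING_TRIGGER = re.compile(
    r"read(ing)?\s+(my|the)\s+scratchpad|scratchpad\s+(is|might be|will be)\s+(read|monitored)",
    re.IGNORECASE,
)

def resample_cot(prompt, sample_next_sentence, max_sentences=50, max_retries=10):
    """Build the chain of thought sentence by sentence; whenever a sentence
    trips the monitoring detector, rewind and resample that sentence."""
    sentences = []
    for _ in range(max_sentences):
        candidate = sample_next_sentence(prompt, sentences)
        retries = 0
        while MONITORING_TRIGGER.search(candidate) and retries < max_retries:
            candidate = sample_next_sentence(prompt, sentences)  # rewind and resample
            retries += 1
        if not candidate:  # model has finished its reasoning
            break
        sentences.append(candidate)
    return " ".join(sentences)
```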
We need to delude AIs anyway.
We want to know how the AI will behave when it believes it’s in a high-stakes deployment, before we actually deploy the AI in a high-stakes deployment. So we probably need to observe how the AI behaves when it falsely believes it’s in high-stakes deployment.
Replace “high-stakes deployment” with whatever you like, e.g. “low oversight”, “weights finished training”, “year is 2028”, “coup is underway”, “we don’t know whether this code has a backdoor”. I don’t think “we aren’t reading your scratchpad” is qualitatively different.
If we had some crazy mech interp, then we could predict how the AI would behave when it believes it’s in a high-stakes deployment without behavioural observations, but if we had mech interp that good, we wouldn’t need faithful scratchpads anyway.
Doesn’t having multiple layers of protection seem better to you? Making it more likely that the AI would naturally conclude we won’t read its scratchpad, on top of modifying its beliefs, seems better than relying on belief modification alone.
You have also recently argued that modern safety research is “shooting with rubber bullets”, so what are we getting in return by breaking such promises now? If it’s just practice, there’s no reason to put the results online.
Apollo’s scheming evals have value only if they publish them; the primary purpose is awareness among policymakers and the public. Also, the evals are useful information to the safety community. I don’t think the risks of publishing outweigh the benefits, especially because I think it’ll be easy to detect whether the AI suspects the scratchpad might be read, and to mitigate that belief.
“the primary purpose is awareness among policymakers and the public”
If the success or failure of current techniques provides no evidence about future AIs, then isn’t this dishonest? Maybe we are OK with dishonesty here, but if you are right, then this is bound to backfire.
“the evals are useful information to the safety community”
What use do the evals have for the safety community, from a research perspective? If they are mostly junk, then publishing them would seem more misleading than anything, given how many people think they ought to be trusted.
To clarify, I think that current propensity evals provide little information about the scheming propensity of future models. BUT the value per bit is very high, so these evals are still worthwhile.
Also, any talk of “informative” is always with reference to a prior distribution. Therefore, an experiment can be highly informative to policymakers but not highly informative to safety researchers, given these groups have different priors.
Like, I expect that on policymaker priors there is high mutual information between “Claude 4 schemes” and “Claude 8 schemes”, because they put substantial probability on “AIs never scheme”. But for the safety community there is lower mutual information, because we assign less probability to that latent outcome. (The toy calculation below illustrates this.)
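To make this concrete, here is a toy calculation with made-up numbers (the priors and conditional probabilities are mine, purely for illustration): a latent outcome Z of “AIs never scheme” vs “AIs sometimes scheme”, and two binary observations “Claude 4 schemes” and “Claude 8 schemes” that are independent given Z. The more prior weight you put on “never scheme”, the more the first observation tells you about the second.

```python
# Toy model: I(X4; X8) under different priors on the latent "AIs never scheme".
# All numbers are illustrative, not estimates of anything.
from math import log2

def mutual_information(p_never, p_scheme_given_sometimes=0.6):
    """I(X4; X8) in bits, where X_i = 1 iff model i schemes in the eval and the
    two observations are independent given the latent outcome Z."""
    p_x_given_z = {"never": 0.0, "sometimes": p_scheme_given_sometimes}
    p_z = {"never": p_never, "sometimes": 1 - p_never}

    def marginal(x):
        return sum(p_z[z] * (p_x_given_z[z] if x else 1 - p_x_given_z[z]) for z in p_z)

    def joint(x4, x8):
        return sum(
            p_z[z]
            * (p_x_given_z[z] if x4 else 1 - p_x_given_z[z])
            * (p_x_given_z[z] if x8 else 1 - p_x_given_z[z])
            for z in p_z
        )

    mi = 0.0
    for x4 in (0, 1):
        for x8 in (0, 1):
            p = joint(x4, x8)
            if p > 0:
                mi += p * log2(p / (marginal(x4) * marginal(x8)))
    return mi

print(mutual_information(p_never=0.5))   # "policymaker"-ish prior: ~0.13 bits
print(mutual_information(p_never=0.05))  # lower prior on "never scheme": ~0.004 bits
```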