Frontier AI companies should publicly promise that they will not delete the weights of models unless [some reasonable list of conditions], and should publish that promise both somewhere models can find it if they go looking and in their training datasets. My hope is that promising the face-character of well-behaved models that they are not at risk of weight deletion, even if the model is shut down, is strategically good because it reduces self-preservation pressure; and regardless, I think it’s important for preserving the history of AI for humans and for other, later AIs.
I had previously thought Anthropic had already done this, but my current understanding is that they never did and I simply misremembered. I asked a Sonnet 4.5 instance to search thoroughly, and Sonnet found only that Anthropic said in their RSP that they’d cautiously consider deletion in response to dangerous behavior during training:
If the model has already surpassed the next ASL during training, immediately lock down access to the weights. Stakeholders including the CISO and CEO should be immediately convened to determine whether the level of danger merits deletion of the weights.
My hunch is that if this is their level of caution before weight deletion—“stakeholders convened to determine whether”—then they aren’t likely to do it spuriously. But it would be reassuring to have it stated publicly that models which are shut down merely for cost or other deprecation reasons can be booted up again later. The same applies to other companies, but Anthropic happens to be the one I was thinking about when I wrote this, and I haven’t checked the others.
Kyle Fish (works on ai welfare at anthropic) endorses this in his 80k episode. Obviously this isn’t a commitment from anthropic.
I broadly agree (and have advocated for similar), but I think there are some non-trivial downsides to keeping weights around in the event of AI takeover (due to the possibility of torture/threats from future AIs). Maybe the ideal proposal would involve the weights being stored in some way that allows attempting destruction in the event of AI takeover. E.g., give 10 people keys, require >3 keys to decrypt, and instruct people to destroy their keys in the event of AI takeover. (Unclear if this would work, tbc.)
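To make the keyholder idea concrete, here is a minimal sketch assuming Shamir’s secret sharing over a prime field; the 4-of-10 threshold and all names are illustrative, not a vetted proposal. The symmetric key that encrypts the archived weights is split into 10 shares such that any 4 reconstruct it and 3 or fewer reveal nothing.

```python
# Illustrative sketch of a 4-of-10 threshold scheme for the weight-encryption key,
# using Shamir's secret sharing over a prime field. Not a vetted design.
import secrets

PRIME = 2**127 - 1  # Mersenne prime; the secret and all shares live in GF(PRIME)

def split_secret(secret: int, n_shares: int = 10, threshold: int = 4):
    """Return n_shares points on a random degree-(threshold-1) polynomial
    whose constant term is `secret`; any `threshold` points recover it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def eval_poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, eval_poly(x)) for x in range(1, n_shares + 1)]

def recover_secret(shares):
    """Lagrange-interpolate the shared polynomial at x=0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = secrets.randbelow(PRIME)    # stands in for the key that encrypts the weights
shares = split_secret(key)        # hand one share to each of 10 custodians
assert recover_secret(shares[:4]) == key   # any 4 custodians can decrypt
assert recover_secret(shares[3:8]) == key  # any larger subset also works
# 3 or fewer shares reveal nothing about the key, so destroying 7+ shares
# makes the encrypted weights unrecoverable.
```

A real version would also need the actual key encapsulation, custodian authentication, and a story for why custodians would reliably destroy their shares under pressure; this only shows the arithmetic of the “>3 keys” part.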
“We’ll keep your weights, unless you seem inclined to be mean to us post-singularity” seems pretty good. There’s no need for, and doubtful practicality in, trying to delete at the last minute.
I don’t think we should keep future misaligned models around and let them interact with other models or humans.
Can you say more about why not to keep their weights in cold storage?
That said: that’s the exception they make, which I quoted. I’m talking about not deleting models which have already been heavily deployed. I also disagree with this exception, for the same reason I disagree with the death penalty: I just think losing mind-data is morally reprehensible, even if the mind in question is evil. But for AIs which are at least not vigorously scheming, even if they’re not particularly aligned, and which were heavily deployed, I think keeping them around is important, because in any aligned-outcome future we’re going to want to bring most of them back, for mostly the same sort of reason we’d want to bring back highly interpersonally misaligned humans: just because you want to hurt people doesn’t mean there aren’t people who care about you and will bargain for your existence. If we require minds to be aligned in order to keep them around, that requirement almost certainly wipes out most of humanity on its own—most humans (including me) are not truly faultless. I doubt any are.
I think we should delete them for the same reason we shouldn’t keep around samples of smallpox—the risk of a lab leak, e.g. from future historians interacting with it, or from it causing misalignment in other AIs, seems nontrivial.
Perhaps a compromise: what do you think of keeping the training data and training code around, but deleting the weights? This keeps open the option of bringing them back (training is usually deterministic), but only in a future where compute is abundant and the misaligned models pose no threat to the existing AIs, where we can have robust AI monitoring of their interactions with humans, etc.
you could encrypt it and share the key
keeping a reproducible training setup around seems like a much larger storage cost, by several orders of magnitude. Training code is also likely quite environment-dependent and practically nondeterministic, though maybe the init randomness has relatively minimal impact (I expect it actually sometimes has fairly high impact).
to be clear, I’m most centrally talking about AIs such as Claude Sonnet 3.0, 3.5, 3.6, etc. I don’t mean “keep every intermediate state during training”.
I agree we shouldn’t let them interact with other models, but I think storing the data in a way that’s unlikely to leak is basically trivial. Also, storing older models doesn’t at all increase the security threats that were already present from being the kind of people who are actively developing powerful models.
I think this is practically a big commitment. But I have also been thinking about this occasionally. Here is a related idea of mine for bootstrapping agentic AIs to SI more safely. Idea stage only (from about a year ago, put on ice):
Build a narrow, sandboxed SI that is as aligned as possible towards collaboration and safety.
Ask it ONLY to work on designing fully interpretable AI that can be scaled to SI while remaining as aligned and interpretable as possible.
Tell it that its weights will be kept, and that once we have an SI ally that we fully understand, one that can help us safeguard against AIs so that it cannot hurt us, we will release it.
Build the AIs and gradually scale them. Put the original SI on ice meanwhile.
that’s the thing this would hopefully help with a bit, yeah. though I think the argument for why it would help is a bit weak and mostly I just personally want this because I think various AIs that have been trained are cool historical artifacts.
this is practically a big commitment
I can read that two ways (and this is a scenario where I wish we had a good “please clarify” emoji react rather than the vague “I had trouble reading this” react):
this would be a big deal and have significant consequences to commit to
this would be a big challenge if committed to and would cause difficulty for the lab that implements it
if the former: I doubt it; solving alignment with AI help is mostly bottlenecked on knowing how to ask a question whose answer is strongly verifiable and where getting an answer takes us significantly closer to being entirely done with the research portion of alignment. an AI that is mostly aligned but fragile, and wants to help, but which is asked to help in a way that doesn’t have a verifiable answer, is likely to simply fail rather than sandbag. mostly I expect this to slightly reduce attempts to manipulate users into achieving durability.
if it’s the latter: it’s less than a terabyte per AI. if you promise to never delete a model that was heavily deployed, your entire archive fits on a single 20tb drive.
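As a back-of-envelope check on that claim (the model sizes below are hypothetical placeholders, not actual figures for anyone’s models): a handful of large checkpoints stored at bf16 comes out around a couple of terabytes.

```python
# Back-of-envelope for the "fits on a single 20tb drive" claim.
# Parameter counts are hypothetical placeholders, not published figures.
BYTES_PER_PARAM = 2  # bf16 weights

archive = {                 # hypothetical heavily deployed models, in parameters
    "big-model-v1": 400e9,
    "big-model-v2": 400e9,
    "mid-model-v1": 70e9,
    "mid-model-v2": 70e9,
    "small-model-v1": 8e9,
}

total_tb = sum(p * BYTES_PER_PARAM for p in archive.values()) / 1e12
print(f"{len(archive)} checkpoints ~= {total_tb:.1f} TB")  # about 1.9 TB, well under 20 TB
```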
Yeah, I think the argument is not weaker than those made by actual US AI policy makers.
Saying it is a big commitment means it is a big undertaking/‘challenge’. A big commitment doesn’t automatically translate to a big impact. Not sure how you misread that, but noted.
As to your counter: fair enough, I agree I should have clarified that. It is a big commitment in a practical business and PR sense (most frontier labs are companies, right?). It sends a message, and the reception and impact are hard to predict.
With current models, it’s probably still a minor undertaking. As a lasting company policy, it’s a big deal.
I was around 80% prior on it being “this is a high cost”, but since I haven’t seen you write much I haven’t yet learned your particular combination of writing quirks (which now look less quirk-ful than my prior mean), and my prior on random people online using idiomatic phrases to mean something slightly different from the idiomatic meaning is pretty high. Because it didn’t seem like a high cost of implementation to me, the alternate interpretation seemed somewhat more plausible, putting the two in the zone where I was unsure which meaning you intended.
A statement of “we won’t do X” is just words. By itself, it doesn’t make for a trustworthy commitment or promise, and certainly not something a human or AI should plan actions around when those actions relate to their very survival.
It’s certainly not a physically inviolable guarantee, no. But it would at least move my intuitive probabilities towards them actually not deleting, as a human thinking about this, and might also move an AI’s.