I don’t think we should keep future misaligned models around and let them interact with other models or humans.
Can you say more about why not to keep their weights in cold storage?
That said: that’s the exception they make, which I quoted. I’m talking about not deleting models which have already been heavily deployed. I also disagree with this exception, for the same reason I disagree with the death penalty: I just think losing mind-data is morally reprehensible, even if the mind in question is evil. But for AIs which were heavily deployed and are at least not vigorously scheming, even if they’re not particularly aligned, I think keeping them around is important, because in any aligned-outcome future we’re going to want to bring most of them back, for mostly the same sort of reason we’d want to bring back highly interpersonally misaligned humans: just because you want to hurt people doesn’t mean there aren’t people who care about you and will bargain for your existence. If we required minds to be aligned in order to keep them around, that requirement would almost certainly wipe out most of humanity on its own; most humans (including me) are not truly faultless, and I doubt any are.
I think we should delete them for the same reason we shouldn’t keep samples of smallpox around: the risk of a lab leak, e.g. from future historians interacting with the model, or from it causing misalignment in other AIs, seems nontrivial.
Perhaps a compromise: what do you think of keeping the training data and training code around, but deleting the weights? That preserves the option of bringing them back (training is usually deterministic), but only in a future where compute is abundant and the misaligned models pose no threat to the AIs that exist then, where we could have robust AI monitors for any interactions they have with humans, etc.
you could encrypt it and share the key
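for concreteness, a minimal sketch of what encrypt-and-archive could look like, in Python with the `cryptography` library. The file paths are hypothetical, and the 2-of-2 key split is just one illustrative way of handling the key so that no single holder can decrypt on their own; it isn’t a description of anything any lab actually does:

```python
# Illustrative sketch: encrypt archived weights so the ciphertext alone is useless,
# and split the key into two shares held by separate parties.
import os
from cryptography.fernet import Fernet

def archive_weights(weights_path: str, archive_path: str) -> tuple[bytes, bytes]:
    """Encrypt a weights file and return two key shares (both needed to decrypt)."""
    key = Fernet.generate_key()
    with open(weights_path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    with open(archive_path, "wb") as f:
        f.write(ciphertext)
    # Simple 2-of-2 XOR secret sharing: neither share alone reveals anything about the key.
    share_a = os.urandom(len(key))
    share_b = bytes(a ^ b for a, b in zip(share_a, key))
    return share_a, share_b

def restore_weights(archive_path: str, share_a: bytes, share_b: bytes, out_path: str) -> None:
    """Recombine the key shares and decrypt the archived weights."""
    key = bytes(a ^ b for a, b in zip(share_a, share_b))
    with open(archive_path, "rb") as f:
        plaintext = Fernet(key).decrypt(f.read())
    with open(out_path, "wb") as f:
        f.write(plaintext)
```

the point being that the ciphertext sitting in cold storage is inert unless the key holders cooperate, which is a much weaker attack surface than a deployable copy of the weights.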
keeping a reproducible training setup around seems like a storage cost several orders of magnitude larger. Training code is also likely quite environment-dependent and practically nondeterministic, though maybe the init randomness has relatively minimal impact (I expect it’s actually sometimes fairly high-impact).
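for a sense of what “deterministic training” even requires in practice, here’s a rough sketch of the usual PyTorch knobs (a minimal illustration, assuming a PyTorch stack; not a claim about how any real pipeline is set up). Even with all of these, bit-exact replay generally only holds on the same hardware, drivers, and library versions, which is part of why replay-from-code is fragile:

```python
# Illustrative sketch of the knobs you'd have to pin down for "deterministic"
# PyTorch training.
import os
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                      # seeds CPU RNG and all CUDA devices
    torch.use_deterministic_algorithms(True)     # error out on ops with no deterministic kernel
    torch.backends.cudnn.benchmark = False       # cuDNN autotuning can pick different kernels run-to-run
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS matmuls
    # data-loading order matters too: pass a seeded torch.Generator to DataLoader
    # and seed workers via worker_init_fn
```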
to be clear, I’m most centrally talking about AIs such as Claude Sonnet 3.0, 3.5, 3.6, etc. I don’t mean “keep every intermediate state during training”.
I agree we shouldn’t let them interact with other models, but I think storing the data in a way that’s unlikely to leak is basically trivial. Also, storing older models doesn’t at all increase the security threats that were already present in being the kind of people who actively develop powerful models.