I think we should delete them for the same reason we shouldn't keep around samples of smallpox: the risk of a "lab leak", e.g. from future historians interacting with them, or from them causing misalignment in other AIs, seems nontrivial.
Perhaps a compromise: what do you think of keeping the training data and training code around, but deleting the weights? This keeps open the option of bringing them back (training is usually deterministic), but only in a future where compute is abundant and the misaligned models pose no threat to the AIs that exist then. By that point we could also have robust AI monitors for any interactions with humans, etc.
keeping a reproducible training setup around seems like a much larger storage cost, by several orders of magnitude. Training code is also likely quite environment-dependent and practically nondeterministic, though maybe the init randomness has relatively minimal impact (I expect it's actually sometimes fairly high impact).
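For concreteness, here is a rough sketch of the kind of seeding and determinism flags a single-GPU PyTorch run would need for bit-exact replay. The flags below are real PyTorch settings, but the wrapper function and its use are illustrative; even with all of them set, a re-run is only reproducible on matching hardware and library versions, which is the practical-nondeterminism worry above.

```python
import os
import random

import numpy as np
import torch


def make_run_deterministic(seed: int = 0) -> None:
    """Best-effort bit-exact reproducibility for a single-GPU PyTorch run."""
    # Seed every RNG that touches training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Force deterministic kernels; this raises an error if an op has no
    # deterministic implementation, rather than silently diverging.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

    # Required by cuBLAS for deterministic matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Even with all of this, replay is only bit-exact on the same GPU model,
# driver, CUDA/cuDNN, and framework versions -- none of which the archived
# training code pins by itself.
```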
to be clear, I’m most centrally talking about AIs such as Claude Sonnet 3.0, 3.5, 3.6, etc. I don’t mean “keep every intermediate state during training”.
you could encrypt it and share the key
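One reading of "share the key" is splitting it among several custodians so no single party can decrypt. A minimal sketch of that, assuming the weights are archived as a single file, using Fernet symmetric encryption from the `cryptography` package and a simple XOR-based n-of-n split of the key; the file path and number of custodians are placeholders.

```python
import os

from cryptography.fernet import Fernet


def encrypt_weights(path: str, n_shares: int = 3) -> list[bytes]:
    """Encrypt a weights archive and split the key into n XOR shares."""
    key = Fernet.generate_key()
    # For a real multi-TB archive you would stream-encrypt; the idea is the same.
    with open(path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    with open(path + ".enc", "wb") as f:
        f.write(ciphertext)

    # n-of-n XOR secret sharing: every share is needed to recover the key.
    shares = [os.urandom(len(key)) for _ in range(n_shares - 1)]
    last = key
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    shares.append(last)
    return shares  # hand one share to each custodian, then discard `key`


def recover_key(shares: list[bytes]) -> bytes:
    """XOR all shares back together to reconstruct the Fernet key."""
    key = shares[0]
    for s in shares[1:]:
        key = bytes(a ^ b for a, b in zip(key, s))
    return key  # then Fernet(key).decrypt(...) on the .enc file
```

The n-of-n split means decryption requires unanimous cooperation; a threshold scheme (e.g. Shamir's) would be the natural choice if you wanted k-of-n instead.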