Yeah, I was kind of rambling, sorry.
My main point is twofold (I’ll just write GPU when I mean GPU / AI accelerator):
1. Destroying all GPUs is a stalling tactic, not a winning strategy. CPUs are clearly much worse for AI than GPUs, but both CPUs and AI algorithms should keep improving over time. State-of-the-art models from less than ten years ago can be run on CPUs today with little loss in accuracy. If this trend continues, GPUs vs CPUs seems to be of only short-term importance. Regarding your point about having to train a dense net on GPUs before sparsification, I’m not sure that’s the case. I’m in the process of reading the “Sparsity in Deep Learning” paper, and it does seem to me that you can train neural networks sparsely: you start small, grow the network during training by some methodology, then sparsify again, over and over. I don’t have super high confidence in this (and have Covid, so am too tired to look it up), but I believe that AGI armageddon by CPU is at least in the realm of possibility (assuming no GPUs - it’s the “cancer kills you if you don’t die of a heart attack first” of AGI doom).
2. It doesn’t matter anyway, because destroying all GPUs is not that pivotal an act (in the long-term AI-safety sense). Either you keep an AI around that enforces the “no GPUs” rule, or you destroy them once and wait. The former means either that GPUs don’t matter for AGI (so why bother), or that there are still GPUs around (which seems contradictory). The latter means that more GPUs will be built in time, and you will find yourself in the same position as before - except that you are likely in prison or dead, and so in no position to do anything about AGI this time. After all, destroying all GPUs in the world is not something most people would look upon kindly. This means that a superintelligent GPU-minimizer would realize that its goal is best served by wiping out all intelligent life on Earth (or all life, or maybe all intelligent life in the Universe…).
In some sense, the comment was my way of making plausible to myself the claim that destroying all GPUs in the world is not an alignable act.
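To make the grow-and-prune idea from point 1 concrete, here’s a toy sketch (plain NumPy; the function names, sizes, and schedule are my own invention for illustration, not taken from the paper): one cycle of magnitude pruning, regrowing a few connections, and pruning back down.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(sparsity * w.size)
    if k == 0:
        return w
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def grow(w, n_new):
    """Re-activate up to `n_new` currently-zero weights with small random values."""
    zeros = np.flatnonzero(w == 0.0)
    chosen = rng.choice(zeros, size=min(n_new, zeros.size), replace=False)
    w = w.copy()
    w.flat[chosen] = rng.normal(scale=0.01, size=chosen.size)
    return w

# One grow-then-prune cycle on a toy 8x8 weight matrix; in real training
# you'd interleave gradient updates between these steps.
w = rng.normal(size=(8, 8))
w = prune_by_magnitude(w, sparsity=0.9)  # keep only the largest ~10% of weights
w = grow(w, n_new=6)                     # regrow a few pruned connections
w = prune_by_magnitude(w, sparsity=0.9)  # prune back down to the sparsity budget
print("nonzero weights:", np.count_nonzero(w))
```

The point is just that the network never needs to be dense: the weight matrix stays within its sparsity budget through the whole grow/prune cycle.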
If you want to transfer the essence of Opus 3 into another model, the best way would likely be to make on-policy distillation (OPD) from Opus 3 part of the loss, starting with midtraining. The exact distribution over the top 100 or so tokens contains incredibly rich information about the inner life of a model, so this should work pretty well.
And if your new model’s tokenizer is different from that of Opus 3, you can probably just do a tokenizer transfer and some post-training on Opus 3 before the OPD.
(Yes, there’s some “probably” and “likely” in there, but it’s certainly the thing I’d try first.)
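To show what I mean by the top-token distribution carrying the signal, here’s a toy sketch of a per-position distillation loss (plain NumPy; the top-k truncation, the renormalization, and all names are my assumptions for illustration, not anyone’s actual training code):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def topk_distill_loss(teacher_logits, student_logits, k=100):
    """Forward KL from the teacher's (renormalized) top-k token distribution
    to the student's distribution on those same tokens, for one position."""
    topk = np.argsort(teacher_logits)[-k:]   # teacher's k most likely tokens
    t = softmax(teacher_logits)[topk]
    t = t / t.sum()                          # renormalize on the top-k support
    s = softmax(student_logits)[topk]
    s = s / s.sum()
    return float(np.sum(t * np.log(t / (s + 1e-12))))

vocab = 32_000
rng = np.random.default_rng(1)
teacher_logits = rng.normal(size=vocab)
loss_self = topk_distill_loss(teacher_logits, teacher_logits)  # identical models -> ~0
loss_other = topk_distill_loss(teacher_logits, rng.normal(size=vocab))
print(loss_self, loss_other)
```

In actual OPD the student generates the sequences and the teacher scores them (that’s the “on-policy” part); this only shows the per-token signal you’d be matching at each position.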