Two years ago, you were asked:
I hear you sometimes share dual-use (or plain capabilities?) ideas with Anthropic. If that’s true, does this change your policy?
To which you responded:
Anthropic is in little need of ideas from me, but yeah, I’ll probably pause such things for a while. I’m not saying the RSP is bad, but I’d like to see how things work out.
I find it a bit sad that in this essay, and in your one advocating for your AUNN architecture, you’ve gone in a different direction and shared your capabilities ideas not only with Anthropic but with the public[1]. The alignment section is fairly speculative, and doesn’t make a strong argument for why your concrete proposals (high learning rates, weight decay, overparameterization, etc.) will lead to ‘true generalization’ and a ‘genuinely moral AI’. Even assuming your proposal does lead to brain-like generalization, there are still many alignment problems left unsolved which your essay doesn’t discuss. Without further progress on these, it seems unwise to me to publish this type of research.
[1] Though this essay mostly does seem like a capabilities proposal to the labs. There are not many private actors who have the means and expertise to run the 100T parameter runs outlined.


Unfortunately, this isn’t the type of operation we know how to do with existing training techniques. We have pretraining/SFT, which encourages the model to learn various facts and predict text well, and RL, which rewards the model for achieving some outcomes (possibly including via AI feedback). We are far from being able to do something like “Teach the model to value X for Y reason”. Imo it’s better to think of the training setup as a set of thousands of RL environments with very noisy/hackable rewards, which the labs hope will result in a model that generalizes well (though this often fails). So this seems difficult without major advances in our understanding of generalization, or much better interpretability.