Oops. Then I don’t get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work, in some more limited manner, for current agents (in part because most techniques assume that no phase change occurs between now and then, or that any phase change doesn’t affect the technique, so if the technique stops working it does so gradually and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that gives you an aligned superintelligence, there’s no way to find it.
Kindness may also have an attractor, or, due to discreteness, occupy a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade several obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren’t obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Regarding your concrete attractor: if the AI doesn’t improve its world model and decisions, i.e. its intelligence, then it’s also not useful to us. And a human in the loop doesn’t help if the AI’s proposals are inscrutable to us, because then we’ll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence because it only does so in ways that preserve its corrigibility.