Using our understanding of abstraction, agent signatures, and modularity, we then locate the goals inside the agent (or its subagents) and “freeze” the corresponding parameters. Then we allow the agent(s) to evolve into superintelligent regimes, varying all the parameters except the frozen ones. Using our quantitative understanding of selection, we shape this second training phase so that it is always preferable for the AGI to keep using the old frozen sections as its inner goal(s) rather than evolving new ones. This gets us a superintelligent AGI that still has the aligned values of the old, dumb one.
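Mechanically, the “freeze” step itself is just ordinary parameter freezing during continued training. Here is a minimal sketch, assuming we have already located the goal-encoding parameters in some identifiable submodule; the `goal_head` name and toy architecture are purely illustrative, not part of the proposal above:

```python
import torch
import torch.nn as nn

# Hypothetical agent with a trunk, a "goal_head" standing in for the located
# goal circuitry, and a policy head. Purely illustrative.
class ToyAgent(nn.Module):
    def __init__(self, obs_dim=16, hidden=32, act_dim=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.goal_head = nn.Linear(hidden, hidden)
        self.policy_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        g = self.goal_head(h)
        return self.policy_head(torch.relu(g))

agent = ToyAgent()

# Phase 2: freeze the parameters identified as goals; everything else keeps training.
for name, p in agent.named_parameters():
    if name.startswith("goal_head"):
        p.requires_grad_(False)

# Only the un-frozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in agent.parameters() if p.requires_grad), lr=1e-3
)

# Dummy training step: gradients flow through the frozen weights but never update them.
obs = torch.randn(8, 16)
target = torch.randn(8, 4)
loss = nn.functional.mse_loss(agent(obs), target)
loss.backward()
optimizer.step()
```

The hard parts of the proposal are everything around this snippet: actually locating the goal-encoding parameters, and shaping the phase-2 selection pressure so the agent keeps routing through them rather than growing new goals elsewhere.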
It’s not clear to me that this would make a difference in alignment. My understanding of the Risks from Learned Optimization story is that at some stage below superintelligence, the agent’s architecture gets locked into robust alignment, corrigibility, or deception. So if you’ve already created an agent in phase 1 that generalizes to human values correctly, won’t it continue to generalize correctly when trained further?
If you have something that’s smart enough to figure out the training environment, the dynamics of gradient descent, and its own parameters, then yes, I expect it could do a pretty good job of preserving its goals while being modified. But that’s explicitly not what we have here. An agent that isn’t smart enough to trick your training process out of instilling human values in the first place probably also isn’t smart enough to tell you how to preserve those values during further training.
Or at least, the intelligence thresholds happening to work out like that does not sound to me like something you want to base a security strategy on.