I expect that advanced AI systems will do in-context optimization, and this optimization may well be gradient descent or a gradient-descent-derived method. Applied recursively, this seems worrying.
Let the outer objective be the loss function implemented by the ML practitioner, and the outer optimizer be the gradient descent process the practitioner implements. Then let the inner₁-objective be the objective the trained model uses for its in-context gradient descent process, and the inner₁-optimizer be that in-context gradient descent process itself. It seems plausible that the inner₁-optimizer will itself instantiate an inner objective and optimizer; call these the inner₂-objective and inner₂-optimizer. And in turn an inner₃-objective and -optimizer may arise, and so on.
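To make the regress concrete, here is a toy sketch in Python. Nothing here models how a real network would implement in-context gradient descent; the objective functions, targets, and the drift between levels are invented for illustration, under the assumption that each innerₙ-optimizer runs gradient descent on a slightly mis-specified proxy of the objective one level up.

```python
def gd(objective, x, lr=0.1, steps=20, h=1e-5):
    """Plain gradient descent on a scalar objective, using a
    central finite difference in place of an analytic gradient."""
    for _ in range(steps):
        grad = (objective(x + h) - objective(x - h)) / (2 * h)
        x -= lr * grad
    return x

# Outer objective: the loss the practitioner actually writes down.
def outer_objective(theta):
    return (theta - 3.0) ** 2

# Hypothetical inner1-objective: the proxy the trained model optimizes
# in-context, assumed here to be slightly mis-specified.
def inner1_objective(phi):
    return (phi - 2.9) ** 2

# Hypothetical inner2-objective: a proxy instantiated by the
# inner1-optimizer, assumed to drift a little further.
def inner2_objective(psi):
    return (psi - 2.7) ** 2

# Each level runs gradient descent on the next level's proxy,
# starting from the point the level above converged to.
theta = gd(outer_objective, 0.0)    # outer optimizer
phi = gd(inner1_objective, theta)   # inner1-optimizer
psi = gd(inner2_objective, phi)     # inner2-optimizer

# Each level ends up further from the outer target of 3.0.
print(theta, phi, psi)
```

The point of the sketch is only structural: even if every level faithfully optimizes its own objective, small mis-specifications compound across levels, which is the failure mode the recursive framing is meant to highlight.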
Thus, another risk model for value instability: recursive inner alignment. Though we may solve inner₁-alignment, inner₂-alignment may not be solved, nor, in general, innerₙ-alignment for any n > 1.