which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives
Sorry I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?
In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives), although I’m not sure how that would work at the lower levels, where the overseers are not very smart.