Okay, here’s another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step’s worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.
Maybe we can turn this into a zero-sum game, though? Here’s a proposal: let $M'$ be a copy of $M$ and $Q_{\text{tree}}$ be the set of all questions in the current tree that also get erasures. Then, let

$$L_M(Q) = d\left(H(Q \mid M'),\, M(Q)\right) - \frac{1}{|Q_{\text{tree}}| - 1} \sum_{Q' \in Q_{\text{tree}} \setminus \{Q\}} d\left(H(Q' \mid M'),\, M'(Q')\right)$$
such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It’s still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.
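As a sanity check on the zero-sum claim, here’s a minimal numerical sketch. It assumes $d$ is squared error and treats the $H(Q \mid M')$ targets and $M(Q)$ outputs as arbitrary numbers; since $M'$ is an exact copy of $M$, the first term for question $Q_i$ equals the $M'$-based term that the other questions subtract, so everything cancels. All names here are hypothetical stand-ins, not the actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tree with N questions that get erasures.
# h[i] stands in for H(Q_i | M'), the training target;
# m[i] stands in for M(Q_i). M' is an exact copy of M,
# so M'(Q_i) yields the same values m[i].
N = 5
h = rng.normal(size=N)
m = rng.normal(size=N)

def d(a, b):
    """Distance metric d -- squared error, chosen here for concreteness."""
    return (a - b) ** 2

base = d(h, m)  # d(H(Q_i | M'), M(Q_i)) for each question

# L_M(Q_i) = own loss minus the mean of the other questions' losses
loss = np.array([
    base[i] - (base.sum() - base[i]) / (N - 1)
    for i in range(N)
])

# The losses across the tree sum to zero (up to float error).
print(np.isclose(loss.sum(), 0.0))  # True
```

Algebraically: summing $L_M(Q_i)$ over all $i$ gives $S - \frac{1}{N-1}(NS - S) = S - S = 0$, where $S$ is the sum of the individual distance terms.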
It is also worth noting, however, that even if this works it is a very artificial fix, since the term you’re subtracting is a constant with no dependence on M(Q), so if you’re trying to do gradient descent to optimize this loss, it won’t change anything at all (which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we’re still back at the problem of none of this working unless you’re willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.
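To make the gradient-descent point concrete: under the same squared-error assumption, the subtracted term is a constant offset with respect to $M(Q)$, so the gradient of the zero-sum loss is identical to the gradient of the original loss. A toy finite-difference check (hypothetical scalars, not the actual training setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# One question Q: target t = H(Q | M'), prediction p = M(Q).
# c stands in for the averaged M'(Q') term over the *other*
# questions, which does not vary with p.
t = rng.normal()
c = rng.normal()

def original_loss(p):
    return (p - t) ** 2

def zero_sum_loss(p):
    return (p - t) ** 2 - c  # constant shift only

# Central-difference gradients at an arbitrary point agree:
p, eps = 0.3, 1e-6
g_orig = (original_loss(p + eps) - original_loss(p - eps)) / (2 * eps)
g_zero = (zero_sum_loss(p + eps) - zero_sum_loss(p - eps)) / (2 * eps)
print(np.isclose(g_orig, g_zero))  # True
```

Since the constant cancels in the difference, gradient descent sees exactly the same update direction either way, which is the sense in which the fix changes nothing about training.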
> which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives
Sorry I haven’t followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?
In the case of “actual” IDA, I guess the plan is for each overseer to look inside the model they’re training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I’m not sure how that can happen at the lower levels where the overseers are not very smart.
Yeah, that’s a good point.