I think that piecewise objectives are quite reasonable and natural, and I don’t think they’ll make transparency that much harder. I don’t think there’s any reason to expect objectives to be continuous in some nice way, so I fully expect you’ll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still simple enough that you should be able to use interpretability tools to understand it pretty effectively. A switch statement is not that complicated or hard to interpret, and most of the real hard work is still being done in the optimization.
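To make the split between a simple piecewise objective and a generic optimizer concrete, here is a minimal sketch. Everything in it (the name `piecewise_objective`, the thresholds, the brute-force `search`) is a made-up assumption for illustration, not a claim about how any real learned model is structured.

```python
from typing import Callable, Iterable


def piecewise_objective(state: int) -> float:
    """Hypothetical fixed switch-case objective with a discontinuous jump."""
    if state < 0:
        return -float(state * state)  # case 1: penalize negative states
    elif state < 10:
        return float(state)           # case 2: value grows linearly
    else:
        return 50.0 - state           # case 3: value jumps up, then decays


def search(objective: Callable[[int], float], candidates: Iterable[int]) -> int:
    """Generic optimizer: it knows nothing about the cases above."""
    return max(candidates, key=objective)


best = search(piecewise_objective, range(-5, 21))
print(best, piecewise_objective(best))
```

The objective jumps from 9.0 at state 9 to 40.0 at state 10, so it is not continuous, yet reading it only requires understanding three fixed branches. The `search` function is where the general-purpose work happens, and it is the same regardless of how many cases the objective has.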
I do think there are a lot of possible ways in which the interpretability-for-mesa-optimizers story could break down, which is why I’m still pretty uncertain about it, but I don’t think that a switch-case agent is such an example. Probably the case that I’m most concerned about right now is an agent whose objective changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult. It is also the sort of thing that humans seem to do, which suggests that we might see it in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent’s objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent’s own optimization: a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you’re still using search to get your robust capabilities.
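As a rough sketch of the feedback-loop case I’m worried about, here is a toy example in which the branch of the objective that is active depends on what the agent’s own search chose on the previous step. The names `make_objective` and `run`, the parity rule, and the action range are all hypothetical choices made purely for illustration.

```python
from typing import Callable, List


def make_objective(last_choice: int) -> Callable[[int], float]:
    """Hypothetical objective whose active case depends on the last search result."""
    def objective(action: int) -> float:
        if last_choice % 2 == 0:
            target = last_choice + 3  # case taken after an even choice
        else:
            target = last_choice - 1  # case taken after an odd choice
        return -abs(action - target)  # prefer actions close to the target
    return objective


def run(steps: int, start: int = 0) -> List[int]:
    """Alternate between rebuilding the objective and searching under it."""
    history = [start]
    for _ in range(steps):
        objective = make_objective(history[-1])
        history.append(max(range(-10, 11), key=objective))
    return history


print(run(4))
```

Here you can no longer read the objective off a fixed switch statement: to know which case applies at a given step you have to trace the whole history of the search. This toy version is only one layer deep, whereas the case I am concerned about could nest many layers. Note that the capabilities still come from the same search step (`max` over candidate actions) throughout.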
If one’s interpretation of the ‘objective’ of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we doing by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it’s maximizing the probability of outputting the following [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that it is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn’t shed light on its internal operations any more than just spelling out exactly what its source code is.
I don’t feel like you’re really understanding what I’m trying to say here. I’m happy to chat with you about this more over video call or something if you’re interested.
Sure, we can talk about this over video. Check your Facebook messages.