Planned summary for the Alignment Newsletter:
This post argues that agents can have <@capability generalization without objective generalization@>(@2-D Robustness@), _without_ having an agent that does internal search in pursuit of a simple mesa objective. Consider an agent that learns different heuristics for different situations and selects among them using something like a switch statement. For example, in Lunar Lander, if at training time the landing pad is always red, the agent may learn a heuristic about which thrusters to apply based on the position of red ground relative to the lander. If the pad were a different color at test time, such an agent might still fly competently while heading for the wrong target. The post argues that this selection across heuristics could still happen with very complex agents (though the heuristics themselves may involve search).
I generally agree that you could get powerful agents that nonetheless are “following heuristics” rather than “doing search”; however, others with differing intuitions did not find this post convincing.
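To make the "switch statement over heuristics" picture from the summary concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the `Observation` fields, the action names, the thresholds, and the heuristics themselves are hypothetical and are not taken from the post or from any real Lunar Lander environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    # Hypothetical, simplified observation; not a real environment's API.
    lander_x: float
    altitude: float
    red_ground_x: Optional[float]  # x-position of visible red ground, or None


def steer_toward_red(obs: Observation) -> str:
    """Heuristic: move horizontally toward whatever red ground is visible."""
    offset = obs.red_ground_x - obs.lander_x
    if offset > 0.1:
        return "push_right"
    if offset < -0.1:
        return "push_left"
    return "coast"


def brake_near_ground(obs: Observation) -> str:
    """Heuristic: slow the descent when close to the surface."""
    return "brake"


def policy(obs: Observation) -> str:
    """The 'switch statement': pick a heuristic from the situation.

    There is no explicit objective and no search over plans here,
    only a fixed dispatch between learned-looking rules.
    """
    if obs.altitude < 1.0:
        return brake_near_ground(obs)
    if obs.red_ground_x is not None:
        return steer_toward_red(obs)
    return "coast"


# Training-like situation: the red patch really is the landing pad.
train_obs = Observation(lander_x=0.0, altitude=10.0, red_ground_x=5.0)

# Shifted situation: suppose the pad is now another color and the only
# red ground is an unrelated patch at x = -3.
shifted_obs = Observation(lander_x=0.0, altitude=10.0, red_ground_x=-3.0)

print(policy(train_obs))    # steers toward the pad
print(policy(shifted_obs))  # steers just as competently, toward the wrong patch
```

In the shifted case the agent's flying skill carries over, but what it is steering toward does not match the intended goal, and nothing in `policy` looks like internal search for a mesa objective.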