Computing the fastest route to Paris doesn’t involve search?
More generally, I think in order for it to work your example can’t contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.
My example uses search, but the search is not the search of the inner alignment failure. It is merely a subroutine that is called upon by this outer superstructure, which itself is the part that is misaligned. Therefore, I fail to see why my point doesn’t follow.
If your position is that inner alignment failures must only occur when internal searches are misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the reason for misalignment was not due to a search being misaligned (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).
You are right; my comment was based on a misunderstanding of what you were saying. Hence why I unendorsed it.
(I read ” In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then will provide a concrete example of an agent in the category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents. ” and thought you meant agents that don’t use internal search at all.)