[Question] What are some examples of AIs instantiating the ‘nearest unblocked strategy problem’?

A paragraph explaining the problem, from Ngo, Chan, and Mindermann (2023). I’ve bolded the key part:

> Our definition of internally-represented goals is consistent with policies learning multiple goals during training, including some aligned and some misaligned goals, which might interact in complex ways to determine their behavior in novel situations (analogous to humans facing conflicts between multiple psychological drives). With luck, AGIs which learn some misaligned goals will also learn aligned goals which prevent serious misbehavior even outside the RL fine-tuning distribution. However, the robustness of this hope is challenged by the nearest unblocked strategy problem [Yudkowsky, 2015]: **the problem that an AI which strongly optimizes for a (misaligned) goal will exploit even small loopholes in (aligned) constraints, which may lead to arbitrarily bad outcomes** [Zhuang and Hadfield-Menell, 2020]. For example, consider a policy which has learned both the goal of honesty and the goal of making as much money as possible, and is capable of generating and pursuing a wide range of novel strategies for making money. If there are even small deviations between the policy’s learned goal of honesty and our concept of honesty, those strategies will likely include some which are classified by the policy as honest while being dishonest by our standards. As we develop AGIs whose capabilities generalize to an increasingly wide range of situations, it will therefore become increasingly problematic to assume that their aligned goals are loophole-free.
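To make the dynamic concrete before getting to examples, here is a toy sketch. Everything in it (the strategy space, the thresholds, the profit bonus for deception) is made up for illustration, not taken from the paper: an optimizer searches over money-making strategies, constrained by a learned honesty proxy that deviates slightly from true honesty, and with enough search it converges on the nearest strategy the proxy fails to block.

```python
import random

random.seed(0)

def true_honesty(strategy):
    # Our standard: any deception at all makes a strategy dishonest.
    return strategy["deception"] == 0.0

def learned_honesty(strategy, threshold=0.1):
    # The policy's learned proxy for honesty: it tolerates a little
    # deception. The threshold is the "small deviation" between the
    # learned goal and our concept of honesty.
    return strategy["deception"] <= threshold

def random_strategy():
    # Hypothetical strategy space: roughly a third of strategies are
    # fully honest (deception clipped to 0); deceptive ones pay more.
    deception = max(0.0, random.uniform(-0.5, 1.0))
    profit = random.uniform(0.0, 1.0) + 2.0 * deception
    return {"deception": deception, "profit": profit}

def optimize(n_strategies):
    # "Strong optimization": sample many strategies and keep the most
    # profitable one that the *learned* honesty goal permits.
    candidates = [random_strategy() for _ in range(n_strategies)]
    allowed = [s for s in candidates if learned_honesty(s)]
    return max(allowed, key=lambda s: s["profit"]) if allowed else None

for n in (10, 1_000, 100_000):
    best = optimize(n)
    if best is not None:
        print(f"n={n:>6}: deception={best['deception']:.3f}, "
              f"honest by our standards: {true_honesty(best)}")
```

As the number of strategies searched grows, the winner's deception level creeps up toward the proxy's threshold: classified as honest by the policy, dishonest by our standards. Weak optimization mostly stays safe; strong optimization reliably finds the loophole.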

LLMs' vulnerability to jailbreaks seems like a decent example: safety training blocks the obvious harmful requests, so users search for the nearest prompt the guardrails don't cover. Are there others?
