More reasons to worry about relying on constraints:
As you say, your constraints might be insufficiently general (‘nearest unblocked strategy,’ etc.). This seems like a big issue to me: people like Jesus and the Buddha seem to have gained huge amounts of influence without needing to violate any obvious deontological constraints.
Your constraints might be insufficiently strong (e.g. maybe the constraints are strong enough to keep the AI compliant all throughout training but then the AI gets a really great opportunity in deployment...).
Your constraints might be just ‘outer shell,’ like humans’ instinctual fear of heights (Barnett and Gillen). The AI might see them as an obstacle to overcome, rather than as a part of its terminal values.
Your constraints might actually be false beliefs that later get revised (e.g. that lying never pays) (Barnett and Gillen).
Your constraints might cause theoretical problems that motivate the AI to revise them away (e.g. money pumps, intransitivities, violations of the Independence of Irrelevant Alternatives, implausible dependence on which outcome is designated as the status quo, paralysis, trouble dealing with risk, arbitrariness of constraints’ exact boundaries).
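The money-pump worry above is easy to make concrete. Here is a minimal, hypothetical sketch (names and numbers are mine, not from the original) of how an agent with cyclic preferences A > B > C > A can be charged a small fee for each "upgrade" and end up exactly where it started, strictly poorer; an AI that notices this pattern has an internal incentive to revise the offending preferences away:

```python
# Hypothetical illustration: cyclic (intransitive) strict preferences.
# (x, y) in PREFERS means the agent strictly prefers x to y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # A>B, B>C, C>A: a cycle

def money_pump(start="A", fee=1.0, rounds=3):
    """Repeatedly offer the agent an item it strictly prefers, charging a fee per swap."""
    item, paid = start, 0.0
    # For each holding, the item the agent strictly prefers to it.
    preferred_swap = {"A": "C", "B": "A", "C": "B"}
    for _ in range(rounds):
        offer = preferred_swap[item]
        assert (offer, item) in PREFERS  # the agent prefers the offer, so it accepts
        item, paid = offer, paid + fee
    return item, paid

item, paid = money_pump()
print(item, paid)  # back to "A", having paid 3.0 for nothing
```

After three trades the agent holds its original item and is three fees down; with transitive preferences no such exploitation cycle exists.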
Your constraints might cause other misalignments (e.g. the AI wants to take extreme measures to prevent other agents from lying too).
Your constraints might make the AI incapable (e.g. they might falsify the strategy-stealing assumption, or make AIs too timid [e.g. maybe the AI will be extremely reluctant to say anything it’s not absolutely certain of]).
Your constraints might fail to motivate the AI to do good alignment work (e.g. the AI produces alignment slop).
Your constraints might make the AI bad at moral philosophy (and we might need AI-powered moral philosophy to get a really good future).