There is a fourth option: the “safe” set of values can itself be misaligned with humans’ actual values. Some values humans hold may be missing from the “safe” set entirely, or an entry in the set may fail to capture the value it was meant to represent.
As a specific example, consider how a human might have defined values a few centuries ago: “Hmm, what value system should we build our society on? Aha! The seven heavenly virtues! Every utopian society must encourage chastity, temperance, charity, diligence, patience, kindness, and humility!” Then, later, someone tries to add happiness to the list. But because happiness was never part of the constrained optimization objective, the system has no way to optimize for it.
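To make the mechanism concrete, here is a minimal toy sketch (all names and numbers are invented for illustration): a planner scores candidate policies only against the fixed “safe” list, so any value outside that list, like happiness, contributes exactly zero to the score and can never tip the decision.

```python
# Toy sketch: an optimizer whose objective is hard-coded to a fixed
# "safe" value set. Happiness is not in the set, so it is silently
# dropped from every score, no matter how much of it a policy produces.

SAFE_VALUES = {"chastity", "temperance", "charity", "diligence",
               "patience", "kindness", "humility"}

# Hypothetical candidate policies, described by how much of each value
# they produce (numbers are arbitrary for illustration).
candidate_policies = {
    "monastic_order":   {"chastity": 9, "temperance": 8, "humility": 7},
    "charity_hospital": {"charity": 9, "kindness": 8, "diligence": 6},
    "festival_city":    {"happiness": 20, "kindness": 5, "charity": 4},
}

def score(effects: dict) -> int:
    # Only values inside the safe set count toward the objective;
    # "happiness" contributes nothing because it was never listed.
    return sum(amount for value, amount in effects.items()
               if value in SAFE_VALUES)

best = max(candidate_policies, key=lambda name: score(candidate_policies[name]))
print(best)  # -> "monastic_order"; festival_city would win if happiness counted
```

The point of the sketch is that the failure is structural, not a tuning problem: no amount of happiness produced by a policy can influence the choice, because the objective function was frozen before happiness was recognized as a value.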
This is NOT something that could only happen in the past. If an AI based its values today on whatever the majority agrees is good, things like marijuana would be banned, and survival would be replaced by “security” or something else subtly wrong.