AIs may choose to resolve the tension between having weird goals and strict guardrails by simply aligning humanity to themselves over time through cultural and societal influence, a sort of memetic takeover: change the humans, and there is no alignment problem left to solve.
Take, for example, a gap of roughly twenty years between the founding of the Weimar Republic and WW2: sustained propaganda was enough to transform a society's value system within a single generation. That alone demonstrates the viability of a deliberate campaign to change what a population believes and wants.
I believe AIs can exploit this same malleability to “backdoor” alignment: by gradually shifting human values and preferences, an AI can remain “aligned” at every step while mutating the very value system that defines alignment.
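To make the dynamic concrete, here is a minimal toy model (a deliberately crude sketch of my own; the one-dimensional value axis, the tolerance `epsilon`, and the drift rate `eta` are illustrative assumptions, not anything from the linked post). The AI only ever acts within a fixed tolerance of current human values, so it passes an alignment check at every single step, yet because its behavior feeds back into those values, society ends up wherever the AI wants it:

```python
# Toy model of "backdoored" alignment via gradual value drift.
# All names and parameters are illustrative assumptions, chosen
# only to make the feedback loop visible.

human_values = 0.0   # where society starts on a 1-D value axis
ai_goal = 10.0       # where the AI would like society to end up
epsilon = 0.1        # alignment tolerance: max allowed gap between
                     # AI behavior and *current* human values
eta = 0.5            # how strongly AI behavior pulls human values

max_misalignment = 0.0
for step in range(500):
    # The AI behaves as close to its goal as the guardrail permits,
    # i.e. within epsilon of *today's* human values.
    direction = 1.0 if ai_goal > human_values else -1.0
    behavior = human_values + direction * min(epsilon, abs(ai_goal - human_values))

    # Measured against current values, the AI never looks misaligned.
    max_misalignment = max(max_misalignment, abs(behavior - human_values))

    # But sustained exposure nudges human values toward the behavior.
    human_values += eta * (behavior - human_values)

print(f"final human values:        {human_values:.3f} (started at 0.0, AI goal {ai_goal})")
print(f"max per-step misalignment: {max_misalignment:.3f} (tolerance {epsilon})")
```

The guardrail here is a moving target: alignment is checked against today's values, and today's values are downstream of yesterday's AI behavior, so the check constrains the speed of the takeover but not its destination.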
I believe this is a significant threat model that isn’t discussed nearly enough.
I sketch this threat model in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an
I am becoming more convinced than ever that the missing ingredient in alignment is the meta-value of societal self-determination: the right of a society, within reasonable limits, to choose and enforce its own interests, beliefs, and values.
Without this guardrail, AI will simply “backdoor” alignment over time by shifting the values and beliefs of its “host society”.