a) There are true adversarial examples for human values, situations where our values misbehave and we have no way of ever identifying that, in which case we have no hope of solving this problem, because solving it would mean we are in fact able to identify the adversarial examples.
or
b) Humans are actually immune to adversarial examples, in the sense that we can identify the situations in which our values (or rather, a subset of them) would misbehave (like being addicted to social medial), such that our true, complete values never do, and an AI that accurately models humans would also have such immunity.
Regarding 1, it either seems like
a) There are true adversarial examples for human values, situations where our values misbehave and we have no way of ever identifying that, in which case we have no hope of solving this problem, because solving it would mean we are in fact able to identify the adversarial examples.
or
b) Humans are actually immune to adversarial examples, in the sense that we can identify the situations in which our values (or rather, a subset of them) would misbehave (like being addicted to social medial), such that our true, complete values never do, and an AI that accurately models humans would also have such immunity.