[Question] How could a friendly AI deal with humans trying to sabotage it? (much as present-day internet trolls cause such problems)

After closely observing behaviours online and in person for a few years, it has become ever clearer to me that a significant fraction (>1%) of the regular adult population would intentionally sabotage any future friendly AI. (I am speaking of Canada and the US, though this probably applies to most if not all countries.) Some, such as sadists, wouldn’t even be particularly against friendly AI but would still behave destructively for kicks.

And that’s not to mention career criminals, or the really deranged who probably would be even less inhibited. I would bet even many teenagers would do so just for curiosity’s sake.

The phenomenon, I think, would closely mirror the behaviour of present-day internet trolls when given an opportunity and an anonymizing screen to shield them. Probably a significant fraction would be the same people. I assume everyone reading this has already encountered such individuals, as they appear in most large online communities including LW, so I will spare the examples.

The most straightforward solution I can think of is for the ‘friendly AI’ to treat these people adversarially, as if they were really trying to destroy or confound it, even if they might not actually want to. (This excludes AIs that are completely isolated from uncontrolled interaction.) But of course this introduces the problem of the supposedly ‘friendly’ AI needing to ‘combat’, ‘doubt’, ‘interrogate’, ‘oppress’, etc., some percentage of humans in order to carry out its normal functions.

Also, there seems to be a slippery slope: once any such ‘friendly’ AI regularly carries out such operations, it seems highly unlikely that it will be able to resist applying the same methods to merely annoying humans who are not quite as dangerous, such as those who reduce trust by breaking written promises, confidences, rules, etc., out of carelessness, temptation, and so on. Then onwards to compulsive alcoholics, gamblers, drug addicts, cultists, and others who negatively affect communities. Then perhaps even seemingly average people who nonetheless ruin the conversations they participate in, such as by invoking Godwin’s law or making other specious comparisons.

Admittedly, this slippery slope may take decades or centuries to slide down, as there are defensible Schelling points and other factors along the way, well discussed on LW, that would counteract this phenomenon.