You’re right, there are a thousand ways an AGI could use deception to manipulate humans into trusting it. But that would be a dishonest strategy. The interesting question to me is whether, under certain circumstances, just being honest would be better in the long run. This depends on the actual formulation of the goal/reward function and on the definitions. For example, we could try to define trust in a way that rules out force, amnesia, drugs, hypnosis, and other such means of influence by definition. This is of course not easy, but as stated above, we’re not claiming we’ve solved all problems.
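To make that a bit more concrete, here is a minimal toy sketch (my own illustration, not a worked-out proposal) of what “ruling out” certain influence channels could look like in a reward definition: trust produced through a disallowed channel simply contributes nothing, so obtaining it is not instrumentally useful to the agent. The names and the channel taxonomy are hypothetical assumptions, and the hard part is of course classifying how trust was actually produced, not summing it up.

```python
# Toy sketch only: a trust "reward" that counts trust as zero whenever it was
# produced through a stipulated-invalid influence channel. All identifiers here
# are hypothetical illustrations, not part of any actual proposal.

from dataclasses import dataclass

# Influence channels we stipulate as invalid sources of trust (by definition).
DISALLOWED_CHANNELS = {"force", "amnesia", "drugs", "hypnosis"}

@dataclass
class TrustSignal:
    strength: float   # how much the human trusts the agent, in [0, 1]
    channel: str      # how that trust was produced, e.g. "honest_track_record"

def trust_reward(signals: list[TrustSignal]) -> float:
    """Sum trust strengths, ignoring trust gained through disallowed channels."""
    return sum(s.strength for s in signals if s.channel not in DISALLOWED_CHANNELS)

# Example: hypnosis-induced trust adds nothing to the objective,
# so manipulating people through these channels buys the agent nothing.
signals = [
    TrustSignal(0.9, "honest_track_record"),
    TrustSignal(1.0, "hypnosis"),
]
print(trust_reward(signals))  # 0.9
```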
In my experience, people’s typical reaction to discovering that their favorite leader lied is to keep going as usual.
That’s a valid point. However, in these cases, “trust” has two different dimensions. One is trust in what a leader says, and I believe that even the most loyal followers realize that Putin often lies, so they won’t believe everything he says. The other is trust that the leader is “right for them”: even with his lies and deception, he still serves their own goals. I guess that is what their “trust” is really grounded on—“if Putin wins, I win, so I’ll accept his lies, because they benefit me”. From their perspective, Putin isn’t “evil”, even though they know he lies. If, however, he suddenly acted against their own interests, they’d feel betrayed, even if he never lied about that.
An honest trust maximizer would have to earn trust along both of these dimensions, and to do that it would have to find ways to benefit even groups with conflicting interests, ultimately bridging most of their divisions. This seems like an impossible task, but human leaders have achieved something like it before, reconciling their nations and creating a sense of unity, so a superintelligence should be able to do it as well.