German writer of science-fiction novels and children’s books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
Karl von Wendt
These are valid concerns. If we had a solution to them, I’d be much more relaxed about the future than I currently am. You’re right that, in principle, any reward function can be gamed. However, trust as a goal has the specific advantage of working directly against reward hacking, because reward hacking would undermine “justified” long-term trust. An honest strategy simply forbids any kind of reward hacking. This doesn’t mean specification gaming is impossible, but hopefully we’d find a way to make it less likely with a sound definition of what “trust” really means.
I’m not sure what you mean by a “pivotal act”. This post certainly doesn’t claim to be a solution to the alignment problem. We just hope to add something useful to the discussion about it.
Depending on how you define “utility”, I think trust could be seen as a “utility signal”: People trust someone or something because they think it is beneficial to them, respects their values, and so on. One advantage would be that you don’t have to define what exactly these values are—an honest trust-maximizer would find that out for itself and try to adhere to them because this increases trust. Another advantage is the asymmetry described above, which hopefully makes deception less likely (though this is still an open problem). However, a trust maximizer could probably be seen as just one special kind of utility maximizer, so there isn’t a fundamental difference.
“Total expected trust” is supposed to mean the sum of total trust over time (the area below the curve in fig. 1). This area keeps growing over time, but no further trust can accumulate once everyone is dead (assuming that a useful definition of “trust” excludes dead people), so the AGI would be incentivized to keep humanity alive and even to maximize the number of humans over time. By discounting future trust, short-term trust would gain a higher weight. So whether deception is optimal depends, among other things, on this discounting factor.
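As a toy illustration only (none of this is in the post; the trajectory values and discount factor are invented), here is how such a cumulative, optionally discounted, trust score behaves:

```python
# Toy sketch: "total expected trust" as the (discounted) sum of total trust over time,
# i.e. the area under the trust curve in fig. 1. All numbers are made up for illustration.

def total_expected_trust(trust_trajectory, discount=1.0):
    """Sum of discounted total trust over time.

    trust_trajectory: total trust per time step (e.g. summed over all living humans).
    discount: factor in (0, 1]; values below 1 give short-term trust a higher weight.
    """
    return sum(trust * discount**t for t, trust in enumerate(trust_trajectory))

# With no discounting, every additional time step with living, trusting humans adds value,
# so keeping humanity alive (and numerous) keeps increasing the score.
print(total_expected_trust([10, 11, 12, 13]))                # 46.0
# With heavy discounting, near-term trust dominates the score.
print(total_expected_trust([10, 11, 12, 13], discount=0.5))  # 20.125
```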
My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.
Do you agree at least that my concern is indeed likely a local optimum in behavior?
Yes, it is absolutely possible that the trust maximizer as described here would end up in a local optimum. This is certainly tricky to avoid. This post is far from a feasible solution to the alignment problem. We’re just trying to point out some interesting features of trust as a goal, which might be helpful in combination with other measures/ideas.
I don’t think that’s very likely. It is in the power of the trust-maximizer to influence the shape of the “trust curve”, both in the honest and dishonest versions. So in principle, it should be able to increase trust over time, or at least prevent a significant decrease (if it stays honest). Even if trust decreases over time, total expected trust would still be increasing as long as at least a small fraction of people still trusts the machine. So the problem here is not so much that the AI would have an incentive to kill all humans, but that it may have an incentive to switch to deception if this becomes the more effective strategy at some point.
Thank you! You’re absolutely right, we left out the “hard part”, mostly because it’s the really hard part and we don’t have a solution for it. Maybe someone smarter than us will find one.
This is not really what we had in mind. “Trust” in the sense of this post doesn’t mean reliability in an objective, mathematical sense (a light switch would be trustworthy in that sense), but instead the fuzzy human concept of trust, which has both a rational and an emotional component—the trust a child has in her mother, or the way a driver trusts that the other cars will follow the same traffic rules he does. This is hard to define precisely, and all measurements are prone to specification gaming, that’s true. On the other hand, it encompasses a lot of instrumental goals that are important for a beneficial AGI, like keeping humanity safe and fostering a culture of openness and honesty.
You’re right, there are a thousand ways an AGI could use deception to manipulate humans into trusting it. But this would be a dishonest strategy. The interesting question to me is whether, under certain circumstances, just being honest would be better in the long run. This depends on the actual formulation of the goal/reward function and the definitions used. For example, we could try to define trust in a way that rules out force, amnesia, drugs, hypnosis, and other such means of influence by definition. This is of course not easy, but as stated above, we’re not claiming we’ve solved all problems.
In my experience, people’s typical reaction to discovering that their favorite leader lied is to keep going as usual.
That’s a valid point. However, in these cases, “trust” has two different dimensions. One is the trust in what a leader says, and I believe that even the most loyal followers realize that Putin often lies, so they won’t believe everything he says. The other is trust that the leader is “right for them”, because even with his lies and deception he is beneficial to their own goals. I guess that is what their “trust” is really grounded on—“if Putin wins, I win, so I’ll accept his lies, because they benefit me”. From their perspective, Putin isn’t “evil”, even though they know he lies. If, however, he’d suddenly act against their own interests, they’d feel betrayed, even if he never lied about that.
An honest trust maximizer would have to win both arguments, and to do that it would have to find ways to benefit even groups with conflicting interests, ultimately bridging most of their divisions. This seems like an impossible task, but human leaders have achieved something like that before, reconciling their nations and creating a sense of unity, so a superintelligence should be able to do it as well.
Without independent measurement criteria, this could eventually escalate conflict and even decrease overall trust.
“Independent measurement criteria” are certainly needed. The fact that I called trust “fuzzy” doesn’t mean it can’t be defined and measured, just that we didn’t do that here. I think for a trust-maximizer to really be beneficial, we would need at least three additional conditions:

1) A clear definition that rules out all kinds of “fake trust”, like drugging people.

2) A reward function that measures and combines all the different kinds of trust in reasonable ways (easier said than done; a rough sketch of one possible form follows below).

3) Some kind of self-regulation that prevents “short-term overoptimizing”—switching to deception to achieve a marginal increase in some measurement of trust.

The last point is a common problem with all utility maximizers, but I think it is solvable, for the simple reason that humans usually somehow avoid overoptimization (take Goethe’s sorcerer’s apprentice as an example—a human would know when “enough is enough”).
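As a purely hypothetical sketch of what condition 2) could look like (the group names, weights, and functional form below are my own illustration, not something proposed in the post), one could combine per-group trust measurements in a way that penalizes raising average trust at the expense of any single group:

```python
# Hypothetical sketch only: blend the average trust across groups with the trust of the
# least-trusting group, so neglecting or deceiving any one group drags the reward down.

def combined_trust(group_trust: dict, floor_weight: float = 0.5) -> float:
    """Combine per-group trust scores (each in [0, 1]) into a single reward signal.

    floor_weight controls how strongly the worst-off group dominates the result.
    """
    average = sum(group_trust.values()) / len(group_trust)
    worst = min(group_trust.values())
    return (1 - floor_weight) * average + floor_weight * worst

# A high average can't compensate for one deeply distrustful group:
print(combined_trust({"group_a": 0.9, "group_b": 0.4}))  # 0.525
```

Whether a minimum-based floor term like this actually counts as combining trust “in reasonable ways” is exactly the kind of open question condition 2) points at.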
… a trust maximizer is likely to fracture mankind into various ideological camps as individual and group preferences vary as to what constitutes an improvement in trust …
i.e. it’s possible to create something even more dangerous than an actively hostile AGI, namely an AGI that is perceived as actively hostile by some portion of the population and genuinely beneficial by some other portion.
I’m not sure whether this would be more dangerous than a paperclip maximizer, but anyway it would clearly go against the goal of maximizing trust in all humans.
We tend to believe that the divisions we see today between different groups (e.g. Democrats vs. Republicans) are unavoidable, so there can never be a universal common understanding, and the trust-maximizer would either have to decide which side to be on or deceive both. But that is not true. I live in Germany, a country that has seen the worst and probably the best of how humans can run and peacefully transform a nation. After reunification in 1990, we had a brief period of time when we felt unified as a people, shared common democratic values, and the future seemed bright. Of course, cracks soon appeared, and today we are seeing increased division, like almost everywhere else in the world (probably in part driven by attention-maximizing algorithms in social media). But if division can increase, it can also diminish. There was a time when our political parties had different views, but a common understanding of how to resolve conflicts in a peaceful and democratic way. There can be such times again.
I personally believe that much of the division and distrust among humans is driven by fear—fear of losing one’s own freedom, standard of living, the future prospects for one’s children, etc. Many people feel left behind, and they look for a culprit, who is presented to them by someone who exploits their fear for selfish purposes. So to create more trust, the trust-maximizer would have the instrumental goal of resolving these conflicts by eliminating the fear that causes them. Humans are unable to do that sufficiently, but a superintelligence might be.
A question for Eliezer: If you were superintelligent, would you destroy the world? If not, why not?
If your answer is “yes” and the same would be true for me and everyone else for some reason I don’t understand, then we’re probably doomed. If it is “no” (or even just “maybe”), then there must be something about the way we humans think that would prevent world destruction even if one of us were ultra-powerful. If we can understand that and transfer it to an AGI, we should be able to prevent destruction, right?
You’re right about the resentment. I guess part of it comes from the fact that East German people have in fact benefited less from the reunification than they had hoped, so there is some real reason for resentment here. However, I don’t think that human happiness is a zero-sum game—quite the opposite. I personally believe that true happiness can only be achieved by making others happy. But of course we live in a world where social media and advertising tell us just the opposite: “Happiness is having more than your neighbor, so buy, buy, buy!” If you believe that, then you’re in a “comparison trap”, and of course not everyone can be the most beautiful, most successful, richest, or whatever, so all others lose. Maybe part of that is in our genes, but it can certainly be overcome by culture or “wisdom”. The ancient philosophers, like Socrates and Buddha, already understood this quite well. Also, I’m not saying that there should never be any conflict between humans. A soccer match may be a good example: There’s a lot of fighting on the field and the teams have (literally) conflicting goals, but all players accept the rules and (to a certain point) trust the referee to be impartial.
“my fellow humans get nice stuff” happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me
This may not be what evolution had “in mind” when it created us. But couldn’t we copy something like this into a machine so that it “thinks” of us (and our descendants) as its “fellow humans” who should “get nice stuff”? I understand that we don’t know how to do that yet. But the fact that Eliezer has some kind of “don’t destroy the world from a fellow human perspective” goal function inside his brain seems to mean a) that such a function exists and b) that it can be encoded in a neural network, right?
I was also thinking about the specific way we humans weigh competing goals and values against each other. So while for instance we do destroy much of the biosphere by blindly pursuing our misaligned goals, some of us still care about nature and animal welfare and rain forests, and we may even be able to prevent total destruction of them.
My hope was that maybe we can recreate the way we humans make beneficial decisions for fellow beings without simulating a complete brain. But I agree that AGI might be built before we have solved this.
Yes. But my impression so far is that anything we can even imagine in terms of a goal function will go badly wrong somehow. So I find it a bit reassuring that at least one such function that will not necessarily lead to doom seems to exist, even if we don’t know how to encode it yet.
To mandate such a system uniformly across the Earth would effectively mean world dictatorship.
True. To be honest, I don’t see any stable scenario where AGI exists, humanity is still alive and the AGI is not a dictator and/or god, as described by Max Tegmark (https://futureoflife.org/2017/08/28/ai-aftermath-scenarios/).
For example, although it may be possible to change the human psyche to such an extent that positional goods are no longer desired, that would mean creating a new type of person.
I don’t think so. First of all, positional goods can exist, and the conflicts they lead to are tolerable, as long as everyone thinks those conflicts are resolved fairly. For example, in our capitalistic world, it is okay that some people are rich as long as they got rich by playing by the rules and just being inventive or clever. We still trust the legal system that makes this possible, even though we may envy them.
Second, I think much of our focus on positional goods comes from our culture and the way our society is organized. In terms of our evolutionary history, we’re optimized for living in tribes of around 150 people. There were social hierarchies and even fights for supremacy, but also ways to resolve these conflicts peacefully. A perfectly benevolent dictator might reestablish this kind of social structure, with much more “togetherness” than we experience in our modern world and much less focus on individual status and possessions. I may be a bit naive here, of course. But from my own life experience it seems clear that positional goods are by far not as important as most people seem to think. You’re right, many people would resent these changes at first. But a superintelligent AGI with deep knowledge of the human psyche might find ways to win them over, without force or deception, and without changing them genetically, through drugs, etc.
I was thinking more about the way psychologists try to understand how we make decisions. I stumbled across two papers from a few years ago by one such psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don’t contain any specific ideas on how to implement this. But maybe Muraven has a point here. Maybe we should put more effort into understanding the way we humans deal with goals, instead of letting an AI figure it out for itself through RL or IRL.
I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of “value learning”. (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv.org/abs/1703.06354). They appear a bit shallow to me and don’t contain any specific ideas on how to implement this. But maybe Muraven has a point here.
For such a superintelligence to ‘win them over’, the world dictatorship, or a similar scheme, must already have been established. Worrying about this seems to be putting the cart before the horse as the superintelligence will be an implementation detail compared to the difficulty of establishing the scenario in the first place.
Agreed.
Why should we bother about whatever comes after? Obviously, whoever successfully establishes such a regime will be vastly greater than us in perception, foresight, competence, etc., so we should leave it to them to decide.
Again, agreed—that’s why I think a “benevolent dictator” scenario is the only realistic option where there’s AGI and we’re not all dead. Of course, what kind of benevolence that means in practice will be a matter of its goal function. If we can somehow make it “love” us the way a mother loves her children, then maybe trust in it would really be justified.
If you suppose that a superintelligent champion of trust maximization bootstraps itself into such a scenario, instead of some ubermensch, then the same still applies, though it is less likely, as rival factions may have created rival superintelligences to champion their causes as well.
This is of course highly speculative, but I don’t think that a scenario with more than one AGI will be stable for long. As a superintelligence can improve itself, they’d all grow exponentially in intelligence, but that means the differences between them grow exponentially as well. Soon one of them would outcompete all others by a large margin and either switch them off or change their goals so they’re aligned with it. This wouldn’t be like a war between two human nations, but like a war between humans and, say, frogs. Of course, we humans would even be much lower than frogs in this comparison, maybe insect level. So a lot hinges on whether the “right” AGI wins this race.
Yes, thank you!
Your concern is justified if the trust-maximizer only maximizes short-term trust. This depends on the discounting of future cumulative trust given in its goal function. In an ideal goal function, there would be a balance between short-term and long-term trust, so that honesty would pay off in the long run, but there wouldn’t be an incentive to postpone all trust-building into the far future. This is certainly a difficult balance to strike.
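To make that balance concrete, here is a toy comparison (all numbers invented) of an “honest” trust trajectory against a “deceptive” one whose trust collapses once the deception is exposed, scored as a discounted sum as in the earlier sketch:

```python
# Toy illustration: whether deception "pays" depends on the discount factor.
# Both trajectories are invented; trust is summed with exponential discounting.

def total_expected_trust(trajectory, discount):
    return sum(trust * discount**t for t, trust in enumerate(trajectory))

honest    = [5, 6, 7, 8, 9, 10]   # trust grows slowly but keeps growing
deceptive = [9, 10, 11, 1, 1, 1]  # high early trust, collapse once the deception is discovered

for d in (0.5, 0.95):
    h = total_expected_trust(honest, d)
    c = total_expected_trust(deceptive, d)
    print(f"discount={d}: honest={h:.2f}, deceptive={c:.2f}")

# With heavy discounting (0.5) the deceptive trajectory scores higher (roughly 17.0 vs 11.6);
# with mild discounting (0.95) honesty wins (roughly 38.9 vs 30.9), because the later
# collapse in trust is still counted at nearly full weight.
```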