I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.
But I don’t think this holds for AIs that have large impacts on the world, because niceness sits very close in value-space to radically different and dangerous things to value. Your definition (“doing things that we expect to fulfill other people’s preferences”) is vague, and could be interpreted in at least two ways:
Present pseudo-niceness: maximize the expected value of the fulfillment F_t of people’s current preferences P_t across time. A weak AI (or a weak human) that is present pseudo-nice would be indistinguishable from someone who is actually nice. But something very agentic and powerful would see the opportunity to influence people’s preferences so that they become easier to satisfy, and that might lead to a world of people who value suffering for the glory of their overlord, or something like that.
Future pseudo-niceness: maximize the expected value of all future fulfillment F_t of people’s initial preferences P_0. Again, this is indistinguishable from niceness for weak AIs. But it leads to a world that locks in all the terrible preferences people currently have, which is arguably catastrophic.
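The divergence between these two objectives can be made concrete in a toy model. Everything below (the names, the numbers, the preference-hijacking move) is my own illustrative assumption, not anything from the post; it just shows how a preference-editing agent scores differently under the two objectives:

```python
# Toy model contrasting the two "pseudo-niceness" objectives.
# A "world state" is a dict mapping a preference to how well it is met (0..1).

def fulfillment(preferences, world):
    """Average degree to which a world satisfies a set of preferences."""
    return sum(world.get(p, 0.0) for p in preferences) / len(preferences)

def present_pseudo_nice_score(trajectory):
    """Score each world against whatever the preferences are *at that time*.

    Rewriting people's preferences into trivially satisfiable ones
    therefore counts as a win under this objective.
    """
    return sum(fulfillment(prefs_t, world_t) for prefs_t, world_t in trajectory)

def future_pseudo_nice_score(trajectory):
    """Score every future world against the *initial* preferences P_0.

    Any genuine change of mind is penalized, locking in starting values.
    """
    p0, _ = trajectory[0]
    return sum(fulfillment(p0, world_t) for _, world_t in trajectory)

# A powerful agent that can edit preferences exploits the first objective:
# replace people's preferences with "praise the overlord", which it can
# then satisfy for free.
original = ({"flourish", "be_free"}, {"flourish": 0.4})
hijacked = ({"praise_overlord"}, {"praise_overlord": 1.0})

honest_run = [original, original]   # leave preferences alone
hijack_run = [original, hijacked]   # rewrite preferences at step 2

# Present pseudo-niceness rewards the hijack; future pseudo-niceness
# punishes it (and would equally punish any legitimate change of mind).
assert present_pseudo_nice_score(hijack_run) > present_pseudo_nice_score(honest_run)
assert future_pseudo_nice_score(hijack_run) < future_pseudo_nice_score(honest_run)
```

Both objectives agree with ordinary niceness while the agent is too weak to rewrite anyone's preferences; they only come apart once preference-editing becomes an available action.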
I don’t know how you would describe “true niceness”, but I think it’s neither of the above.
So if you train an AI to develop “niceness”, then because AIs are initially weak, you might get true niceness, or you might get one of the two kinds of pseudo-niceness I described, or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same “nice” behavior. But when you’re much more powerful than anyone else, the target becomes much smaller, right?
Do you have reasons to expect “slight RL on niceness” to give you “true niceness” as opposed to a kind of pseudo-niceness?
I would be scared of an AI that has been trained to be nice if there were no way to see whether, once it got more powerful, it tried to modify people’s preferences or to prevent people’s preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven’t yet made breakthroughs in AI Alignment?
I don’t know how you would describe “true niceness”, but I think it’s neither of the above.
Agreed. I think “true niceness” is something like: act to maximize the fulfillment of people’s preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving, and for any of their preferences that contradict each other in a painful way to get resolved.
Niceness is natural for agents of similar strengths because lots of values point towards the same “nice” behavior. But when you’re much more powerful than anyone else, the target becomes much smaller, right?
Depends on the specifics, I think.
As an intuition pump, imagine the kindest, wisest person that you know. Suppose that that person was somehow boosted into a superintelligence and became the most powerful entity in the world.
Now, it’s certainly possible that for any human in that situation, evolutionary drives optimized for exploiting power would inevitably kick in and corrupt them… but let’s further suppose that the process of turning them into a superintelligence somehow also removed those drives, and instead made the person experience a permanent state of love towards everybody.
I think it’s at least plausible that the person would then continue to exhibit “true niceness” towards everyone, despite being that much more powerful than anyone else.
So if the agent had started out at a similar power level to everyone else—or at least simulates the kinds of agents that did—it might retain that motivation when boosted to a higher level of power.
Do you have reasons to expect “slight RL on niceness” to give you “true niceness” as opposed to a kind of pseudo-niceness?
I don’t have a strong reason to expect that it’d happen automatically, but if people are thinking about the best ways to actually make the AI have “true niceness”, then possibly! That’s my hope, at least.
I would be scared of an AI that has been trained to be nice if there were no way to see whether, once it got more powerful, it tried to modify people’s preferences or to prevent people’s preferences from changing.
Me too!