jessicata comments on Selfishness, preference falsification, and AI alignment

jessicata 29 Oct 2021 18:13 UTC
2 points
It seems like some forms of reinforcement learning do make short-term and long-term preferences more coherent with each other: a short-term reward can be attached to a prediction of future reward, e.g. happiness upon having successfully negotiated to buy a house. It seems pretty common for “instrumental” goods like money to be associated with short-term hedonic reward.
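As a rough illustration of the mechanism, here is a minimal sketch using tabular temporal-difference learning (TD(0)). That is one standard RL method chosen for illustration, not necessarily the one intended above. The state names, reward values, and parameters are all hypothetical. The "deal_signed" state pays no reward itself, yet it acquires a high learned value because it predicts the later reward.

```python
# Hypothetical three-step episode; state names and numbers are illustrative only.
STATES = ["negotiating", "deal_signed", "living_in_house"]

# Reward received on entering each state. Only the final state pays out.
REWARD_ON_ENTERING = {
    "negotiating": 0.0,
    "deal_signed": 0.0,
    "living_in_house": 1.0,
}


def run_td0(episodes=200, alpha=0.1, gamma=0.9):
    """Run tabular TD(0) on the fixed chain and return the learned state values."""
    # The terminal state is never updated, so its value stays at 0.
    value = {state: 0.0 for state in STATES}
    for _ in range(episodes):
        # Walk the chain one transition at a time.
        for current, nxt in zip(STATES, STATES[1:]):
            reward = REWARD_ON_ENTERING[nxt]
            # TD error: how much better things look than was predicted.
            td_error = reward + gamma * value[nxt] - value[current]
            value[current] += alpha * td_error
    return value


if __name__ == "__main__":
    for state, v in run_td0().items():
        print(f"{state}: {v:.3f}")
```

Under these assumptions the learned value of "deal_signed" approaches the full final reward, and "negotiating" approaches the discounted version of it. In that sense, reaching the instrumental state works like a short-term reward even though nothing has been paid out yet.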

This would not involve preference falsification as long as it is clear whether something is being done for short-term or long-term benefit, and short-term benefits aren’t being totally overwritten by long-term benefits. This is similar to Eliezer’s point about the drowning child, except extended across time instead of space.

> But I don’t have a clear picture of exactly why internal negotiation requires less falsification than external pressure

There are two layers where falsification could occur: internal and external. For the external layer we can see the mechanisms better: two different people can perceive the same facts about the society they live in, which is harder to do for mental facts. So that seems like a more natural place to start correcting errors, although correcting internal errors is also necessary to some degree, and will use some tools in common with correcting external errors.

Incoherence of a person across time is often related to that person being externally influenced, e.g. trying to comply with whoever they’re talking with at the time and therefore expressing different values at different times.