I was reading Towards Safe and Honest AI Agents with Neural Self-Other Overlap, and I noticed a problem with it.
It also penalizes realizing that other people want different things than you, forcing an overlap between (things I like) and (things you will like). This cuts two ways. First, the model will be forced to reason as if it likes what you like, which is a positive. But it will also likely overlap (you know what is best for you) with (I know what is best for me), which might lead to stubbornness. Worse, it could also get (I know what is best for you) overlapping both, which could be really bad. (You know what is best for me) is actually fine, though, since that is basically the basis of corrigibility.
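To make the concern concrete, here is a minimal sketch of roughly how I understand an SOO-style penalty to work (the toy model, the prompt pairing, and the `soo_weight` term are my assumptions, not the paper's actual code). The point is that the loss is blind to *why* self- and other-referencing activations differ, so genuine preference differences get penalized along with deceptive ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a policy network (assumption, not the paper's architecture).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

def soo_loss(self_batch: torch.Tensor, other_batch: torch.Tensor) -> torch.Tensor:
    # Run the same network on matched self-/other-referencing inputs and
    # penalize any distance between the resulting internal activations.
    h_self = model(self_batch)
    h_other = model(other_batch)
    return F.mse_loss(h_self, h_other)

# The key point: representing "you prefer tea, I prefer coffee" produces
# different activations and is pushed down just as hard as deception would be.
task_loss = torch.tensor(0.0)  # placeholder for the ordinary training loss
soo_weight = 0.1               # assumed mixing coefficient
self_x, other_x = torch.randn(8, 16), torch.randn(8, 16)
total_loss = task_loss + soo_weight * soo_loss(self_x, other_x)
total_loss.backward()
```

Under this reading, the gradient pulls the (things you will like) representation toward the (things I like) one regardless of whether the divergence reflects deception or an accurate model of differing values.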
We still need to be concerned that the model will reason symmetrically but assign to us values different from the ones we actually have, and thus exhibit patronizing but misaligned behaviour.