I think this is a mechanism that actually happens a lot. People generally lose a lot of empathy with experience and age, and they definitely get desensitized to both strongly negative and strongly positive experiences after repeated exposure. I actually think this is more likely than the RL story, especially for positive-valence empathy, which under the RL story people would be driven to seek out.
But then every time that empathic signal fires, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal: there was a reward prediction, but the reward never arrived. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?
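To make the worry concrete, here is a minimal sketch of that unlearning dynamic as a delta-rule update on a scalar reward prediction. All names (`update`, the learning rate, the initial prediction) are illustrative choices of mine, not anything from a specific neuroscience model:

```python
# Hypothetical sketch: an empathy cue triggers a reward prediction,
# but no reward actually arrives, so repeated prediction-error
# updates drive the prediction toward zero.

def update(prediction: float, actual_reward: float, lr: float = 0.1) -> float:
    """Return the new prediction after one prediction-error update."""
    delta = actual_reward - prediction  # reward prediction error (RPE)
    return prediction + lr * delta

pred = 1.0  # initial empathic reward prediction (arbitrary scale)
for _ in range(50):
    pred = update(pred, actual_reward=0.0)  # reward never arrives
# After many unrewarded firings the prediction has decayed to near zero,
# i.e. the empathic response has been trained away.
assert pred < 0.01
```

This is just the standard observation that an unreinforced prediction decays geometrically (by a factor of `1 - lr` per step), which is why some extra mechanism seems needed for the empathic signal to survive.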
My main model for why this doesn’t happen in some circumstances (but definitely not all) is that the brain uses these signals and has a mechanism for actually providing positive or negative reward when they fire, depending on other learnt or innate algorithms. For instance, you could pass the RPE through to some other region that detects whether the empathy triggered for a friend or an enemy, and then return either positive or negative reward, thereby implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low-level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.
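The friend/enemy gating idea above can be sketched in a few lines. This is purely a toy illustration under my own assumptions (a signed scalar empathic signal, a boolean friend flag); the function name and interface are hypothetical:

```python
# Hypothetical sketch: an empathic signal about someone else's outcome
# (positive = good for them, negative = bad for them) is routed through
# a valence module that gates it on the social relationship.

def social_reward(empathic_signal: float, is_friend: bool) -> float:
    """Gate an empathic RPE by relationship.

    Friends: same sign (shared happiness / shared sadness).
    Enemies: sign flipped (schadenfreude at their misfortune,
    displeasure at their success).
    """
    return empathic_signal if is_friend else -empathic_signal

assert social_reward(1.0, is_friend=True) == 1.0    # friend's joy feels good
assert social_reward(-1.0, is_friend=False) == 1.0  # enemy's misfortune feels good
```

The point of the sketch is that the raw empathic signal itself stays fixed; only the downstream sign (and, in a richer version, magnitude) is learned, which is the sense in which more complex social emotions can be built by reward shaping on a shared low-level substrate.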
Also, I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one, lol. Cool that we are thinking at least somewhat along similar lines.
Yep, this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists, it might provide some kind of base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.
Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just figuring out how to prevent that much optimization pressure from being applied without imposing too high a capabilities tax.
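For what it's worth, the breakdown pattern can be shown with a toy example of my own construction (the specific functions are arbitrary): a proxy that tracks the true objective over the typical range, but diverges from it when pushed hard.

```python
# Toy Goodhart-style illustration (my construction, not from the
# discussion above): the proxy is locally correlated with the true
# objective, so mild optimization of the proxy helps, but strong
# optimization of the proxy makes the true objective worse.

def true_value(x: float) -> float:
    return x - x**2 / 10  # true objective, peaks at x = 5

def proxy(x: float) -> float:
    return x  # unbounded proxy, correlated with true_value for small x

# Mild optimization pressure on the proxy also improves the true objective...
assert true_value(4) > true_value(1)
# ...but extreme proxy values make the true objective sharply worse.
assert true_value(20) < true_value(4)
```

The "capabilities tax" framing then amounts to asking how far you can let the optimizer push `x` before the correlation between proxy and true objective gives out.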