Hypothesis about how social stuff works and arises
(I can’t be bothered to write a real Serious Post, so I’m just going to write this like a tumblr post. y’all are tryhards with writing and it’s boooooring, and also I have a lot of tangentially related stuff to say. Pls critique based on content. If something is unclear, quote it and ask for clarification)
Alright so, this is intended to be an explicit description that, hopefully, could be turned into an actual program, that would generate the same low-level behavior as the way social stuff arises from brains. Any divergence is a mistake, and should be called out and corrected. it is not intended to be a fake framework. it’s either actually a description of parts of the causal graph that are above a threshold level of impact, or it’s wrong. It’s hopefully also a good framework. I’m pretty sure it’s wrong in important ways, I’d like to hear what people suggest to improve it.
Recommended knowledge: vague understanding of what’s known about how the cortex sheet implements fast inference/how “system 1” works, how human reward works, etc, and/or how ANNs work, how reinforcement learning works, etc.
The hope is that the computational model would generate social stuff we actually see, as high-probability special cases—in semi-technical terms you can ignore if you want, I’m hopeful it’s a good causal/generative model, aka that it allows compressing common social patterns with at least somewhat accurate causal graphs.
So we’re making an executable model of part of the brain, so I’m going to write it as a series of changes I’m going to make. (I’m uncomfortable with the structured-ness of this, if anyone has any ideas for how to generalize it, that would be helpful.)
To start our brain thingy off, add direct preferences: experiences our new brain wants to have. Make negative things much worse than good things, maybe around 5x.
From the inside, this is an experience that in-the-moment is enjoyable/satisfying/juicy/fun/rewarding/attractive to you/thrilling/etc etc. Basic stuff like drinking water, having snuggles, being accepted, etc—preferences that are nature and not nurture.
From the outside, this is something like the experience producing dopamine/serotonin/endorphin/oxytocin/etc in, like, a young child or something—ie, it’s natively rewarding.
In the implementable form of this model, our reinforcement learner needs a state-reward function.
Social sort of exists here, but only in the form that if an agent can give something you want, such as snuggles, then you want that interaction.
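As a minimal sketch, the level-1 state-reward function could look like this, with the ~5x negativity multiplier baked in. The experience names and base weights here are made-up illustrations, not claims about actual reward magnitudes:

```python
# Hypothetical level-1 reward function. Experience names and base
# weights are illustrative assumptions only.
BASE_REWARDS = {
    "drink_water": 1.0,
    "snuggles": 2.0,
    "accepted": 1.5,
    "pain": -1.0,
    "rejected": -1.5,
}

NEGATIVITY_MULTIPLIER = 5.0  # negative experiences count ~5x as much

def reward(experience: str) -> float:
    r = BASE_REWARDS.get(experience, 0.0)  # neutral if unlisted
    return r * NEGATIVITY_MULTIPLIER if r < 0 else r
```

So `reward("pain")` comes out at -5.0 against `reward("snuggles")` at 2.0, which is the asymmetry doing the work.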
Then, make the direct preferences update by pulling the rewards back through time.
From the inside, this is the experience of things that lead to rewarding things becoming rewarding themselves—operant conditioning and preferences that come from nurture, eg complex flavor preferences, room layout preferences, preferences for stability, preferences for hygiene being easy, etc.
From the outside, this is how dopamine release and such happens when a stimulus is presented that indicates an increase in future reward
In the implementable form of this model, this is any temporal difference learning technique, such as Q-learning
Social exists more here, in that our agent learns which agents reliably produce experiences that are level-1 preferred vs dispreferred. If there’s a level-1 boring/dragging/painful/etc thing another agent does, it might result in an update towards lower probability of good interactions with that agent in that context. If there’s a level-1 fun/good/satisfying/etc thing another agent does, it might result in an update towards that agent being good to interact with in that context and maybe in others.
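A sketch of level 2 as bog-standard tabular Q-learning—the states and actions ("alone", "approach_A", etc) are made up for illustration, and the point is just that the snuggle reward propagates back onto the approach action:

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch of "pull rewards back through time".
# All state/action names are hypothetical.
ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(float)  # (state, action) -> estimated long-run reward

def td_update(state, action, reward, next_state, next_actions):
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Repeated good interactions with agent A: only the snuggle step is
# directly rewarding, but approaching A becomes valued in itself.
for _ in range(50):
    td_update("alone", "approach_A", 0.0, "with_A", ["snuggle"])
    td_update("with_A", "snuggle", 2.0, "alone", ["approach_A"])
```

After the loop, `Q[("alone", "approach_A")]` is solidly positive even though that transition never pays out directly—which is the level-2 "agents who reliably produce good experiences become attractive" effect.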
Then, modify preferences to deal with one-on-one interactions with other agents:
Add tracking of retribution for other agents
From the inside, this is feeling that you are your own person, getting angry if someone does something you don’t like, and becoming less angry if you feel that they’re actually sorry.
From the outside, this is people being quick to anger and not thinking things through before getting angry about Bad Things. something about the sympathetic nervous system as well. I’m less familiar with the neural implementation of anger.
To implement: Track retribution-worthiness of the other agent. Increase it if the other agent does something you consider retribution-worthy. Initialize what’s retribution-worthy to be “anything that hurts me”. Initialize retribution-worthiness of other agents to be zero. Decrease retribution-worthiness once retribution has been enacted and accepted as itself not retribution-worthy by the other agent.
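The retribution bookkeeping above, as a hedged sketch—function and variable names are all hypothetical:

```python
# Retribution-worthiness of other agents, initialized to zero.
retribution = {}  # agent -> accumulated retribution-worthiness

def observe_harm(agent, severity):
    # "anything that hurts me" is initially retribution-worthy
    retribution[agent] = retribution.get(agent, 0.0) + severity

def enact_retribution(agent, amount, accepted_by_other):
    # decrease only once retribution has been enacted AND accepted by
    # the other agent as itself not retribution-worthy
    if accepted_by_other:
        retribution[agent] = max(0.0, retribution.get(agent, 0.0) - amount)
```

The `accepted_by_other` flag is the load-bearing bit: retribution that the other agent resents doesn’t clear the ledger, which matches the feuds-escalate behavior you’d want the model to generate.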
Track deservingness/caring-for other agents. Keep “decreasing an agent’s deservingness” open as an option for how to enact retribution.
From the inside, this is the feeling that you want good for other people/urge to be fair. It is not the same thing as empathy.
From the outside, this is people naturally having moral systems.
To implement, have a world model that allows inferring other agents’ locations and preferences, and mix their preferences with yours a little, or something. correct implementation is safe ai
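One toy way to read “mix their preferences with yours a little”: total utility is your own utility plus other agents’ utilities weighted by their deservingness. The care weight is a made-up constant, and obviously this dodges the actually-hard part (inferring the utilities):

```python
# Deservingness-weighted preference mixing. CARE_WEIGHT and the
# (deservingness, utility) inputs are illustrative assumptions.
def mixed_utility(my_utility, others):
    # others: list of (deservingness in [0, 1], their_utility)
    CARE_WEIGHT = 0.2  # hypothetical "a little"
    return my_utility + CARE_WEIGHT * sum(d * u for d, u in others)
```

So an agent you care about at full deservingness contributes a fifth of their utility to your effective utility, and an agent at zero deservingness contributes nothing—which is the “decreasing deservingness as retribution” lever.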
Track physical power-over-the-world of you vs other agents
From the inside, this is the feeling that someone else is more powerful or that you are more powerful. (fixme: Also something about the impro thing goes here? how to integrate?)
From the outside, this is animals’ hardcoded tracking of threat/power signaling—I’d expect to find it at least in other mammals
To implement, hand-train a pattern matcher on [Threatening vs Nonthreatening] data, and provide this as a feature to reinforcement learning; also increase deservingness/decrease retributionworthiness for agents that have high power, because they are able to force this, so treat it as an acausal trade
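A sketch of that adjustment, assuming `power` is the output of the hand-trained threat matcher squashed into [0, 1]; the 0.5 trade-off weight is made up:

```python
# Power-based adjustment: concede good treatment to powerful agents up
# front, since they could force it anyway (the "acausal trade" move).
# The 0.5 weight and [0, 1] scales are illustrative assumptions.
def adjust_for_power(power, deservingness, retribution_worthiness):
    return (
        min(1.0, deservingness + 0.5 * power),           # goes up
        max(0.0, retribution_worthiness - 0.5 * power),  # goes down
    )
```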
Then, track other agents’ beliefs to iterate this over a social graph
Track other agents’ coalition-building power; update the power-over-the-world dominance based on an agent’s ability to build coalitions and harness other agents’ power.
From the inside, this is the feeling that someone else has a lot of friends/is popular, or that you have a lot of friends/are popular
Track other agents’ verbal trustworthiness, update your models on level 2 directly from trusted agents’ statements of fact
Track other agents’ retribution lists to form consensus on what is retribution-worthy; update what you treat as retribution-worthy off of what other agents will punish you for not punishing
Track other agents’ retribution status and deservingness among other agents, in case of coordinated punishment.
Predict agents’ Rewardingness, Retribution-worthiness, Deservingness, and Power based on any proxy signals you can get—try to update as fast as possible.
Implementation: I think all you need to do is add a world model capable of rolling-in modeling other agents modeling other agents etc as feelings, and then all of level 4 should naturally fall out of tracking stuff from earlier levels, but I’m not sure. For what I mean by rolling-in, see Unrolling social metacognition
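A very rough sketch of the rolling-in shape, with plain dicts standing in for real belief states—everything about this structure is an illustrative assumption:

```python
# Each agent carries a model of what every other agent tracks (the
# earlier levels' quantities), recursively to a finite depth.
def nested_model(agents, depth):
    if depth == 0:
        return {a: {"retribution": 0.0, "deservingness": 0.0, "power": 0.0}
                for a in agents}
    return {a: nested_model([b for b in agents if b != a], depth - 1)
            for a in agents}
```

At depth 1, `model["A"]["B"]` is A’s model of the level-1-through-3 quantities for B; deeper levels give you A’s model of B’s model of C, which is where the consensus-on-retribution and coordinated-punishment stuff would live.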
Things that seem like they’re missing to me
Greg pointed out that current artificial RL (ie, step 1) is missing something simple and important about the way reward works in the brain, but neither of us are quite sure what exactly it is.
Greg also pointed out that the way I’m thinking about power here doesn’t properly take into account the second to second impro thing
Greg thought there were interesting bits about how people do empathy that disagree really hard with the way I thought level 3 works
Lex had a bunch of interesting critiques I didn’t really understand well enough to use. I thiiink I might have integrated them at this point? not sure.
A bunch of people including me hate anything that has levels for being probably more complicated in terms of being organized structurally and simpler in terms of amount of detail than reality actually has. But I still feel like the levels thing is actually a pretty damn good representation. Suggestions welcome, callouts are not
This explanation sucks and people probably won’t get useful intuitions out of this the way I have from thinking about it a lot
misc interesting consequences
level 4 makes each of the other levels into partially-grounded Keynesian beauty contests—a thing from economics that was intended to model the stock market—which I think is where a lot of “status signaling” stuff comes from. But that doesn’t mean there isn’t a real beauty contest underneath.
level 2 means it’s not merely a single “emotional bank account” deciding whether people enjoy you—it’s a question of whether they predict you’ll be fun to be around, which they can keep doing even if you make a large mistake once.
level 3 Deservingness refers to how when people say “I like you but I don’t want to interact with you”, there is a meaningful prediction they’re making about their future behavior being positive towards you—they just won’t necessarily want to, like, hang out
Examples of things to analyze would be welcome, to exercise the model, whether the examples fit in it or not; I’ll share some more at some point, I have a bunch of notes to share.