Mental subagent implications for AI Safety

Meta: I wrote this draft a couple of years ago, but never managed to carry this line of thought through to a conclusion. I figured it would be better off in public, where it can get feedback.

Assume, for the sake of argument, that breaking a human agent down into distinct mental subagents is a useful and meaningful abstraction. The appearance of human agentic choice arises from the under-the-hood consensus among mental subagents. Each subagent is a product of some simple, specific past experience or human drive. Human behavior arises from the gestalt of subagent behavior. Pretend that subagents are ontologically real.

A decent first-pass definition of what a person genuinely wants: something they claim to want before getting it, and would predictably claim they actually wanted after getting it. Requiring both prospective and retrospective endorsement is a rough way of ensuring sufficient inter-subagent agreement.
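To make that two-sided check concrete, here is a minimal sketch. The predictor functions are hypothetical stand-ins for whatever model of the person the AI actually has; nothing here goes beyond the prose above.

```python
# Sketch of the prospective + retrospective endorsement test.
# `predicts_wanting_before` and `predicts_claiming_wanted_after` are hypothetical
# predictors standing in for the AI's model of the person.

def genuinely_wants(person, outcome,
                    predicts_wanting_before,
                    predicts_claiming_wanted_after) -> bool:
    """First pass: the person claims to want the outcome beforehand, and would
    predictably claim they actually wanted it afterwards."""
    return (predicts_wanting_before(person, outcome)
            and predicts_claiming_wanted_after(person, outcome))
```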

Some subagents are going to be dis-endorsed by almost every other relevant subagent. I think inside you there’s probably something like a two-year-old toddler who just wants everyone to immediately do what you say. There are probably some really sketchy parts in there, attached to childhood traumas or evolved instincts, that you overall just don’t endorse, and that it would be really bad for you to satisfy.

This basically implies that even a super-AI trying to “satisfy” all your “values” should actually probably ignore the part of you that is indistinguishable from Sauron. And maybe some other parts that are harder to describe.

So how does the super-AI determine which parts of you are Sauron and which parts may be kind of weird but actually deserve attention? Like, it would be really easy to accidentally sweep “sex” into the dustbin, because it’s basically a minefield of weird, selfish, animalistic behaviors. The Vulcans confine it to a narrow timeframe, etc. But nobody wants a future where the AI overlord has done us the favor of cutting sex out of our lives.

Status-seeking behavior, striving, and competitive impulses are other things that I can see being accidentally binned by some process trying to “optimize” for “what humans want”.

Again, the stance here is to break all human behavior down into fundamental subagents, or into fundamental goal-like or belief-like objects. If the super-AI-psychiatrist can fire a muon beam into your brain and actually see all of your subagents, which ones does it include in its model of “how to make this human happy, or at least not create a dystopia for them”, and which ones does it disregard?

My first thought is that the AI would have to do something like this: take every subagent A, check in with every other subagent B, and see whether B would feel good, bad, or indifferent about A getting what it wants. Sum it all up, maybe weighted according to something like “how much suffering, by some neurological metric, is actually generated in subagent B when subagent A is made happy?” And if some subagents get a very low score, then maybe ignore those.
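Here is a minimal sketch of that scoring pass, assuming (implausibly) that the AI can enumerate subagents and estimate per-pair reactions and suffering weights; every name below is hypothetical.

```python
from typing import Callable, Dict, List

def score_subagents(
    subagents: List[str],
    reaction: Callable[[str, str], float],  # reaction(b, a): how B feels about A being satisfied (+ good, 0 indifferent, - bad)
    weight: Callable[[str, str], float],    # weight(b, a): e.g. magnitude of suffering generated in B when A is made happy
) -> Dict[str, float]:
    """For each subagent A, sum every other subagent's weighted reaction to A getting what it wants."""
    return {
        a: sum(weight(b, a) * reaction(b, a) for b in subagents if b != a)
        for a in subagents
    }

def subagents_to_ignore(scores: Dict[str, float], threshold: float) -> List[str]:
    """Subagents whose total score falls below the threshold get disregarded."""
    return [a for a, s in scores.items() if s < threshold]
```

The threshold is doing a lot of work here; where to put it is exactly the Sauron-versus-merely-weird judgment call this post is worrying about.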

Another idea is that this is something that needs to be applied on an ongoing basis, since new subagents are continually being created or attenuated.

There’s another critically important phenomenon, which is that certain subagents can just evaporate when you shine a light on them. You may have some deep, subconscious, entirely unexamined belief about how the world is, how people are, how you are, and this belief governs a lot of your behavior. But you don’t see the belief. To you, this is just how the world is. And then maybe one day something happens in your life, or you have a great therapist, or you’re meditating, and you actually see this belief as the construct that it is, you see through it, and then it’s just gone. And it’s good that it’s gone; on some level, even the belief itself would dis-endorse its own existence if it had only had this new information earlier.

But I don’t want to bite the bullet and say “all subagents that dissolve when fed new, true information should be expunged”, because I feel like that might accidentally drive us insane. Like, imagine the AI telling us, “Good news: the actual truth is that you are meaningless in an important cosmic sense, here is a proof, and I am going to burn away all of your ‘meaning’ circuitry with a laser.” That’s not a good outcome.

We’re bordering on mindcrime scenarios here, but I think the AI should try to figure out which subagents you would probably prospectively and retrospectively endorse lasering out, if you knew what the AI knows (which you don’t, and it shouldn’t just tell you), and then gently guide you toward the kind of organic life-realization that would cause those harmful beliefs to be fixed.
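A rough sketch of that filter, with the “gentle organic guidance” part left out; every function here, including the counterfactually informed version of the person, is a hypothetical stand-in.

```python
# Hypothetical sketch: a subagent is a candidate for gentle, organic dissolution
# only if the counterfactually informed person would endorse its removal both
# prospectively and retrospectively.

def removal_candidates(person, subagents, with_ai_knowledge,
                       predicts_endorsing_removal_before,
                       predicts_endorsing_removal_after):
    # Counterfactual: the person as they would be if they knew what the AI knows.
    informed = with_ai_knowledge(person)
    return [
        a for a in subagents
        if predicts_endorsing_removal_before(informed, a)
        and predicts_endorsing_removal_after(informed, a)
    ]
```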

So if I had to summarize the overall idea in one line, it would be: nudge humans toward organic resolutions of the inner conflicts that can be fixed, try to “do therapy” on inner agents whose desires cause significant net overall suffering, and factor in the whole human’s verbal claims (or predicted verbal claims) of “yes, I actually want that” or “no, I don’t want that” as much as possible.
