My suggestion is that for practically everything you say in this post, you can say a closely analogous thing where you throw out the word “model” and just say “the human has lots of preferences, and those preferences don’t always agree with each other, especially OOD”.
Yes, I’m fine with this rephrasing. But I wouldn’t write a post using only the “the human has the preferences” way of speaking, because lots of different ways of thinking about the world use that same language.
This is basically a “subagent” perspective.
I think this post is pretty different from how people typically describe humans in terms of subagents, but it does contain that description.
Any physical system can have multiple descriptions of it; it doesn’t have to act like it’s made of subagents. (By “act like it’s made of subagents,” I include some things people do, like psych themselves up, or reward themselves for doing chores, or try to hide objects of temptation from themselves.) You can have several different models of a thermostat, for instance. Reconciling the different models of a thermostat might look a bit like bargaining between subagents, but if so these are atrophied male anglerfish subagents; they don’t model each other and bargain on their own behalf, they are just dumb inputs in a bigger, smarter process.
If we make a bunch of partial models of a human, some of these models are going to look like subagents, or drive subagenty behavior. But a lot of other ones are going to look like simple patterns, or bigger models that contain the subagent bargaining within themselves and hold aggregated preferences, or psychological models that are pretty complicated and interesting but don’t have anything to do with subagenty behavior.
And maybe a value learning AI would capture human subagenty behavior, not only in the models that contain subagent interactions as parts of themselves, but in the learned meta-preferences that determine how different models that we’d think of as human subagents get aggregated into one big story about what’s good. Such an AI might help humans psych themselves up, or reward them for doing chores.
But I’d bet that most of the preference aggregation work would look about as subagenty as aggregating the different models of a thermostat. In the trolley problem my “save people” and “don’t kill people” preferences don’t seem subagenty at all—I’m not about to work out some internal bargain where I push the lever in one direction for a while in exchange for pushing it the other way the rest of the time, for instance.
In short, even though I agree that in a vacuum you could call each model a “subagent,” what people normally think of when they hear that word is about a couple dozen entities, mostly distinct. And what’s going on in the picture I’m promoting here is more like 10^4 entities, mostly overlapping.
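To make the “many models that agree in-distribution but diverge OOD” idea concrete, here is a toy numerical sketch. It is entirely my own illustration, not anything from the discussion: the underlying “preference signal,” the training range, and the two polynomial “partial models” are all invented for the example. Two models fit to the same in-distribution data stay close to each other there, yet extrapolate to wildly different answers far outside the training range.

```python
import numpy as np

# Hypothetical setup: a "true" preference signal observed only in-distribution,
# here x in [0, 1]. Two partial models are fit to the same observations.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 500)   # in-distribution situations
y_train = np.sqrt(x_train)             # the invented underlying signal

linear_model = np.polyfit(x_train, y_train, 1)  # one partial model
cubic_model = np.polyfit(x_train, y_train, 3)   # another partial model

# In-distribution, the two models barely disagree.
x_in = np.linspace(0.0, 1.0, 100)
in_dist_gap = np.max(np.abs(np.polyval(linear_model, x_in)
                            - np.polyval(cubic_model, x_in)))

# Far out of distribution, the same two models give wildly different answers.
ood_gap = abs(np.polyval(linear_model, 25.0) - np.polyval(cubic_model, 25.0))

print(f"max disagreement in-distribution: {in_dist_gap:.2f}")  # small (< 0.3)
print(f"disagreement at x = 25 (OOD):     {ood_gap:.0f}")      # enormous
```

Neither model is “wrong” on the data it was fit to; the conflict only appears when you ask them about situations neither was trained on, which is the analogue of preferences that only come apart OOD.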
Hmm. I think you missed my point…

There are two different activities:

ACTIVITY A: Think about how an AI will form a model of what a human wants and is trying to do.

ACTIVITY B: Think about the gears underlying human intelligence and motivation.

You’re doing Activity A every day. I’m doing Activity B every day.

My comment was trying to say: “The people like you, doing Activity A, may talk about there being multiple models which tend to agree in-distribution but not OOD. Meanwhile, the people like me, doing Activity B, may talk about subagents. There’s a conceptual parallel between these two different discussions.”

And I think you thought I was saying: “We both agree that the real ultimate goal right now is Activity A. I’m leaving a comment that I think will help you engage in Activity A, because Activity A is the thing to do. And my comment is: (something about humans having subagents).”

Does that help?
This was a whole 2 weeks ago, so all I can say for sure is that I was at least ambiguous about your point.
But I feel like I kind of gave a reply anyway—I don’t think the parallel with subagents is very deep. But there’s a very strong parallel (or maybe not even a parallel, maybe this is just the thing I’m talking about) with generative modeling.