why can’t such a structure be the main determinant of my overall behavior?
Maybe it could be! Tons of things could determine which behaviors a mind ends up producing. But why would you expect this to happen under a particular training regime that isn't aiming for that specific outcome, or expect this to be gravitational in mindspace? Why is this natural?
My reply was intended as an argument against what seemed to be a central point of your post: that there is “inherent” difficulty with having coherence emerge in fuzzy systems like neural networks. Do you disagree that this was a central point of your post? Or do you disagree that my argument/example refutes it?
Giving a positive case for why it will happen is quite a different matter, which is what it appears like you’re asking for now.
I can try anyway, though. I think the question breaks into two parts:
1. Why will AIs/NNs have goals/values at all?
2. Granted that training imbues AIs with goals, why will AIs end up with a single consistent goal?
(I think there is an important third part, which is: "(1) and (2) establish that the AI can basically be modeled as maximizing a compact utility function, but why would that utility function be time-insensitive and scope-insensitive?" If that is an objection of yours, tell me and we can talk about it.)
I think (1) has a pretty succinct answer: “wanting things is an effective way of getting things” (and we’re training the AIs to get stuff). IABIED has a chapter dedicated to it. I suspect this is not something you’ll disagree with.
I think the answer to (2) is a little more complicated and harder to explain succinctly, because it depends on what you imagine "having goals, but not in a single consistent way" means. But basically, I think the fundamental reason (2) holds is that, almost no matter how you choose to think about it, a lack of coherence means the different parts are grinding against each other in some way, which is suboptimal from the perspective of all the constituent parts, and which can be avoided by coordination (or by one part killing off the other parts). And agents coordinating properly makes the whole system behave like a single agent.
I think this reasoning holds for all the ways humans are incoherent. I mean, specifying exactly how humans are incoherent is its own post, but a low-resolution way of thinking about it is that we have different values at different times and in different contexts. And with this framing, the above explanation clearly works.
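To make the "grinding against each other is costly" point concrete, here's a tiny toy model in Python (deliberately silly, made-up fees and contexts, not a claim about real minds): an agent whose preference flips with context keeps paying to swap back and forth, while any agent with a single fixed preference stops losing anything after at most one swap.

```python
# Toy sketch (made-up numbers): context-dependent preferences leak resources,
# while any fixed preference does not.

def run(preference, steps=10, fee=1.0):
    """preference: (context, holding) -> the good the agent wants right now."""
    holding, money = "bagel", 0.0
    for t in range(steps):
        context = "in bed at 00:00" if t % 2 else "daytime"  # alternating contexts
        wanted = preference(context, holding)
        if wanted != holding:          # pays a fee every time it swaps goods
            holding, money = wanted, money - fee
    return money

# Incoherent: wants whatever the current context makes salient.
incoherent = lambda context, holding: "phone" if context == "in bed at 00:00" else "bagel"
# Coherent: one fixed preference (doesn't matter which).
coherent = lambda context, holding: "bagel"

print(run(incoherent))  # -9.0: pays a fee on nearly every step, forever
print(run(coherent))    #  0.0: nothing left to lose after (at most) one swap
```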
Like, to give a very concrete example: right now I can clearly see that lying in bed at 00:00 browsing Twitter is stupid. But I know that if I lie down in bed and turn on my phone, what seems salient will change, and I very well might end up doing the thing that in this moment appears stupid to me. So what do I do? A week ago, I came up with a clever plan to leave my phone outside my room when I go to sleep, effectively erasing 00:00-twitter-william from existence muahahah!!
Another way of thinking about it: imagine inside my head there were two ferrets operating me like a robot. One wants to argue on lesswrong, the other wants to eat bagels. If they fight over stuff, like the lw-ferret making robot-me drop the box of 100 bagels they're carrying so it can argue on lesswrong for 5 minutes, or the bagel-ferret selling robot-me's phone for 10 bucks to buy 3 bagels, they're both clearly getting less than they could by cooperating. So they'd unite, and behave as something maximizing something like min(c_1 * bagels, c_2 * time on lesswrong).
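And here's the ferret story as a toy calculation (entirely made-up exchange rates and waste factors, just to illustrate the shape of the argument): under "fighting", each ferret's sabotage destroys part of both payoffs; after merging into one agent that maximizes min(c_1 * bagels, c_2 * time on lesswrong), both ferrets end up with at least as much as before.

```python
# Toy sketch of the ferret example (made-up numbers). The robot has 100 units
# of time to split between earning bagels and posting on lesswrong.

TIME = 100
c1, c2 = 1.0, 1.0  # how the merged agent weighs bagels vs. lesswrong time

def outcomes(time_on_bagels):
    bagels = 1.0 * time_on_bagels               # 1 bagel per unit of time
    lw_minutes = 1.0 * (TIME - time_on_bagels)  # 1 lesswrong minute per unit
    return bagels, lw_minutes

# Fighting: each ferret grabs control by sabotaging the other (dropped bagel
# boxes, a sold phone), so a chunk of both payoffs is simply destroyed.
fight_bagels, fight_lw = (x * 0.6 for x in outcomes(50))

# Merging: pick the single allocation maximizing min(c1 * bagels, c2 * lw_minutes).
best_t = max(range(TIME + 1),
             key=lambda t: min(c1 * outcomes(t)[0], c2 * outcomes(t)[1]))
merged_bagels, merged_lw = outcomes(best_t)

print("fighting:", fight_bagels, fight_lw)   # 30.0 bagels, 30.0 lesswrong minutes
print("merged:  ", merged_bagels, merged_lw) # 50.0 bagels, 50.0 lesswrong minutes
```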