Thanks for the reply. I’m not a philosopher, but it seems to me that most of these problems could be addressed after an AGI is built, if the AGI is corrigible. Which problems can you make the strongest case for as problems which we can’t put off this way?
https://www.lesswrong.com/posts/M9iHzo2oFRKvdtRrM/reminder-morality-is-unsolved?commentId=bSoqdYNRGhqDLxpvM
Again, thanks for the reply.
Building a corrigible AGI has a lot of advantages. But one disadvantage is the “morality is scary” problem you mention in the linked comment. If there is a way to correct the AGI, who gets to decide when and how to correct it? Even if we get the right answers to all of the philosophical questions you’re talking about, and successfully program them into the AGI, the philosophical “unwashed masses” you fear could exert tremendous public pressure to use the corrigibility functionality and change those right answers into wrong ones.
Since corrigibility is so advantageous (including its ability to let us put off all of your tricky philosophical problems), it seems to me that we should think about the “morality is scary” problem so we can address what appears to be corrigibility’s only major downside. I suspect the “morality is scary” problem is more tractable than you assume. Here is one idea (I did a rot13 so people can think independently before reading my idea): Oevat rirelbar va gur jbeyq hc gb n uvtuyl qrirybcrq fgnaqneq bs yvivat. Qrirybc n grfg juvpu zrnfherf cuvybfbcuvpny pbzcrgrapr. Inyvqngr gur grfg ol rafhevat gung vg pbeerpgyl enax-beqref cuvybfbcuref ol pbzcrgrapr nppbeqvat gb 3eq-cnegl nffrffzragf. Pbaqhpg n tybony gnyrag frnepu sbe cuvybfbcuvpny gnyrag. Pbafgehpg na vibel gbjre sbe gur jvaaref bs gur gnyrag frnepu gb fghql cuvybfbcul naq cbaqre cuvybfbcuvpny dhrfgvbaf juvyr vfbyngrq sebz choyvp cerffher.
The “morality is scary” problem of corrigible AI is an interesting one. It seems tricky, at least to a first approximation, in that I basically don’t have an estimate of how much effort it would take to solve.
Your rot13 suggestion has the obvious corruption problem, but it also has a public-relations problem: I doubt the plan would be popular. However, I like where your head is at.
My own thinking on the subject is closely related to my “Outcome Influencing System (OIS)” concept. Most complete and concise summary here. I should write an explainer post, but haven’t gotten to it yet.
Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system we are concerned with ensuring the alignment of. It doesn’t really solve the problem, it just backs it up one matryoshka doll around the AI.
My suggestion is not supposed to be the final idea. It’s just supposed to be an improvement over what appears to be Wei Dai’s implicit idea, of having philosophers who have some connection to AGI labs solve these philosophical issues, and hardcode solutions in so they can’t be changed.
(Perhaps you could argue that Wei Dai’s implicit idea is better, because there’s only a chance that these philosophers will be listened to, and even then it will be in the distant future. Maybe those conditions keep philosophers honest. But we could replicate those conditions in my scenario as well: Randomly generate 20 different groups of philosophers, then later randomly choose 1 group to act on their conclusions, and only act on their conclusions after a 30-year delay.)
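The selection-and-delay mechanism in that parenthetical can be sketched as a toy protocol. This is purely illustrative: the class name, parameters, and structure are my own, not anything proposed in the thread beyond the "20 groups, pick 1, 30-year delay" idea itself.

```python
import random
from dataclasses import dataclass

@dataclass
class DeliberationPlan:
    """Toy model of the proposal: many independent groups deliberate,
    only one randomly chosen group's conclusions are acted on, and
    only after a fixed delay."""
    n_groups: int = 20
    delay_years: int = 30

    def select_group(self, rng: random.Random) -> int:
        # Each group deliberates without knowing whether it will be the
        # one chosen, which (per the argument above) may reduce the
        # incentive to posture rather than reason honestly.
        return rng.randrange(self.n_groups)

plan = DeliberationPlan()
chosen = plan.select_group(random.Random())
assert 0 <= chosen < plan.n_groups
```

The point of the sketch is just that the honesty-preserving conditions (uncertainty about selection, plus delay) are parameters that can be tuned, rather than features unique to the AGI-lab scenario.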
I’m not convinced they are the same problem, but I suppose it can’t hurt to check if ideas for the alignment problem might also work for the “morality is scary” problem.
I definitely like the directions you are exploring, and I agree they are improvements over the implicit AGI-lab-directed concept. That’s a useful thing to keep in mind, but so is what keeps them from being final ideas.
When viewed as OISs from a high level, they are the same problem. Misaligned OIS to misaligned OIS. But you are correct that many of the details change. The properties of one OIS are quite different from the properties of the other, and that does matter for analyzing and aligning them. I think that having a model that applies to both of them and makes the similarities and differences more explicit would be useful (my suggestion is my OIS model, but it’s entirely possible there are better ones).
It seems like considerations about how to “keep philosophers honest” are implicitly about how to ensure the alignment of a hypothetical socio-technical OIS. What do you think? Does that make sense at all, or does it seem more like a time-wasting distraction? I have to admit I’m uncomfortable with how stuck I have gotten on the idea that championing this concept is a useful thing for me to be doing.
I do think the alignment problem and the “morality is scary” problem have a lot in common. In my thinking about the alignment problem and the way it leaks into other problems, the model that emerged for me was that of OISs, which seem to generalize the part of the alignment problem I’m interested in focusing on to social institutions whose goals are moral in nature, and to how those institutions relate to the values of individual people.
+1
Glad you’re self-aware about this. I would focus less on championing the concept, and more on treating it as a hypothesis about a research approach which may or may not deliver benefits. I wouldn’t evangelize until you’ve got serious benefits to show, and show those benefits first (with the concept that delivered those benefits as more of a footnote).
I think the focus on “delivering benefits” is a good perspective. It’s complicated by my sense that a lot of the benefit of OIS is as an explanatory lens: when I discuss the things I’m focused on, I want to discuss them in terms of OIS, and not using OIS terminology seems to make the explanations more complicated. So in that regard, I guess I need to clearly define and demonstrate the explanatory benefit. But the “research approach” focus also seems like a good thing to keep in mind.
Thanks for your perspective 🙏