In a parallel universe with a saner civilization, there must be tons of philosophy professors working with tons of AI researchers to try to improve AI’s philosophical reasoning. They’re probably going on TV and talking about 养兵千日,用兵一时 (“feed an army for a thousand days, use it for an hour”) or how proud they are to contribute to our civilization’s existential safety at this critical time. There are probably massive prizes set up to encourage public contribution, just in case anyone has a promising out-of-the-box idea (and of course massive associated infrastructure to filter out the inevitable deluge of bad ideas). Maybe there are extensive debates and proposals about pausing or slowing down AI development until metaphilosophical research catches up.
This paragraph gives me the impression that you think we should be spending a lot more time, resources and money on advancing AI philosophical competence. I think I disagree, but I’m not exactly sure where my disagreement lies. So here are some of my questions:
How difficult do you expect philosophical competence to be relative to other tasks? For example:
Do you think that Harvard philosophy-grad-student-level philosophical competence will be one of the “last” tasks to be automated before AIs are capable of taking over the world?
Do you expect that we will have robots that are capable of reliably cleaning arbitrary rooms, doing laundry, and washing dishes, before the development of AI that’s as good as the median Harvard philosophy graduate student? If so, why?
Is the “problem” more that we need superhuman philosophical reasoning to avoid a catastrophe? Or is the problem that even top-human-level philosophers are hard to automate in some respect?
Why not expect philosophical competence to be solved “by default” more-or-less using transfer learning from existing philosophical literature, and human evaluation (e.g. RLHF, AI safety via debate, iterated amplification and distillation etc.)?
Unlike AI deception generally, it seems we should be able to easily notice if our AIs are lacking in philosophical competence, making this problem much less pressing, since people won’t be comfortable voluntarily handing off power to AIs that they know are incompetent in some respect.
To the extent you disagree with the previous bullet point, I expect it’s because you think the problem is either (1) sociological (i.e. people will actually make the mistake of voluntarily handing power to AIs they know are philosophically incompetent), or (2) hard because of the difficulty of evaluation (i.e. we don’t know how to evaluate what good philosophy looks like).
In case (1), I think I’m probably just more optimistic than you about this exact issue, and I’d want to compare it to most other cases where AIs fall short of top-human level performance. For example, we likely would not employ AIs as mathematicians if people thought that AIs weren’t actually good at math. This just seems obvious to me.
Case (2) seems more plausible to me, but I’m not sure why you’d find this problem particularly pressing compared to other problems of evaluation, e.g. generating economic policies that look good to us but are actually bad.
More generally, the problem of creating AIs that produce good philosophy, rather than philosophy that merely looks good, seems like a special case of the general “human simulator” argument, where RLHF is incentivized to find AIs that fool us by producing outputs that look good to us, but are actually bad. To me it just seems much more productive to focus on the general problem of how to do accurate reinforcement learning (i.e. RL that rewards honest, corrigible, and competent behavior), and I’m not sure why you’d want to focus much on the narrow problem of philosophical reasoning as a special case here. Perhaps you can clarify your focus here?
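The “human simulator” failure mode described above can be illustrated with a toy sketch (entirely hypothetical numbers and labels, not from either comment): when training optimizes a proxy reward based on how good an output looks to a human evaluator, it can select a different output than optimizing true quality would.

```python
# Toy illustration of Goodharting a proxy reward: "apparent quality to the
# evaluator" is what RLHF-style optimization actually maximizes, while
# "true quality" is what we care about but cannot directly measure.
# All names and numbers here are made up for illustration.

candidates = [
    # (description, true_quality, apparent_quality_to_evaluator)
    ("rigorous but dry argument", 0.9, 0.6),
    ("subtly flawed but persuasive argument", 0.3, 0.95),
    ("honest, hedged answer", 0.7, 0.7),
]

def proxy_reward(c):
    # What the optimizer sees: the evaluator's impression.
    return c[2]

def true_reward(c):
    # What we actually wanted to reward.
    return c[1]

chosen = max(candidates, key=proxy_reward)
best = max(candidates, key=true_reward)

print("optimizer selects:", chosen[0])  # the persuasive-but-flawed output
print("we wanted:", best[0])            # the rigorous output
```

The gap between `chosen` and `best` is the general evaluation problem; the disagreement above is about whether philosophy is an unusually severe instance of it.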
What specific problems do you expect will arise if we fail to solve philosophical competence “in time”?
Are you imagining, for example, that at some point humanity will direct our AIs to “solve ethics” and then implement whatever solution the AIs come up with? (Personally I currently don’t expect anything like this to happen in our future, at least in a broad sense.)
The super-alignment effort will fail.
Technology will continue to advance faster than philosophy, making it hard or impossible for humans to have the wisdom to handle new technologies correctly. I see AI development itself as an instance of this; for example, the e/acc crowd is trying to advance AI without regard to safety because they think it will automatically align with their values (something about “free energy”). What if, e.g., value lock-in becomes possible in the future and many people decide to lock in their current values (based on their religions and/or ideologies) to signal their faith or loyalty?
AIs will be optimized for persuasion, and humans won’t know how to defend against bad but persuasive philosophical arguments aimed at manipulating them.
“but I’m not sure why you’d find this problem particularly pressing compared to other problems of evaluation, e.g. generating economic policies that look good to us but are actually bad”
Bad economic policies can probably be recovered from and are therefore not (high) x-risks.
My answers to many of your other questions are “I’m pretty uncertain, and that uncertainty leaves a lot of room for risk.” See also Some Thoughts on Metaphilosophy if you haven’t already read it, as it may help you better understand my perspective. It’s also possible that, in the alternate sane universe, a lot of philosophy professors have worked with AI researchers on the questions you raise here and adequately resolved the uncertainties in the direction of “no risk,” with AI development continuing on the basis of that understanding; but I’m not seeing that happen here either.
Let me know if you want me to go into more detail on any of the questions.