[...] do people who for example proposed (C)IRL without acknowledging the difficulties you described in this post actually understand these difficulties and just want to write papers that show some sort of forward progress, or are they not aware of them?
As someone who has worked on IRL a little bit, my impression is that such algorithms are not intended to capture human value to its full extent, but rather to learn shorter-term instrumental preferences. Paul gives some arguments for such “narrow value learning” here. This scenario, where human abilities are augmented using AI assistants, falls under your AI/human ecosystem category. I don’t think many people view “moral philosophy” as a separate type of activity that differentially benefits less from augmentation. Rather, AI assistants are seen as helping with essentially all tasks, including analyzing the consequences of decisions that have potentially far-reaching impacts, deciding when to keep our options open, and engineering the next-generation AI/human system in a way that maintains alignment. I don’t think this sort of bootstrapping process is understood very well, though.
(I saw your comment several days ago but couldn’t reply until now. Apparently it was in some sort of moderation state.)
I don’t think many people view “moral philosophy” as a separate type of activity that differentially benefits less from augmentation.
This is what worries me though. It seems obvious to me that AI will augment some activities more than others, or earlier than others, and looking at the past, it seems that the activities that benefit most or earliest from AI augmentation are the ones we understand best in a computational or mathematical sense. For example, scientific computations, finding mathematical proofs, chess. I’d expect moral philosophy to be one of the last activities to benefit significantly from AI augmentation, since it seems really hard to understand what it is we’re doing when we’re trying to figure out our values, or even how to recognize a correct solution to this problem.
So in this approach we have to somehow build an efficient/competitive aligned system around a core agent who doesn’t know what their values are, doesn’t explicitly know how to find out what their values are, or worse, thinks they know but is just plain wrong. (The latter perhaps applies to the great majority of the world’s population.) I’d feel a lot better if people recognized this as a core difficulty, instead of brushing it away by assuming that moral philosophy won’t differentially benefit less from augmentation (if that is indeed what they’re doing). BTW, I think Paul does recognize this, but I’m talking about people outside of FHI/MIRI/LessWrong.
Do you think that we can consider this as its own problem, of technology outpacing philosophy, which we can evaluate separately from other aspects of AI risk? Or are these problems tied together in a critical way?
In the past people have argued that we needed to resolve a wide range of philosophical questions prior to constructing AI, because we would need to lock in answers to those questions at that point. I would like to push back against that view, while acknowledging that there may be object-level issues where we pay a cost because we lack philosophical understanding (e.g. how to trade off haste vs. extinction risk, how to deal with the possibility of strange physics, how to bargain effectively...). And I would further acknowledge that AI may have a differential effect on progress in physical technology vs. philosophy.
My current tentative view is that the total object-level cost from philosophical error is modest over the next subjective century. I also believe that you overestimate the differential effects of AI, but that’s also not very firm. If my view changed on these points it might make me more enthusiastic about philosophy or metaphilosophy as research projects.
I have a much stronger belief that we should treat metaphilosophy and AI control as separate problems, and in particular that these concerns about metaphilosophy should not significantly dampen my enthusiasm for my current approach to resolving control problems.
I agree with the sentiment that there are philosophical difficulties that AI needs to take into account, but that would very likely take far too long to formulate in advance. Simpler kinds of indirect normativity that involve prediction of uploads allow delaying that work until after AI.
So this issue doesn’t block all actionable work, as its straightforward form would suggest. There might be no need for the activities to be in this order in physical time. Instead it motivates work on the simpler kinds of indirect normativity that would allow such philosophical investigations to take place inside AI’s values. In particular, it motivates figuring out what kind of thing AI’s values are, in sufficient generality so that it would be able to represent the results of unexpected future philosophical progress.
If we could model humans as having well-defined values but irrational in predictable ways (e.g., due to computational constraints or having a limited repertoire of heuristics), then some variant of CIRL might be sufficient (along with solving certain other technical problems such as corrigibility and preventing bugs) for creating aligned AIs. I was (and still am) worried that some researchers think this is actually true, or by not mentioning further difficulties, give the wrong impression to policymakers and other researchers.
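To make the “well-defined values but predictably irrational” assumption concrete: IRL-style methods commonly model the human as Boltzmann-rational, i.e. choosing actions with probability proportional to exp(β · reward), where the inverse temperature β captures how noisy/irrational the human is. The toy sketch below (all names and numbers hypothetical, not from any particular CIRL paper) shows that if the irrationality model is correct, the underlying reward parameter can be recovered by maximum likelihood from noisy demonstrations:

```python
import math
import random

random.seed(0)

# Hypothetical toy setup: three actions, each with a known scalar feature.
# The human's true reward for an action is theta_true * feature, but the
# human is "predictably irrational": they choose Boltzmann-rationally
# with inverse temperature beta (lower beta = noisier choices).
features = [0.0, 1.0, 2.0]
theta_true, beta = 1.5, 2.0

def choice_probs(theta, beta, features):
    """Boltzmann-rational choice distribution over actions."""
    weights = [math.exp(beta * theta * f) for f in features]
    z = sum(weights)
    return [w / z for w in weights]

def sample_choice(probs):
    """Draw an action index from a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Simulate noisy human demonstrations.
true_probs = choice_probs(theta_true, beta, features)
demos = [sample_choice(true_probs) for _ in range(2000)]

# Recover theta by maximum likelihood over a coarse grid. Because the
# irrationality (beta) is part of the model, the choice noise does not
# systematically bias the inferred reward.
def log_likelihood(theta):
    probs = choice_probs(theta, beta, features)
    return sum(math.log(probs[a]) for a in demos)

grid = [i / 10 for i in range(0, 31)]
theta_hat = max(grid, key=log_likelihood)
```

The point of the sketch is what it takes for granted: a fixed, known irrationality model (here, Boltzmann noise with known β) and a reward that is well-defined in the first place. The worry in the paragraph above is precisely that neither assumption holds for humans trying to figure out their values.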
If you are already aware of the philosophical/metaphilosophical problems mentioned here, and have an approach that you think can work despite them, then it’s not my intention to dampen your enthusiasm. We may differ on how much expected value we think your approach can deliver, but I don’t really know another approach that you can more productively spend your time on.