IMO, you mention the main downside of aligning to virtues in the post: it gives AIs more leeway to make decisions about values. A core divergence from you is that, while we will need to defer to AIs eventually, I don’t expect breakthroughs in alignment massive enough to make it net-positive to give AIs that level of control over how values generalize. I tend to think corrigibility plus AI control is likely to be our first-best mainline AI safety plan, and some of the purported benefits are much less likely to occur than you think they are.
The other issue is that, while people do agree on virtues more than on consequentialist preferences, much of the reason people agree on virtues, both now and in the past, is at least consistent with two phenomena occurring, which I’d argue explain the super-majority of the effect in practice:
Technology has unbundled certain things that were bundled in the distant past, and has already taken virtues like honor from 80-99% of the population to ~0%. Still, across many different virtues, much of the option space harms or helps a lot of virtues by default, and it remains very difficult to engineer around the vast space of possible virtue disagreements between humans, because it’s hard to improve one virtue without improving another, and most of our virtues are in practice built out of valuing instrumental goods. But in a post-AGI future (for now I’ll assume the AI safety problem is solved), it becomes much easier to create goods that are valued differently by many OOMs across different virtues, which Tyler M. John explains well here.
To a large extent, humans need to live and work with other humans, and it’s not really possible for anyone, even the richest and most powerful, to simply ignore societal norms without paying heavy prices, even if only informal ones. Interactions repeat often enough, and enforcement is possible because humans require many logistical inputs that other humans can take away, so we can turn prisoner’s dilemmas into iterated prisoner’s dilemmas, stag hunts, or Schelling problems. (I ignore acausal trade/cooperation under decision theories like EDT/FDT/UDT because it relies on people having more impartial values than they actually have, and pure reciprocity motivations don’t work because humans can’t reason well about each other; yes, we’re surprisingly good at modeling each other given our compute constraints, but it’s nowhere near enough.) But post-AGI, humans will be able to choose to be independent of social constraints and pressures, meaning the forces for convergence on certain virtues will weaken a lot. Vladimir Nesov talks about that here.
IMO, the more plausible versions of value alignment/good futures look like moral trade, as in this short afterword to “What We Owe The Future”, or, earlier on, viatopia as discussed by William MacAskill here (conditional on solving alignment).