Thanks for this explanation; it definitely makes your position more understandable.
> and on top of that there is the abstract idea of “good”, saying you shouldn’t hurt the weak at all. And that idea is not necessitated by rational negotiation. It’s just a cultural artifact that we ended up with, I’m not sure how.
I can think of 2 ways:
1. It ended up there the same way that all the “nasty stuff” ended up in our culture, more or less randomly, e.g. through the kind of “morality as status game” talked about in Will Storr’s book, which I quote in Morality is Scary.
2. It ended up there via philosophical progress, because it’s actually correct in some sense.
If it’s 1, then I’m not sure why extrapolation and philosophy will pick out the “good” and leave the “nasty stuff”. It’s not clear to me why aligning to culture would be better than aligning to individuals in that case.
If it’s 2, then we don’t need to align with culture either—AIs aligned with individuals can rederive the “good” with competent philosophy.
Does this make sense?
> So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.
It seems clear that technical design or training choices can make a difference (but nobody is working on this). Consider the analogy with the US vs. Chinese education systems, where the US system seems to produce a lot more competence and/or interest in philosophy (relative to STEM) compared to the Chinese system. And comparing humans with LLMs, it sure seems like LLMs are on track to exceed (top) human level in STEM while being significantly less competent in philosophy.
Things I’m pretty sure about: that your possibility 1 is much more likely than 2. That extrapolation is more like resolving internal conflicts in a set of values, not making them change direction altogether. That the only way for a set of values to extrapolate to “good” is if its starting percentage of “good” is high enough to win out.
Things I believe, but with less confidence: that individual desires will often extrapolate to a pretty nasty kind of selfishness (“power corrupts”). That starting from culture also has lots of dangers (like the wokeness or religion that you’re worried about), but a lot of it has been selected in a good direction for a long time, precisely to counteract the selfishness of individuals. So the starting percentage of good in culture might be higher.