I think one argument running through a lot of the sequences is that the parts of “human values” which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as “moral questions”. Like, these examples from your comment below:
Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?
If an AGI is hung up on these sorts of questions, then we’ve already mostly-won. That’s already an AI which is unlikely to wipe out the human species as a side-effect of maximizing the number of paperclips in the universe. It’s already an AI which is unlikely to induce a heart attack in its user in hopes that the user falls onto the positive feedback button. It’s already an AI which is unlikely to flood a room in order to fill a cauldron with water.
The vast majority of human values are not things we typically think of as “moral questions”; they’re things which are so obvious that we usually don’t even think of them until they’re pointed out. But they’re still value judgements, and we can’t expect an AGI to share those value judgements by default. If we’re down to the sorts of things people usually think of as moral questions, then the vast majority of human values have already been solved.
Given that this is LW, and this was a major takeaway of the sequences (or at least it was for me), I’d guess that’s probably a fairly common background assumption.
I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
Anyhow, my point was more: You said “you get what you measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
We can’t mostly-win just by fine-tuning a language model to do moral discourse.
Uh… yeah, I agree with that statement, but I don’t really see how it’s relevant. If we tune a language model to do moral discourse, then won’t it be tuned to talk about things like Terri Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like “they said they want fusion power, but they probably also want it to not be turn-into-bomb-able”.
Or are you using “moral discourse” in a broader sense?
You said “you get what you measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
I disagree with the exact phrasing “fact of the matter for whether decisions are good or bad”; I’m not supposing there is any “fact of the matter”. It’s hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want.
Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.
I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
English sentences don’t have to hold up to optimization pressure; our AI designs do. If I say “I’m hungry for pizza after I work out”, you could reply “that doesn’t hold up to optimization pressure: I can imagine universes where you’re not hungry for pizza”. And, like… okay, but that misses the point? There’s an implicit notion here of “if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won.”
Perhaps this notion isn’t obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer.
Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say “this seems true in the main, although I can imagine situations where it’s not.” Maybe this is what you meant, in which case I agree.