We can’t mostly-win just by fine-tuning a language model to do moral discourse.
Uh… yeah, I agree with that statement, but I don’t really see how it’s relevant. If we tune a language model to do moral discourse, then won’t it be tuned to talk about things like Terri Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like “they said they want fusion power, but they probably also want it to not be turn-into-bomb-able”.
Or are you using “moral discourse” in a broader sense?
You said “you get what you can measure” is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said “you get what you measure” is a problem because humans can disagree when their values are ‘measured’ without either of them being mistaken or defective (therefore sandwiching is a Procrustean bed / wrong problem).
I disagree with the exact phrasing “fact of the matter for whether decisions are good or bad”; I’m not supposing there is any “fact of the matter”. It’s hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want.
Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.