My take is that if you gave an optimization process access to some handwritten acceptability criteria and searched for the nearest acceptable points to random starting points, you would get adversarial examples that violate unstated criteria. For the handwritten acceptability criteria to be useful, they can't be the mechanism by which the AI generates its ideas in the first place; they can only serve as a check on candidates generated some other way.
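Here's a toy sketch of that failure mode (the two-dimensional setup, the specific criteria, and all names are my own illustration, not anything from your scheme): a greedy search for the nearest point satisfying a written-down criterion reliably satisfies it while ignoring a criterion we forgot to write down.

```python
import numpy as np

# Toy "world": points in R^2. The *handwritten* acceptability criterion only
# checks the first coordinate; the *unstated* criterion (what we actually
# care about) also checks the second.

def handwritten_acceptable(x):
    # The written-down rule: x[0] must be close to 1.
    return abs(x[0] - 1.0) < 0.05

def unstated_acceptable(x):
    # The rule we never wrote down: x[1] must also be close to 1.
    return abs(x[1] - 1.0) < 0.05

def nearest_acceptable(start, step=0.01, max_iters=10_000):
    """Greedy local search: move as little as possible from the starting
    point until the handwritten criterion is satisfied."""
    x = start.copy()
    for _ in range(max_iters):
        if handwritten_acceptable(x):
            return x
        # Only move along the dimension the written criterion mentions --
        # that's the cheapest way to become "acceptable".
        x[0] += step * np.sign(1.0 - x[0])
    return x

rng = np.random.default_rng(0)
starts = rng.normal(size=(5, 2)) * 3.0
for s in starts:
    found = nearest_acceptable(s)
    print(handwritten_acceptable(found), unstated_acceptable(found))
# The search reliably satisfies the written rule and almost never the
# unstated one: "adversarial examples that violate unstated criteria".
```

The point of the sketch is just that optimizing *toward* the written criteria exploits whatever they leave out, which is why they only work as a filter on ideas produced by something else.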
So: what is the base level that we would find if we peeled away the value learning scheme that you lay out? Is it a very general, human-agnostic AI with some human-value constraints on top? Or would we peel away a layer that gets information from humans only to reveal another layer that gets information from humans (e.g., learning a "human distribution")?