RogerDearnaley comments on A Case for the Least Forgiving Take On Alignment

RogerDearnaley 5 Dec 2023 10:41 UTC
“We have to get the AI’s values exactly aligned with human values”
This is a major crux for me, and one of the primary reasons my P(DOOM) isn’t >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs a passable idea of what the words ‘human’ and ‘values’ mean and of what “human values” refers to, which even a small LLM has), and b) not kill everyone while it’s learning the details. Given those two conditions, it will do its research and converge on human values by Bayesian updating. (If it isn’t capable of being competently Bayesian enough to do that, it isn’t superhuman, at least at STEM.) So, if you use value learning, the only piece you need to get exactly right (for outer alignment) is the phrasing of the terminal goal that says “Use value learning”. For something containing an LLM, I think that might be about one short paragraph of text, possibly with one equation in it. The prospect of getting one paragraph of text and one equation right, with enough polishing and peer review, doesn’t actually seem that daunting to me.
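As a toy illustration of the “region of convergence” idea (not a model of actual value learning), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the five candidate hypotheses, the two priors, and the simulated yes/no observations stand in for “possible human values” and “research” only by loose analogy. It shows that a Bayesian learner with a poor prior still converges on the true hypothesis, provided the prior gives that hypothesis nonzero weight, while a prior that rules it out can never recover.

```python
import math
import random


def posterior(prior, hypotheses, observations):
    """Normalized Bayesian posterior over `hypotheses` given 0/1 observations.

    Each hypothesis h is a candidate probability that an observation is 1.
    Computed in log space to avoid underflow with many observations.
    """
    ones = sum(observations)
    zeros = len(observations) - ones
    log_post = []
    for p, h in zip(prior, hypotheses):
        if p == 0.0:
            # A hypothesis with zero prior weight can never gain any.
            log_post.append(-math.inf)
        else:
            log_post.append(
                math.log(p) + ones * math.log(h) + zeros * math.log(1.0 - h)
            )
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def map_estimate(prior, hypotheses, observations):
    """Return the hypothesis with the highest posterior probability."""
    post = posterior(prior, hypotheses, observations)
    return hypotheses[post.index(max(post))]


# Hypothetical setup: the "true" value is 0.7, one of five candidates.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
true_value = 0.7
rng = random.Random(0)
observations = [1 if rng.random() < true_value else 0 for _ in range(2000)]

# Inside the region of convergence: a bad prior, but nonzero on the truth.
rough_prior = [0.4, 0.3, 0.2, 0.05, 0.05]
# Outside it: the truth is ruled out from the start.
blind_prior = [0.5, 0.3, 0.2, 0.0, 0.0]

print(map_estimate(rough_prior, hypotheses, observations))
print(map_estimate(blind_prior, hypotheses, observations))
```

In this toy setting the starting prior only needs to be “passable”, in the sense of not excluding the truth; the evidence does the rest. Whether real human values admit anything like this clean a hypothesis space is exactly the part the sketch assumes away.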