Jacob_Hilton comments on Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton 19 Jan 2022 21:18 UTC
LW: 1 AF: 1
one concrete thing I might hope for you to do...
I think this is included in what I intended by “adversarial training”: we’d try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
- Charlie Steiner 19 Jan 2022 22:56 UTC
  LW: 2 AF: 1
  Sure—another way of phrasing what I’m saying is that I’m not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.
  It would be bad if we built an AI that wasn’t robust on the training distribution, of course, but I think of this as a problem the field of ML is already addressing, without any need to look ahead to AGI.