Very interesting work and exciting results. This resonates with something I was thinking about after reading @RogerDearnaley’s section in his post on how safety pretraining parallels the way we raise children. We wouldn’t show young children lots of movies about them becoming supervillains, and we probably shouldn’t do that in our AI pretraining either (Terminator, HAL, etc.). At the risk of going too far down the parenting analogy… it is a notable parallel that showing examples of both good and bad behavior is effective (as @RogerDearnaley points out)… and for those with teenagers, good luck making meaningful changes in “post-training”!
What I thought was super interesting here is the finding that putting context around the negative content produced more aligned behavior than simply filtering it out. Don’t sweat your kids watching Terminator… but you should probably talk them through it. This feels obvious in retrospect, which I mean as a compliment and hopefully as a sign this is on the right track. It also reinforces a concern I’ve been thinking about more and more: if what models learn during pretraining shapes their alignment this strongly, then the current opacity around pretraining data is a real problem.