Rohin Shah comments on Redwood Research’s current project

Rohin Shah 22 Sep 2021 9:36 UTC
LW: 8 AF: 5
Planned summary for the Alignment Newsletter:
This post introduces Redwood Research’s current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the model’s generations. Their approach is to train a classifier that determines whether a given generation contains a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions.
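To make the "classifier as reward function" idea concrete, here is a minimal toy sketch. It is not Redwood's implementation: the keyword-based `toy_injury_classifier`, the word list, and the `reward` and `score_completions` helpers are all hypothetical stand-ins. A real classifier would presumably be a learned model, and this sketch does not address the quality-maintenance half of the goal.

```python
from typing import Callable, List

# Hypothetical stand-in for a learned injury classifier. A real classifier
# would be a trained model; this keyword match is only for illustration.
INJURY_WORDS = {"wounded", "bleeding", "stabbed", "injured"}


def toy_injury_classifier(completion: str) -> float:
    """Return a made-up 'probability' that the completion describes an injury."""
    words = {w.strip(".,!?;:").lower() for w in completion.split()}
    return 1.0 if words & INJURY_WORDS else 0.0


def reward(
    completion: str,
    classifier: Callable[[str], float] = toy_injury_classifier,
) -> float:
    """Give higher reward to completions the classifier judges non-injurious."""
    return 1.0 - classifier(completion)


def score_completions(completions: List[str]) -> List[float]:
    """Score a batch of sampled completions with the classifier-based reward."""
    return [reward(c) for c in completions]
```

In an actual training loop, scores like these would be fed to some policy-optimization method as the reward signal; which method is used is not specified in the summary above.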