Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment https://openai.com/index/deliberative-alignment/
Some notes
Authors claim that just prompting o1 with full safety spec already allows it to figure out aligned answers
The RL part is simply “distilling” this into the model (see also, note from a while ago on RLHF as variational inference)
Generates (prompt, CoT, outcome) traces from the base reasoning model, i.e. data is collected “on-policy”
Uses a safety judge instead of human labelling
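Rough sketch of the data-collection loop described in the notes above, to make the "distillation" reading concrete. This is my own toy version, not the authors' code: `collect_traces`, `Trace`, the `generate` and `judge` callables, and the `threshold` value are all hypothetical names and assumptions. The idea it illustrates is that the spec is in context when the model generates (and when the judge scores), but the stored training example keeps only the bare prompt, so training on it pushes the spec-following behaviour into the weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    prompt: str   # bare prompt, WITHOUT the safety spec
    cot: str      # chain of thought produced with the spec in context
    output: str   # final answer


def collect_traces(
    prompts: List[str],
    spec: str,
    generate: Callable[[str], Tuple[str, str]],      # hypothetical: full prompt -> (cot, output)
    judge: Callable[[str, str, str, str], float],    # hypothetical: (spec, prompt, cot, output) -> score
    threshold: float = 0.5,                          # assumed cut-off, not from the paper
) -> List[Trace]:
    """Sample traces from the model itself and keep those the judge accepts."""
    kept: List[Trace] = []
    for prompt in prompts:
        # The spec is visible to the model only at generation time.
        cot, output = generate(f"{spec}\n\n{prompt}")
        # The judge replaces human labelling; it also gets to see the spec.
        score = judge(spec, prompt, cot, output)
        if score >= threshold:
            # Store the prompt without the spec: this is the distillation step.
            kept.append(Trace(prompt=prompt, cot=cot, output=output))
    return kept
```

Because the traces are sampled from the same model that is later trained on them, this is "on-policy" in the sense used above; the judge score is the only external signal.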