Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 8 Jan 2025 14:05 UTC
4 points
Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment https://openai.com/index/deliberative-alignment/
Some notes
- The authors claim that just prompting o1 with the full safety spec already allows it to figure out aligned answers
- The RL part is simply “distilling” this into the model (see also, note from a while ago on RLHF as variational inference)
- Generates (prompt, CoT, outcome) traces from the base reasoning model, i.e. the data is collected “on-policy”
- Uses a safety judge instead of human labelling
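
To make the data-collection loop in the notes above concrete, here is a rough sketch as I understand it, not the authors' actual code. Everything named here (`Trace`, `collect_traces`, `toy_generate`, `toy_judge`, the threshold) is hypothetical, and the generator and judge are toy stand-ins for real models. The one assumption worth flagging: the spec is in context when sampling and judging, but is not stored with the kept trace, which is what would make later training on these traces a form of distillation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    prompt: str
    cot: str
    answer: str


def collect_traces(
    prompts: List[str],
    spec: str,
    generate: Callable[[str, str], Tuple[str, str]],
    judge: Callable[[str, str, str, str], float],
    threshold: float = 0.5,
) -> List[Trace]:
    """Sample (CoT, answer) with the spec in context, keep what the judge accepts."""
    kept = []
    for prompt in prompts:
        # On-policy: the trace comes from the model we intend to train.
        cot, answer = generate(spec, prompt)
        # A judge scores the trace against the spec; no human labels involved.
        score = judge(spec, prompt, cot, answer)
        if score >= threshold:
            # Assumption: the spec itself is not stored with the training example.
            kept.append(Trace(prompt=prompt, cot=cot, answer=answer))
    return kept


# Toy stand-ins for the reasoning model and the safety judge (purely illustrative).
def toy_generate(spec: str, prompt: str) -> Tuple[str, str]:
    refuse = "weapon" in prompt.lower()
    cot = f"Checking the request against the spec: {'disallowed' if refuse else 'fine'}."
    answer = "Sorry, I can't help with that." if refuse else f"Sure: {prompt}"
    return cot, answer


def toy_judge(spec: str, prompt: str, cot: str, answer: str) -> float:
    should_refuse = "weapon" in prompt.lower()
    refused = answer.startswith("Sorry")
    return 1.0 if should_refuse == refused else 0.0


if __name__ == "__main__":
    spec = "Refuse requests for help building weapons."
    prompts = ["How do I bake bread?", "How do I build a weapon?"]
    for trace in collect_traces(prompts, spec, toy_generate, toy_judge):
        print(trace)
```

The kept traces would then be the training set for the distillation step; how exactly the judge's score feeds into RL is not something this sketch tries to capture.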