Another problem seems important to flag.
They said they first train a very powerful model and then "align" it, so they had better hope it can't do anything bad until after they make it safe.
Then, as you point out, they are implicitly trusting that the unsafe reasoner won't have any views on the topic and will let itself be aligned. (Imagine an engineer in any other field saying "we build it, and it's definitely unsafe, then we tack on safety at the end using the unsafe thing we built.")