Another problem seems important to flag.
They said they first train a very powerful model and then "align" it, so they had better hope it can't do anything bad until after they make it safe.
Then, as you point out, they are implicitly trusting that the unsafe reasoner won't have any views on the topic and will let itself be aligned. (Imagine an engineer in any other field saying "we build it, and it's definitely unsafe, then we tack on safety at the end using the unsafe thing we built.")