Thanks for writing this post. I think the question of how to rule out risk once capability thresholds have been passed has generally been underdiscussed, despite probably being the hardest risk management question with Transformers. In a recent paper, we coined the term “assurance properties” for the research directions that are helpful for this particular problem.
Applying a similar type of thinking to other existing safety techniques, it seems to me that interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.
The second direction that I felt could plausibly bring a big Bayes factor was debate, although it was harder to think about because it’s still very early.
Otherwise, it seems to me that successes with techniques like RLHF / CAI / W2SG are unlikely to provide large Bayes factors.
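
To make "big Bayes factor" concrete, here is a minimal sketch of the odds-form update, taking the Bayes factor to be P(evidence | model is safe) / P(evidence | model is unsafe). The function name, the 50% prior, and the factors of 2 and 100 are all hypothetical numbers chosen for illustration, not estimates for any actual technique.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Update P(safe) using a Bayes factor.

    bayes_factor = P(evidence | safe) / P(evidence | unsafe)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical prior and Bayes factors, for illustration only.
prior_safe = 0.5
weak_update = posterior_probability(prior_safe, 2.0)      # small Bayes factor
strong_update = posterior_probability(prior_safe, 100.0)  # large Bayes factor

print(f"Bayes factor 2:   P(safe) = {weak_update:.3f}")
print(f"Bayes factor 100: P(safe) = {strong_update:.3f}")
```

Under these assumed numbers, evidence that is only twice as likely under "safe" as under "unsafe" leaves a lot of residual risk, whereas a factor of 100 moves the posterior close to 1. That gap is what separates a technique that can help rule out risk from one that cannot.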