simeon_c comments on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

simeon_c 12 Oct 2024 21:42 UTC
6 points
Thanks for writing this post. I think the question of how to rule out risk once capability thresholds have been crossed has generally been underdiscussed, despite it probably being the hardest risk management question with Transformers. In a recent paper, we coin the term “assurance properties” for the research directions that help with this particular problem.

Applying a similar type of thinking to other existing safety techniques, it seems to me that interpretability is one of the few current LLM safety directions that can get you a big Bayes factor.

The second direction that I felt could plausibly yield a big Bayes factor was debate, although it was harder to think about because it’s still very early.

Otherwise, it seemed to me that stuff like RLHF / CAI / W2SG successes are unlikely to provide large Bayes factors.
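
To make “big Bayes factor” concrete, here is a minimal sketch of how a Bayes factor moves a prior. The function name and all numbers are hypothetical and purely illustrative; they are not estimates for interpretability, debate, or any other technique.

```python
def posterior_probability(prior: float, bayes_factor: float) -> float:
    """Update P(H) given a Bayes factor K = P(E | H) / P(E | not H).

    Works in odds form: posterior odds = prior odds * K.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical prior that the model is not scheming (illustrative only).
prior_safe = 0.5

# A small Bayes factor barely moves the prior; a large one moves it a lot.
weak_update = posterior_probability(prior_safe, 2.0)      # evidence 2x likelier if safe
strong_update = posterior_probability(prior_safe, 100.0)  # evidence 100x likelier if safe

print(f"Bayes factor 2:   {weak_update:.3f}")
print(f"Bayes factor 100: {strong_update:.3f}")
```

Under these assumed numbers, evidence that is only twice as likely for a safe model leaves substantial residual risk, while evidence that is a hundred times as likely brings it down to about one percent.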