in particular I think safety cases around “concept bottlenecks” are very promising (for much the same reasons that safety cases around CoT are promising, but with potentially more scalability / robustness to situational awareness)
Thanks, appreciate it! Interested to hear if you have any particular tasks you’d want as part of the safety case (we are actively building out a dataset of tasks for evaluating interpretability assistants and are looking for ideas).
very excited about “end to end” interpretability.
Maybe too late, but here are some thoughts (TL;DR: out-of-distribution prompt-based stress tests, and maybe some fancy SDF stuff): https://www.lesswrong.com/posts/RQadLjnmBZtvg7p8W/on-meta-level-adversarial-evaluations-of-white-box-alignment