Very excited about “end to end” interpretability.
In particular, I think safety cases around “concept bottlenecks” are very promising (for much the same reasons that safety cases around CoT are promising, but with potentially more scalability / robustness to situational awareness).
Thanks, appreciate it! Interested if you have any particular tasks you’d want as part of the safety case (we are actively building out a dataset of tasks for evaluating interpretability assistants and looking for ideas).