Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes: is it because DLGN or DLGN-inspired methods seem unlikely to scale, because they won't be much more interpretable than traditional NNs, or for some other reason?