Thanks for this! I continue to be excited about the possibility of making large black-box models safer by breaking them down into smaller modular components and, where possible, making those components more interpretable distillations of the learned logic, like synthesized programs. Related post with further links in comments: https://www.lesswrong.com/posts/xERh9dkBkHLHp7Lg6/making-it-harder-for-an-agi-to-trick-us-with-stvs