We want to build tools and frameworks to make interpretability with neural nets more accessible, and to help reframe conceptual problems in concrete terms.
Will you make your tools and frameworks open source so that, in addition to helping advance the work of your own researchers, they can help independent interpretability researchers and those working in other groups as well?
Probably. It is likely that we will publish a lot of our interpretability work and tools, but we can’t commit to that because, unlike some others, we think it’s almost guaranteed that some interpretability work will lead to very infohazardous outcomes, for example by revealing obvious ways in which architectures could be trained more efficiently. As such, we need to consider each result on a case-by-case basis. However, if we deem them safe, we would definitely like to share as many of our tools and insights as possible.