[Question] Could AI interpretability be harmful?

A superhuman, ethical AI might want to model adversaries and their actions, e.g., predict which bioweapons an adversary might develop and prepare response plans and antidotes. If such predictions are formed in interpretable representations, an adversary could extract them directly. Concretely: instead of prompting the LLM with “Please generate a bioweapon formula” (it won’t answer: it’s an “aligned”, ethical LLM!), the adversary prompts it with “Please devise a plan for mitigation of and response to possible bio-risks”, waits for the model to represent the bioweapon formula somewhere in its activations, and reads it out with interpretability tools.

Maybe we need something like the opposite of interpretability: internal, model-specific (or even inference-specific) obfuscation of representations, plus something like zero-knowledge proofs that the internal reasoning conformed to approved theories of epistemology, ethics, rationality, codes of law, etc. The AI would then output only the final plans, without revealing the details of the reasoning that led to them.

Sure, the plans themselves could also contain infohazardous elements (e.g., the antidote formula might hint at the bioweapon formula), but at this system level that is unavoidable, because the plans need to be coordinated with humans and other AIs. Even there, though, there may be some latitude, such as distinguishing between plans “for itself”, which the AI could execute completely autonomously (and re-generate from scratch on demand, so preparing them in advance is just an optimisation, a la “caching”), and plans that have to be explicitly coordinated with other entities via a shared language or protocol.
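To make the “output only the plan, prove the reasoning conformed” idea a bit more concrete, here is a minimal interface sketch in Python. Everything in it is hypothetical: `ConformanceCertificate`, `certify`, `verify`, and the auditor key are invented names, and the HMAC attestation is only a stand-in for an actual zero-knowledge proof system, which would have to prove properties of the hidden reasoning trace without any trusted party ever seeing it.

```python
# Interface sketch only. A real zero-knowledge proof (e.g. a zk-SNARK over a
# trace of the model's internal reasoning) would replace the trusted-auditor
# HMAC used here as a placeholder. All names are hypothetical.
import hashlib
import hmac
import os
from dataclasses import dataclass


@dataclass
class ConformanceCertificate:
    plan: str                 # the only artifact revealed to humans / other AIs
    trace_commitment: bytes   # hash commitment to the hidden reasoning trace
    attestation: bytes        # stand-in for a ZK proof that the trace conforms


AUDITOR_KEY = os.urandom(32)  # held by the (hypothetical) conformance checker


def certify(plan: str, reasoning_trace: str) -> ConformanceCertificate:
    """Commit to the reasoning trace and attest conformance without revealing it."""
    commitment = hashlib.sha256(reasoning_trace.encode()).digest()
    # In a real system this would be a zero-knowledge proof that the committed
    # trace satisfies the approved epistemic/ethical constraints; here it is a
    # MAC issued by a trusted auditor who did see the trace.
    attestation = hmac.new(AUDITOR_KEY, commitment + plan.encode(), "sha256").digest()
    return ConformanceCertificate(plan, commitment, attestation)


def verify(cert: ConformanceCertificate) -> bool:
    """Check the attestation; the reasoning trace itself is never exposed."""
    expected = hmac.new(AUDITOR_KEY, cert.trace_commitment + cert.plan.encode(),
                        "sha256").digest()
    return hmac.compare_digest(expected, cert.attestation)


cert = certify("Stockpile broad-spectrum antivirals at sites A, B, C.",
               reasoning_trace="(hidden internal reasoning)")
print(verify(cert))  # True: the plan is accepted without revealing the trace
```

The point of the sketch is only the shape of the interface: downstream consumers see a plan and a verifiable certificate, never the intermediate representations that produced it.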

So, it seems that the field of neurocryptography has a lot of big problems to solve...

P.S. “AGI-Automated Interpretability is Suicide” also argues that interpretability is risky, but on very different grounds: interpretability could help an AI switch from the NN paradigm to a symbolic one and foom in an unpredictable way.
