All of these critiques focus on MI not being effective enough at its ultimate purposes: namely, interpretability, and secondarily (I guess) finding adversaries, and maybe something else?
Have you seriously thought through whether interpretability, finding adversaries, or some specific aspects or kinds of either could be net negative for safety overall? This is the question contemplated, for example, in “AGI-Automated Interpretability is Suicide”, “AI interpretability could be harmful?”, and “Why and When Interpretability Work is Dangerous”. However, I think none of the authors of these three posts is an expert in interpretability or adversaries, so it would be really interesting to see your thinking on this topic.