Thanks for reading! I’m especially interested in feedback from folks working on mechanistic interpretability or deception threat models. Does this framing feel complementary, orthogonal, or maybe just irrelevant to your current assumptions? Happy to be redirected if there are blind spots I’m missing.
Thanks for reading! I’m especially interested in feedback from folks working on mechanistic interpretability or deception threat models. Does this framing feel complementary, orthogonal, or maybe just irrelevant to your current assumptions? Happy to be redirected if there are blind spots I’m missing.