One issue that I think OpenAI didn’t convince me they had dealt with is that saying “neuron activations are well correlated with x” is different from being able to say what specifically a neuron does mechanistically. I think of this similarly to how I think of the limitations of picking max activating examples from a dataset or doing gradient methods to find high activations: finding the argmax of a function doesn’t necessarily tell you much about the function’s...well, functionality.
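To make the argmax point concrete, here's a toy sketch (hypothetical scalar functions standing in for neurons, not anything from OpenAI's actual setup): two "neurons" can share the exact same max-activating input while behaving completely differently everywhere else, so the top example alone underdetermines the mechanism.

```python
import numpy as np

# Two hypothetical "neurons" (scalar functions of a 1-D input) that
# share the same max-activating example but diverge away from it.
def neuron_a(x):
    # Peaks sharply at x = 0, near-zero almost everywhere else.
    return np.exp(-50 * x**2)

def neuron_b(x):
    # Also peaks at x = 0, but responds broadly across the input range.
    return 1.0 / (1.0 + x**2)

# A "dataset" of candidate inputs; pick the max-activating example for each.
xs = np.linspace(-3, 3, 601)
top_a = xs[np.argmax(neuron_a(xs))]
top_b = xs[np.argmax(neuron_b(xs))]
assert np.isclose(top_a, top_b)  # identical top example for both...

# ...yet their responses to an ordinary non-peak input differ hugely:
# neuron_a(1.0) is ~2e-22 while neuron_b(1.0) is 0.5.
print(neuron_a(1.0), neuron_b(1.0))
```

Looking only at the shared argmax (x = 0), you'd describe both neurons identically, even though one is a narrow detector and the other a broad one.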
This seems like it might run into a related obstacle. While this method could, e.g., make it easier to find a focus for mechanistic interpretability, I think the bulk of the hard work would still be ahead.
I suspect there would be ways to find high-activation examples that differ from our current examples, but I admit these techniques are unlikely to be quite as good as I’d like them to be.