[Question] In the context of AI interpretability, what exactly is a feature?

As I read more previous interpretability work, I’ve noticed a trend of implicitly defining a feature in a human-centric way. There seems to be a prior that networks will automatically learn features corresponding to how we process images and text, because… why exactly?

Chris Olah’s team at Anthropic thinks about features as “Something a large enough neural network would dedicate a neuron to”. That avoids the human-centric bias, but it just begs the question: what is a thing a large enough network would dedicate a neuron to? They admit that this is flawed, but say it’s their best current definition. This never felt like a good enough answer to me, even as a working definition.

I don’t really see the alternative engaged with. What if these features aren’t robust? What if these features don’t make sense from a human point of view? It feels like everyone is engaging with an alien brain and expecting it to process things in the same way we do.

Also, I’m confused about the Linear Representation Hypothesis. It makes sense for categorical features like gender or occupation, but what about quantitative features? Is there a single “length” direction? Multiple?
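To make the question concrete, here is a minimal toy sketch of what a linear representation would look like for one categorical and one quantitative feature. Everything in it is an assumption made for illustration: the activations are synthetic rather than taken from a real model, the names `is_plural`, `length`, `binary_dir`, and `length_dir` are hypothetical, and the encoding is linear by construction. It only shows that if a quantitative feature were encoded as "value times a fixed direction", a plain least-squares fit would recover that direction.

```python
import numpy as np


def recover_directions(seed=0, d=64, n=2000, noise=0.1):
    """Toy check: recover feature directions from linearly encoded activations."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground-truth unit directions for the two toy features.
    binary_dir = rng.normal(size=d)
    binary_dir /= np.linalg.norm(binary_dir)
    length_dir = rng.normal(size=d)
    length_dir /= np.linalg.norm(length_dir)

    is_plural = rng.integers(0, 2, size=n).astype(float)  # categorical, 0 or 1
    length = rng.uniform(1.0, 20.0, size=n)               # quantitative

    # Assumed linear encoding: each feature value scales its own direction.
    acts = (is_plural[:, None] * binary_dir
            + length[:, None] * length_dir
            + noise * rng.normal(size=(n, d)))

    # Least-squares fit of acts ≈ F @ D, where the rows of D are an offset
    # plus one estimated direction per feature.
    F = np.column_stack([np.ones(n), is_plural, length])
    D, *_ = np.linalg.lstsq(F, acts, rcond=None)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    return cosine(D[1], binary_dir), cosine(D[2], length_dir)


cos_binary, cos_length = recover_directions()
print(f"binary feature cosine: {cos_binary:.3f}")
print(f"length feature cosine: {cos_length:.3f}")
```

In this toy setup both cosines should come out close to 1, because the data was generated linearly. The open question is whether real networks do this: if length were stored nonlinearly (say, on a log scale) or spread across several directions, a single-direction fit like this would presumably look worse, which might be one way to test it.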

I hope there’s a paper or papers I’m missing, or maybe I’m blowing this out of proportion.
