Can’t we just look at the weights?
As I understand it, interpretability research isn’t exactly stuck, but it’s very, very far from anything like that, even for non-SotA models. And the gap is growing.