CRG

Karma: 134

Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC

149 points

CRG 27 Aug 2022 9:25 UTC
3 points
0
on: Taking the parameters which seem to matter and rotating them until they don’t
This is a great approach imo. I’ve tried something similar in transformers, using the singular vectors of the embedding matrix (the d_model x d_model matrix) to rotate the matrices connected to the residual stream. This seemed to induce sparsity in the weights close to the first layer, with the effect decreasing deeper into the model. I tried this with CLIP ViT-B and GPT-J, with the effect being a lot weaker in GPT-J. Also, some of the singular vectors of the embeddings were easily interpretable: the top component was related to raw token frequency, with interesting directions in GPT-J (religion vs. technology, positive vs. negative valence), and the top components of CLIP were color and frequency filters.
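A minimal NumPy sketch of the kind of rotation described above, on a toy residual block rather than a real model. The shapes, the random weights, and the names `E`, `W_in`, `W_out` and `block` are all assumptions for illustration, and LayerNorm is left out (its mean subtraction and elementwise gain are not rotation-invariant in general, so a real transformer needs extra care there). It only shows that rotating the residual basis by the embedding's right singular vectors leaves the computation unchanged; it doesn't reproduce the sparsity effect.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_hidden = 50, 8, 16  # toy sizes, purely illustrative

# Hypothetical stand-ins for real weights.
E = rng.normal(size=(vocab, d_model))         # token embedding
W_in = rng.normal(size=(d_model, d_hidden))   # reads from the residual stream
W_out = rng.normal(size=(d_hidden, d_model))  # writes to the residual stream

# Right singular vectors of the embedding: a d_model x d_model orthogonal matrix.
_, _, Vt = np.linalg.svd(E, full_matrices=False)
V = Vt.T

# Rotate everything that touches the residual stream into the new basis.
E_rot = E @ V
W_in_rot = V.T @ W_in
W_out_rot = W_out @ V

def block(resid, w_in, w_out):
    """One residual MLP block with ReLU and no LayerNorm."""
    return resid + np.maximum(resid @ w_in, 0.0) @ w_out

tokens = np.array([3, 7, 11])
out = block(E[tokens], W_in, W_out)
out_rot = block(E_rot[tokens], W_in_rot, W_out_rot)

# The rotated model's output should equal the original output expressed in the new basis.
max_err = float(np.abs(out @ V - out_rot).max())
```

In this basis the embedding becomes `U @ diag(S)`, so each residual coordinate lines up with one singular direction, which is what makes the individual directions inspectable.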

CRG 3 Aug 2022 21:20 UTC
4 points
0
in reply to: gwern’s comment on: chinchilla’s wild implications
WD is not really about regularisation nowadays, so it’s not surprising that it helps at all model sizes. Layernorm in transformers means WD mostly affects the effective LR of the weights: apart from the final linear layer, the absolute scale of the weights doesn’t matter, since you have a final LN. So the actual effect of WD is keeping the update/weight ratio bigger over training. (In fact, in normed nets you can replace WD with an exponentially increasing LR schedule.)
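A toy sketch of why weight scale turns into effective LR once the weights feed into a normalisation. The loss here is an assumed stand-in (it only sees `w / ||w||`, mimicking weights followed by a norm layer), and `grad`, `target`, `c` and `lr` are hypothetical names, not anything from a real training setup.

```python
import numpy as np

def grad(w, target):
    """Gradient of 0.5 * ||w/||w|| - target||^2 with respect to w."""
    norm = np.linalg.norm(w)
    u = w / norm
    g_u = u - target
    # Project out the radial component and divide by the norm (chain rule through w/||w||).
    return (g_u - u * (u @ g_u)) / norm

rng = np.random.default_rng(0)
w = rng.normal(size=5)
target = rng.normal(size=5)
c, lr = 3.0, 0.1

g_small = grad(w, target)
g_big = grad(c * w, target)  # same function value, c times larger weights

# One SGD step at each scale; the larger weights need lr * c**2 to move equally far.
step_small = w - lr * g_small
step_big = c * w - (lr * c**2) * g_big
```

Here `g_big` equals `g_small / c` and `step_big` equals `c * step_small`, so for a fixed LR the relative update shrinks like `1 / ||w||**2` as the norm grows. WD keeps the norm from growing, which is the sense in which it keeps the update/weight ratio up.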

CRG 18 Jul 2022 21:17 UTC
1 point
0
in reply to: gwern’s comment on: Forecasting ML Benchmarks in 2023
Yeah, it’s not really clear how to apply that specific kind of data pruning (straightforward for an image classifier) to the case of causally modelling text tokens in full context windows or any other dense task like that.