We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data. Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info where your assumptions in the toy model fall short.
This is valid; they’re not well fleshed out above. I’ll take a stab at it here below, and I discussed it a bit with Ryan below his comment. Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?
RE MLP activations distribution: Good idea! One reason I didn’t really want to make too many assumptions that were specific to MLPs was that we should in theory be able to apply sparse coding to residual stream activations too. But looking closely at the distribution that you’re trying to model is, generally speaking, a good idea :) We’ll probably do that for the next round of experiments if we continue along this avenue.
As Charlie Steiner pointed out, you are using a very favorable ratio of G/h in the toy model, i.e. of number of ground truth features to encoding dimension. I would expect you will mostly get antipodal pairs in that setup, rather than strongly interfering superposition. This may contribute significantly to the mismatch.
I hadn’t previously considered the importance of ‘strongly interfering’ superposition. But that’s clearly the right regime for real networks and probably does explain a lot about the mismatch. Thanks for highlighting this!
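To make the G/h point concrete, here is a minimal sketch of how overlap between feature directions grows as more features are packed into a fixed encoding dimension. It assumes ground truth features are random Gaussian directions, which may not match the data generation used in the post; `mean_max_interference` and the specific values of G and h are hypothetical choices for illustration only.

```python
import numpy as np

def mean_max_interference(num_features, dim, seed=0):
    """Mean over features of the largest |cosine similarity| with any other feature.

    Assumption: features are random Gaussian directions, normalised to unit length.
    """
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(num_features, dim))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    cos = np.abs(feats @ feats.T)   # pairwise |cosine similarity|, shape (G, G)
    np.fill_diagonal(cos, 0.0)      # ignore each feature's similarity with itself
    return cos.max(axis=1).mean()

h = 64  # encoding dimension (illustrative value)
interference = {G: mean_max_interference(G, h) for G in (64, 256, 1024)}
for G, value in interference.items():
    print(f"G/h = {G / h:>4.0f}: mean max |cos| = {value:.3f}")
```

Under these assumptions, the worst-case overlap each feature sees rises with G/h, which is one rough way to quantify how "strongly interfering" a given toy setup is.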
For the MMCS plots, we would be interested in seeing the distribution/histogram of MCS values. Especially for ~middling MCS values, where it’s not clear if all features are somewhat represented or some are a lot and some not at all.
Agree that this would be interesting! Trenton has had some ideas for metrics that better capture this notion, I think.
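A minimal sketch of the histogram idea, on synthetic data: it computes the per-feature max cosine similarity (MCS), the mean of those values (MMCS), and a histogram of the MCS distribution. The "learned" dictionary here is fabricated purely for illustration (half the features recovered with small noise, half random), and taking the max over learned elements for each ground truth feature is an assumption about the direction of the metric.

```python
import numpy as np

def max_cosine_similarities(ground_truth, learned):
    """For each ground truth feature (row), the max cosine similarity
    with any learned dictionary element (row)."""
    gt = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    ld = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    cos = gt @ ld.T            # shape (num_ground_truth, num_learned)
    return cos.max(axis=1)

rng = np.random.default_rng(0)
G, h = 64, 32                  # illustrative sizes, not the post's settings
ground_truth = rng.normal(size=(G, h))

# Hypothetical learned dictionary: first half recovered up to small noise,
# second half left as unrelated random directions.
learned = rng.normal(size=(G, h))
learned[: G // 2] = ground_truth[: G // 2] + 0.05 * rng.normal(size=(G // 2, h))

mcs = max_cosine_similarities(ground_truth, learned)
mmcs = mcs.mean()
counts, edges = np.histogram(mcs, bins=10, range=(0.0, 1.0))

print(f"MMCS = {mmcs:.3f}")
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * int(c)}")
```

In this setup the MMCS comes out middling even though no individual feature is "somewhat" represented: the histogram is bimodal, with one cluster near 1 and another much lower. That is the case a single averaged number can't distinguish from uniformly mediocre recovery.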
While we don’t think this has a big impact compared to the other potential mismatches between toy model and the MLP, we do wonder whether the model has the parameters/data/training steps it needs to develop superposition of clean features. For example, in the toy models report, Elhage et al. reported phase transitions of superposition over the course of training.
Undertrained autoencoders are something that worries me too, especially for experiments that use larger dictionaries (they take longer to converge). This is definitely something we’d want to ensure/study in the next phase.
Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?
I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It’s hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it’s trying to approximate or why you think that the true distribution has property XYZ.
Re MLPs: I agree that we ideally want something general, but it looks like your post is evidence that something about the assumptions is wrong and doesn’t transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions doesn’t hold there. If you have a toy model that better represents the true dist then you can confidently iterate on methods via the toy model.
Undertrained autoencoders
I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you’re thinking about that.
(ETA to the OC: the antipodal pairs wouldn’t happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you’d see that. I’m now less sure about this specific argument)