First, we could treat the internal activations of machine-learning models such as artificial neural networks as “givens” and try to condense them to yield more interpretable features.
I do think this is a good insight. It's not exactly new; SAEs already do this. But it's a fresh way of looking at it, and it suggests something: perhaps SAEs impose too much structure on the input, and instead we should just try to compress the latent stream, perhaps using diffusion or similar techniques.
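To make the contrast concrete, here's a minimal sketch (in PyTorch, with made-up sizes and random activations standing in for the residual stream; all names here are my own, not anyone's actual method): an SAE-style objective that bakes in a sparse, overcomplete structure, next to a plain bottleneck compressor that only asks for reconstruction. A diffusion-based version would replace the compressor's decoder with a learned denoising process, which I'm not sketching here.

```python
# Illustrative only: hypothetical sizes, random stand-in for real activations.
import torch
import torch.nn as nn

d_model, d_dict, d_code = 768, 8192, 64  # model width, SAE dictionary, bottleneck

class SAE(nn.Module):
    """Sparse autoencoder: overcomplete dictionary plus an L1 penalty,
    i.e. it imposes a particular (sparse, linear) structure on the activations."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def loss(self, x, l1_coeff=1e-3):
        f = torch.relu(self.enc(x))                 # sparse feature activations
        recon = self.dec(f)
        return ((recon - x) ** 2).mean() + l1_coeff * f.abs().mean()

class Compressor(nn.Module):
    """Plain compression: a narrow bottleneck and a reconstruction loss,
    with no sparsity prior -- 'just compress the latent stream'."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_model, d_code), nn.GELU())
        self.dec = nn.Linear(d_code, d_model)

    def loss(self, x):
        return ((self.dec(self.enc(x)) - x) ** 2).mean()

# Usage with fake activations, just to show both objectives run:
acts = torch.randn(256, d_model)
print(SAE().loss(acts).item(), Compressor().loss(acts).item())
```

The only real difference is which prior you pay for: the SAE buys interpretability by demanding sparse, overcomplete codes, while the compressor just asks for a smaller faithful representation and leaves the structure open.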