But it turns out that the resulting programs are also generally totally inscrutable to human inspection!
The paper “White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?” (from a team mostly at Berkeley) claims to have found a transformer-like architecture with the property that the optima of the SGD training process applied to it are themselves (unrolled) alternating optimization processes that optimize a known and understood information-theoretic metric. If that is correct, it would mean there is a mathematically tractable description of what models trained using this architecture are actually doing. So rather than just being enormous inscrutable black boxes made of tensors, their behavior could be reasoned about in closed form, in terms of what the resulting mesa-optimizer is actually optimizing. The authors also claim that they would expect the internals of models using this architecture to be particularly sparse, orthogonal, and thus interpretable. If true, both claims sound like they would have huge implications for the mathematical analysis of the safety of machine-learned models, and for Mechanistic Interpretability. Is anyone with the appropriate mathematical and Mech Interpretability skills looking at this paper, and at trained models using the variant of the transformer architecture that the authors describe?