I found working through the details of this very informative. For what it’s worth, I’ll share here a comment I made internally at Timaeus about it, which is that in some ways this factorisation into $M_1$ and $M_2$ reminds me of the factorisation in Ruan et al’s observational scaling laws paper into the map $m \mapsto S_m$ from a model to its capability vector (this being the analogue of $M_2$) and the map $S_m \mapsto \sigma^{-1}(E_m) = \beta^\top S_m + \alpha$ from capability vectors to downstream metrics (this being the analogue of $M_1$).
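To make the analogy concrete, here is a minimal sketch of that two-stage factorisation, roughly in the spirit of the observational scaling laws setup: compress each model's benchmark profile into a low-dimensional capability vector, then fit a linear map to the downstream metric in logit space. The synthetic data and the specific PCA/regression choices below are placeholders of my own, not a reproduction of Ruan et al's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Stand-in data: benchmark scores for a set of models (models x benchmarks)
# and a downstream metric E_m in (0, 1) for each model.
rng = np.random.default_rng(0)
benchmark_scores = rng.random((50, 8))
downstream_metric = rng.uniform(0.05, 0.95, 50)

# M2-analogue: m -> S_m, compress each model's benchmark profile
# into a low-dimensional capability vector.
pca = PCA(n_components=3)
S = pca.fit_transform(benchmark_scores)

# M1-analogue: S_m -> sigma^{-1}(E_m) = beta^T S_m + alpha,
# a linear fit in logit space.
logit_E = np.log(downstream_metric / (1 - downstream_metric))
reg = LinearRegression().fit(S, logit_E)

# Predicted downstream metric for a model, mapped back through the sigmoid link.
predicted_E = 1 / (1 + np.exp(-reg.predict(S[:1])))
```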
In your case the output metrics have an interesting twist, in that you don’t just want to predict performance but also, in some sense, variations of performance within a certain class (e.g. by varying the prompt), so it’s some kind of “stable” latent space of capabilities that you’re constructing.
Anyway, factoring the prediction of downstream performance/capabilities through some kind of latent space object I(M) in your case, or latent spaces of capabilities in Ruan et al’s case, seems like a principled way of thinking about the kind of object we want to put at the center of interpretability.
As an entertaining aside: to an algebraic geometer, the proliferation of $I_1(M), I_2(M), \ldots$, i.e. “interpretability objects”, sitting between models $M$ and downstream performance metrics is reminiscent of the proliferation of cohomology theories and the search for “motives” to unify them. That is basically interpretability for schemes!
Makes sense to me, thanks for the clarifications.