I’m a little sad that much of safety research has fully pivoted to post-hoc explanations of frontier Shoggoths. I think there’s probably low-hanging fruit in growing an easier-to-understand Shoggoth, even if it’s not with a simplex :).
I agree. I’m pretty new to the field and was surprised to see few recent attempts to build interpretable models from the ground up.
Natural, Axis-Aligned Bases. The basis vectors where a single element is 1 and the rest are 0 explicitly define our “corners” and correspond directly to “interpretable” points of our set. These are points where all other dimensions are “off”, and the only forward contribution comes from a single dimension. This also means that every element of the simplex is a convex (and hence linear) combination of the basis vectors.
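A minimal numpy sketch of that point (my own illustration, not from the post; the dimension `d` and weights are arbitrary): the axis-aligned basis vectors are the corners of the probability simplex, and any point in the simplex is just a convex combination of them.

```python
import numpy as np

d = 4  # illustrative simplex dimension (assumption, not from the post)

# Axis-aligned basis vectors: the "corners" of the probability simplex.
corners = np.eye(d)

# Any point of the simplex is a convex combination of those corners:
# non-negative weights that sum to 1.
weights = np.array([0.5, 0.2, 0.2, 0.1])
point = weights @ corners  # with the standard basis this is just `weights` itself

assert np.all(point >= 0) and np.isclose(point.sum(), 1.0)
print(point)  # [0.5 0.2 0.2 0.1]
```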
Would this mean that (assuming there are ways to design NN layers to be naturally restricted to the simplex) to interpret d types of behaviors, one would have to decide a priori what d is and train a model with a d-dimensional simplex?
I think this is roughly right. I think of it more as: a single layer would be a permutation, and composing these permutations gives you complex behaviors (that break down in these nice ways). As a starting point, having the hidden/model dimension equal to the input and output dimension would allow a “reasonable” first interpretation, namely that you are using convex combinations of your discrete vocabulary to compose behaviors and come up with a prediction for your output. Intermediate layers can then map directly to your vocab space (this won’t be true by default, though; you’d still need some sort of diagonalized prior or similar to make each basis element correspond to an input vocab token). A sketch of what simplex-preserving layers could look like is below.
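To make the “layers stay on the simplex” idea concrete, here is one possible construction (an assumption of mine, not the commenter’s spec): a row-stochastic weight matrix keeps simplex inputs on the simplex, with a permutation matrix as the special fully discrete case, and tying the hidden dimension to the vocab size lets intermediate activations be read as convex combinations of vocab tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5  # hidden/model dim tied to vocab size (illustrative assumption)

# Permutation layer: the fully "discrete", maximally interpretable case.
perm = np.eye(vocab_size)[rng.permutation(vocab_size)]

# More general simplex-preserving layer: each row is itself a point on the
# simplex, so the output is a convex combination of the rows.
stochastic = rng.dirichlet(np.ones(vocab_size), size=vocab_size)

x = rng.dirichlet(np.ones(vocab_size))  # a point on the simplex (a "soft token")

for layer in (perm, stochastic):
    y = x @ layer  # convex combination of the layer's rows, so y stays on the simplex
    assert np.all(y >= 0) and np.isclose(y.sum(), 1.0)
```

With the hidden dimension equal to the vocab size, `y` can be read directly as weights over input vocab tokens; as noted above, you would still need something like a diagonalized prior for each basis element to actually line up with a specific token.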