ML PhD, working on automating alignment research. Trying to be better about “just sending it”.
Ronak_Mehta
Karma: 38
Do you have a good estimate of what is and what will be possible with massively scaled-up inference-time compute over the next 3 months? 6 months? Are you thinking about how this will affect others’ priorities? Resource allocation? Governance and policy?
IMO having good answers to these questions feels super important for prioritizing where you spend your time.
I think this is roughly right. I think of it more as a single layer being a permutation, with composing these permutations giving you complex behaviors (that break down in these nice ways). As a starting point, having the hidden/model dimension equal to the input and output dimension would allow a “reasonable” first interpretation: you are using convex combinations of your discrete vocabulary to compose behaviors and come up with a prediction for your output. Intermediate layers could then be mapped directly to your vocab space (this won’t be true by default, though; you’d still need some sort of diagonalized prior or something to make each hidden basis direction correspond to an input vocab token).
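A toy NumPy sketch of that intuition, under the idealized assumptions above: hidden dim equals vocab size, the embedding matrix is the identity (the “diagonalized prior”, so each hidden coordinate corresponds to one vocab token), and each layer is literally a permutation. The names (`E`, `perm1`, `perm2`, `d`) are illustrative, not from any real model. Reading an intermediate state through the tied unembedding then decodes cleanly at every layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dim == vocab size (the assumption discussed above)

# Tied embedding/unembedding as the identity: each hidden basis
# direction corresponds to exactly one vocab token.
E = np.eye(d)

def softmax(x):
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# Each "layer" is a permutation of the hidden coordinates; stacking
# layers composes the permutations.
perm1 = rng.permutation(d)
perm2 = rng.permutation(d)

h = E[3]                       # embed token 3 (one-hot)
h = h[perm1]                   # layer 1 permutes the basis
probs_mid = softmax(h @ E.T)   # decode the *intermediate* state in vocab space
h = h[perm2]                   # layer 2 permutes again
probs_out = softmax(h @ E.T)   # final prediction over the vocabulary
```

Because the embedding is the identity, `probs_mid` and `probs_out` each concentrate on a single token, and the argmax token just tracks where the permutations sent coordinate 3. With a real learned embedding this clean per-layer readout breaks, which is exactly why some extra prior would be needed.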