“I is evaluated on utility for improving time-efficiency and accuracy in solving downstream tasks”
There seems to be a gap between this informal description and your pseudo-code: in the pseudo-code the parameters θ only parametrise the R&D agent M1, while M2 is distinct and presumed not to change. Reasoning from the pseudo-code, my first objection was that the execution agent can’t be completely static, because it somehow has to make use of whatever clever interpretability outputs the R&D agent comes up with (e.g. SAEs don’t use themselves to solve OOD detection or whatever). Then I wondered if you wanted to bound the complexity of M2 somewhere. Then I looked back and saw the formula I(x, t, I(M)), which seems to cleverly bypass this by having the R&D agent do both steps but factoring its representation of M.
However, this does seem different from the pseudo-code. If it is indeed different, which one do you intend?
Edit: no matter, I should just read more closely: evaluate(M2(t, I_M, I)) clearly takes I as input, so I think I’m not confused. I’ll leave this comment here as a monument to premature question-asking.
Later edit: ok, no, I’m still confused. It seems M1 doesn’t get used in your inner loop unless it is in fact I (which in the pseudo-code means just a part of what was called I in the preceding text). That is, when we update θ we update I for the next round. In that case things fit with your original formula, but having essentially factored I into two pieces (M2 on the outside, M1 on the inside), you are only allowing the inside piece M1 to vary over the course of this process. So I think my original question still stands.
So to check the intuition here: we factor the interpretability algorithm I into two pieces. The first piece never sees tasks and has to output some representation of the model M. The second piece never sees the model and has to, given that representation and some prediction task for the original model M, perform well across a sufficiently broad range of such tasks; it is penalised for computation time in this second piece. So overall the loss is supposed to motivate the following (sketched in code after the list):
1. Discovering the capabilities of the model, as operationalised by its performance on tasks, and also how that performance is affected by variations of those tasks (e.g. modifying the prompt, for your Shapley values example and for your elicitation example).
2. Representing those capabilities in a way that amortises the computational cost of mapping a given task onto this space of capabilities in order to make the above predictions (hence the computation time penalty on the second piece).
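To pin down my reading, here is a rough sketch of the loop I have in mind, in Python-flavoured pseudo-code. All the names (automated_interp_loop, build_representation, predict, lambda_compute, the evaluate/ground_truth hooks, and so on) are mine and purely illustrative, not your notation; treat it as a sketch of my interpretation rather than your actual pseudo-code.

```python
import time

def automated_interp_loop(M, M1, M2, tasks, evaluate, ground_truth,
                          num_rounds=10, lambda_compute=0.01):
    """Sketch of my reading of the scheme (illustrative names only).

    M1 (parametrised by theta): sees only the model M, never the tasks,
        and outputs a representation v (the thing called I(M) above).
    M2 (fixed): sees only (task, representation), never M itself, and is
        penalised for the computation time it spends on its predictions.
    """
    theta = M1.init_params()
    for _ in range(num_rounds):
        # First piece: build the representation without seeing any tasks.
        v = M1.build_representation(M, theta)

        total_loss = 0.0
        for t in tasks:
            start = time.perf_counter()
            prediction = M2.predict(t, v)   # second piece: sees tasks only
            elapsed = time.perf_counter() - start

            # Accuracy term plus a compute penalty on the second piece only,
            # which is what pushes v towards an amortised, cheaply queryable form.
            total_loss += evaluate(prediction, ground_truth(M, t))
            total_loss += lambda_compute * elapsed

        # Only theta (hence M1, hence v for the next round) is updated; M2 is static.
        theta = M1.update_params(theta, total_loss)
    return theta
```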
This is plausible for the same reason that the original model can have good general performance: there are general underlying skills or capabilities that can be assembled to perform well on a wide range of tasks, and if you can discover those capabilities and their structure you should be able to generalise to predict other task performance and how it varies.
Indirectly there is a kind of optimisation pressure on the complexity of I(M), just because you’re asking it to be broadly useful (to a computationally penalised M2) for prediction on many tasks, so by bounding the generalisation error you’re likely to bound the complexity of that representation.
I’m on board with that, but I think it is possible that some might agree this is a path towards automated research of something, but not that the something is interpretability. After all, your I(M) need not be interpretable in any straightforward way. So implicitly the space of θ’s you are searching over is constrained to something intrinsically reasonably interpretable?
Since later you say “human-led interpretability absorbing the scientific insights offered by I*”, I guess not, and your point is that there are many safety-relevant applications of I*(M) even if it is not very human-comprehensible.
Before concretely addressing the oversights you found, it is perhaps worth mentioning the intuitive picture motivating the pseudo-code. I wanted to make explicit the scientific process which happens between researchers: M1 plays the role of the methods researcher, M2 plays the role of the applications/datasets researcher. The pseudo-code is an attempt to write out crisply in what sense ‘good’ outputs from M1 can pass the test of time, standing up to more realistic, new applications and baselines developed by M2.
On to the details:
Thanks for working with my questionable notation here! Indeed, the uses of I were overloaded, and I have now (hopefully) clarified this by writing v_{I,M} for what was previously I(M). The type signatures I have in mind (sketched more concretely at the end of this reply) are that I is code (an interpretability method) and v_{I,M}, formerly I(M), is some efficiently queryable representation of M (circuits, SAE weights, …) useful for downstream tasks.
The sense in which M2 is static does seem important. In fact, I think M2 should have some access to M—it was an oversight that M does not appear as an input to M2. This was why I mentioned sample complexity in a footnote: it seems reasonable to give M2 limited query access to M. Thanks for catching this. Perhaps the scheme could work as originally written, where M2 does not have direct access to M, but I’m unsure; it seems too ‘static’, as you say.
Regarding the appropriateness of the term ‘interpretability’ to describe the target of this automation process: I agree, the output may not be an interp method in our current sense. Interpretability is the most appropriate term I could come up with. Two features seem important here: (1) white-box parsing of weights is central. (2) The resulting ‘interpretation’ v_{I,M} is usable by a fixed model M2, hence v_{I,M} must be efficiently interface-able without M2 having learned—in its weights—the structure of v_{I,M}.
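For concreteness, here is a rough sketch of the type signatures I have in mind, in Python-flavoured pseudo-code; the class and method names are purely illustrative, and the query-budget handle is just one way to encode M2’s limited access to M.

```python
from typing import Any, Callable, Protocol

Model = Any        # the subject model M, weights included
Task = Any         # a downstream task t
Query = Callable[[Any], Any]   # limited (budgeted) query access to M

class InterpMethod(Protocol):
    """I: code, i.e. an interpretability method produced by the R&D agent M1."""
    def run(self, M: Model) -> "Representation": ...

class Representation(Protocol):
    """v_{I,M}: an efficiently queryable artefact (circuits, SAE weights, ...)."""
    def query(self, q: Any) -> Any: ...

class ExecutionAgent(Protocol):
    """M2: fixed weights; must interface with a v_{I,M} whose structure it has
    not learned in-weights, plus limited query access to M itself."""
    def solve(self, t: Task, v: Representation, query_M: Query) -> Any: ...
```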
Makes sense to me, thanks for the clarifications.
I found working through the details of this very informative. For what it’s worth, I’ll share here a comment I made internally at Timaeus about it, which is that in some ways this factorisation into M1 and M2 reminds me of the factorisation in Ruan et al.’s observational scaling laws paper into the map m ↦ S_m from a model to its capability vector (the analogue of M1) and the map S_m ↦ σ^{-1}(E_m) = β^T S_m + α from capability vectors to downstream metrics (the analogue of M2).
In your case the output metrics have an interesting twist, in that you don’t want to just predict performance but also in some sense variations of performance within a certain class (by e.g. varying the prompt), so it’s some kind of “stable” latent space of capabilities that you’re constructing.
Anyway, factoring the prediction of downstream performance/capabilities through some kind of latent-space object, I(M) in your case or a latent space of capabilities in Ruan et al.’s case, seems like a principled way of thinking about the kind of object we want to put at the center of interpretability.
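Schematically, the parallel I have in mind is the following (the hatted performance prediction is just my notation for the output of the second piece):

```latex
% The two factorisations side by side (schematic):
\[
  M \;\xrightarrow{\;M_1\;}\; v_{I,M} \;\xrightarrow{\;M_2\;}\; \widehat{\mathrm{perf}}(M, t)
  \qquad \text{vs.} \qquad
  m \;\longmapsto\; S_m \;\longmapsto\; \sigma^{-1}(E_m) = \beta^{\top} S_m + \alpha .
\]
```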
As an entertaining aside: as an algebraic geometer, the proliferation of I_1(M), I_2(M), …, i.e. “interpretability objects” between models M and downstream performance metrics, reminds me of the proliferation of cohomology theories and the search for “motives” to unify them. That is basically interpretability for schemes!