Insofar as you’ve missed reality’s ontology, things will just look like a mess
Or your thing just won’t work. There’s a kind of trade-off there, I think?
DL works because it gives a lot of flexibility for defining internal ontologies, and for compute-efficiently traversing their space. However, it does so by giving up all guarantees that the result will be simple/neat/easy-to-understand in any given fixed external ontology (e.g., the human one).
To combat that, you can pick a feature that would provide some interpretability assistance, such as “sparsity” or “search over symbolic programs”, and push in that direction. But how hard do you push? (How big is the L1 penalty relative to other terms? Do you give your program-search process some freedom to learn neural-net modules for plugging into your symbolic programs?)
If you proceed with a light touch, you barely have any effect, and the result is essentially as messy as before.
If you turn the dial up very high, you strangle DL’s flexibility, and so end up with crippled systems. (Useful levels of sparsity make training 100x-1000x less compute-efficient; forget symbolic program search.)
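To make that dial concrete, here is a minimal NumPy sketch of a sparsity-penalized objective; `l1_coeff` and all the numbers are purely illustrative, not taken from any actual training setup:

```python
import numpy as np

def total_loss(weights, task_loss, l1_coeff):
    """task_loss: scalar from the usual objective; l1_coeff: how hard you push on sparsity."""
    sparsity_penalty = np.sum(np.abs(weights))  # L1 norm of the weights
    return task_loss + l1_coeff * sparsity_penalty

w = np.array([0.5, -0.01, 2.0])

# Light touch: the penalty barely registers next to the task loss,
# so training is essentially unconstrained (and just as messy).
light = total_loss(w, task_loss=1.0, l1_coeff=1e-4)

# Heavy hand: the penalty dominates, and gradient descent will chase
# zeros at the expense of actually solving the task.
heavy = total_loss(w, task_loss=1.0, l1_coeff=10.0)
```

The whole difficulty is that nothing in the setup tells you where between `1e-4` and `10.0` the useful-but-not-crippling region lies, or whether it exists at all.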
In theory, I do actually think you may be able to “play it by ear” well enough to hit upon some method where the system becomes usefully more interpretable without becoming utterly crippled. You can then study it, and perhaps learn something that would assist you in interpreting increasingly less-crippled systems. (This is why I’m still pretty interested in papers like these.)
But is there a proper way out? The catch is that your interventions only hurt performance if they hinder DL’s attempts to find the true ontology. On the other hand, if you yourself discover and incentivize/hard-code (some feature of) the true ontology, that may actually serve as an algorithmic improvement.[1] It would constrain the search space in a helpful way, or steer the training in the right direction, or serve as a good initialization prior… Thus making the system both more interpretable and more capable.
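As a toy illustration of what hard-coding (a feature of) the true ontology can buy you (my example, not one from the thread): if you know the data-generating process is translation-invariant, baking that in via convolutional weight sharing shrinks the hypothesis space enormously relative to an unconstrained dense layer, which is exactly the "constrain the search space in a helpful way" move:

```python
# Illustrative sizes only; the point is the ratio, not the numbers.
n_in, n_out, kernel = 1024, 1024, 9

# Dense layer: must learn translation invariance from data, if it learns it at all.
dense_params = n_in * n_out

# Shared convolutional filter: invariance is hard-coded into the architecture,
# so the search runs over a vastly smaller, better-structured space.
conv_params = kernel

ratio = dense_params / conv_params  # how much the prior shrinks the search space
```

When the prior matches reality (as translation invariance does for images), this is simultaneously an interpretability win and a capability win; when it doesn't, it's the "strangled flexibility" failure mode from above.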
Which is a boon in one way (will near-certainly be widely adopted; the “alignment tax” is negative), and a curse in another (beware the midpoint of that process, where you’re boosting capabilities without getting quite enough insight into models to ensure safety).
(Alternatively, you can try to come up with some Clever Plan where you’re setting up a search process that’s as flexible as DL but which somehow comes with a guarantee of converging to something simple in terms of your fixed external ontology. I personally think such ideas are brilliant and people should throw tons of funding at them.)
fwiw, I think the 100-1000x number is quite pessimistic, in that we didn’t try very hard to make our implementation efficient, we were entirely focused on making it work at all. while I think it’s unlikely our method will ever reach parity with frontier training methods, it doesn’t seem crazy that we could reduce the gap a lot.
and I think having something 100x behind the frontier (i.e., one GPT’s worth) is still super valuable for developing a theory of intelligence! like I claim it would be super valuable if aliens landed and gave us an interpretable GPT-4 or even GPT-3 without telling us how to make our own or scale it up.
Agreed. I think the most optimistic case is that peering at GPT-3/4’s interpreted form would make it extremely obvious how to train much more powerful models much more compute-efficiently by way of explicitly hard-coding high-level parts of their structure, thus simultaneously making them much more controllable/interpretable. (E.g., clean factorization into a world-model, a planner, and a goal slot, with obvious ways to scale up just the world-model while placing whatever we want into the goal slot. Pretty sure literally-this is too much to hope for, especially at GPT≤4’s level, but maybe something in that rough direction.)
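A toy sketch of that hoped-for factorization, with stand-in components (nothing here is a claim about how GPT-n is actually organized; the dynamics and planner are deliberately trivial):

```python
def world_model(state, action):
    # Toy dynamics: state is an integer, an action adds to it.
    # In the hoped-for scenario, this module alone is what you scale up.
    return state + action

def planner(state, goal, actions, horizon=3):
    # Greedy one-step lookahead against whatever sits in the goal slot.
    plan = []
    for _ in range(horizon):
        best = min(actions, key=lambda a: abs(goal - world_model(state, a)))
        plan.append(best)
        state = world_model(state, best)
    return plan, state

# "Placing whatever we want into the goal slot" is just passing a value:
plan, final = planner(state=0, goal=5, actions=[-1, 1, 2])
```

The controllability claim corresponds to the fact that swapping the goal requires changing one argument, not retraining anything; the scalability claim corresponds to `world_model` being a separable module.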
fwiw, I’m pessimistic that you will actually be able to make big compute efficiency improvements even by fully understanding gpt-n. or at least, for an equivalent amount of effort, you could have improved compute efficiency vastly more by just doing normal capabilities research. my general belief is that the kind of understanding you want for improving compute efficiency is at a different level of abstraction than the kind of understanding you want for getting a deep understanding of generalization properties.
[1] May. There are some caveats there.