leogao comments on Why Not Just Train For Interpretability?

leogao 25 Nov 2025 17:39 UTC
5 points
0
fwiw, I think the 100-1000x number is quite pessimistic, in that we didn’t try very hard to make our implementation efficient, we were entirely focused on making it work at all. while I think it’s unlikely our method will ever reach parity with frontier training methods, it doesn’t seem crazy that we could reduce the gap a lot.
and I think having something 100x behind the frontier (i.e. one GPT's worth) is still super valuable for developing a theory of intelligence! like I claim it would be super valuable if aliens landed and gave us an interpretable GPT-4 or even GPT-3 without telling us how to make our own or scale it up.
- Thane Ruthenis 25 Nov 2025 18:34 UTC
  3 points
  0
  Agreed. I think the most optimistic case is that peering at GPT-3/4’s interpreted form would make it extremely obvious how to train much more powerful models much more compute-efficiently by way of explicitly hard-coding high-level parts of their structure, thus simultaneously making them much more controllable/interpretable. (E.g., clean factorization into a world-model, a planner, and a goal slot, with obvious ways to scale up just the world-model while placing whatever we want into the goal slot. Pretty sure literally-this is too much to hope for, especially at GPT≤4’s level, but maybe something in that rough direction.)
  - leogao 25 Nov 2025 20:13 UTC
    2 points
    0
    fwiw, I’m pessimistic that you will actually be able to make big compute efficiency improvements even by fully understanding GPT-n. or at least, for an equivalent amount of effort, you could have improved compute efficiency vastly more by just doing normal capabilities research. my general belief is that the kind of understanding you want for improving compute efficiency is at a different level of abstraction than the kind of understanding you want for getting a deep understanding of generalization properties.