Agreed. I think the most optimistic case is that peering at GPT-3/4's interpreted form would make it extremely obvious how to train much more powerful models much more compute-efficiently by way of explicitly hard-coding high-level parts of their structure, thus simultaneously making them much more controllable/interpretable. (E.g., clean factorization into a world-model, a planner, and a goal slot, with obvious ways to scale up just the world-model while placing whatever we want into the goal slot. Pretty sure literally-this is too much to hope for, especially at GPT≤4's level, but maybe something in that rough direction.)
fwiw, I’m pessimistic that you will actually be able to make big compute efficiency improvements even by fully understanding gpt-n. or at least, for an equivalent amount of effort, you could have improved compute efficiency vastly more by just doing normal capabilities research. my general belief is that the kind of understanding you want for improving compute efficiency is at a different level of abstraction than the kind of understanding you want for getting a deep understanding of generalization properties.
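For concreteness, the "clean factorization" gestured at above can be sketched as a toy modular agent. This is purely illustrative: all names (`world_model`, `planner`, `goal`) are assumptions for the sketch, and nothing here is a claim about how GPT-like models are actually structured internally.

```python
# Toy sketch of the hypothetical factorization: a world-model, a planner,
# and a swappable "goal slot". Illustrative only; not a claim about real
# model internals.

from typing import Callable, List

State = int
Action = int  # here: move -1 or +1 on a number line


def world_model(state: State, action: Action) -> State:
    """Predicts the next state; the one component you'd 'scale up'."""
    return state + action


def planner(state: State, goal: Callable[[State], float],
            actions: List[Action], depth: int) -> Action:
    """Exhaustive lookahead over action sequences, scored by the goal slot."""
    def best_value(s: State, d: int) -> float:
        if d == 0:
            return goal(s)
        return max(best_value(world_model(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: best_value(world_model(state, a), depth - 1))


# Swapping what sits in the goal slot redirects behavior without touching
# the world-model or the planner:
seek_ten = lambda s: -abs(s - 10)
seek_zero = lambda s: -abs(s)

print(planner(0, seek_ten, [-1, 1], depth=3))   # moves toward 10
print(planner(5, seek_zero, [-1, 1], depth=3))  # moves toward 0
```

The point of the sketch is the interface boundary: the planner only queries the world-model and the goal, so (in this optimistic picture) each part could be inspected, scaled, or replaced independently.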