Developmental interpretability ≠ interpretability-over-time. Two years ago, we proposed “developmental interpretability,” a research agenda that applies singular learning theory (SLT) to study how neural networks learn. In the time since, the broader field of “interpretability-over-time” has grown, and our ambitions for “SLT-for-safety” have expanded beyond just understanding learning.
In response to these changes, I thought I’d write a quick clarification on where the current boundaries are:
“Interpretability-over-time” is about applying interpretability techniques throughout training to study how structure forms in models (see examples using crosscoders, circuit discovery tools, and behavioral signals). This is a field that predates Timaeus and doesn’t a priori require SLT.
We first coined the term “developmental interpretability” with a narrower, technical meaning in mind. The term has since drifted in practice, with many using it interchangeably with “interpretability-over-time.” For clarity about our own research agenda, we use “developmental interpretability” to refer to the following specific methodology:
SGD to Bayes: Model the SGD learning process (which is easy to implement but hard to describe theoretically) with an idealized Bayesian learning process (which is hard to implement but easy to describe theoretically).
Invoke SLT: Use SLT to derive predictions (based on the singular learning process) and to build measuring devices (“spectroscopy”) for studying this idealized process.
Back to SGD: Apply those predictions and tools in the original SGD setting to discover novel developmental phenomena and interpret them.
The singular learning process. Singular learning theory is a theory of Bayesian statistics that predicts that the learning process is organized by Bayesian phase transitions (aka “developmental stages”). The “novel developmental phenomena” we’re hoping to discover and interpret with SLT are precisely these phase transitions. We’ve now successfully applied this pipeline across a range of settings, including synthetic toy models (superposition, list sorting, and in-context linear regression), vision models, and language models.
Spectroscopy, in this context, refers to the broader toolkit of SLT-derived measuring devices, including LLCs, refined LLCs, susceptibilities, and Bayesian influence functions (which are really just another type of susceptibility). The name is borrowed from “spectroscopy” in statistical physics, which refers to the study of (electromagnetic) spectra emitted by a physical system to infer that system’s microscopic structure. It’s the same math, just different materials.
One of the reasons we started by focusing on development was practical. The LLC is a scalar. The LLC-over-time is a function. Studying development allowed us to extract much more information from a fixed coarse instrument. As our tools have improved, we’re able to use them in new contexts without always needing additional developmental information.
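To make "the LLC is a scalar" concrete, here is a minimal sketch of what the instrument measures on a one-dimensional toy loss. It assumes the commonly used estimator λ̂ = nβ (E[L(w)] − L(w*)), with the expectation taken over a localized, tempered posterior and β = 1/log n. The function name `llc_estimate`, its parameters, and the two toy losses are all hypothetical choices for illustration. The expectation is computed by brute-force quadrature on a grid, which only works in very low dimension; on real networks it would have to be approximated by sampling instead.

```python
import numpy as np


def llc_estimate(loss, w_star=0.0, n=10_000, gamma=1.0,
                 half_width=2.0, num_points=200_001):
    """Estimate the local learning coefficient of a 1D loss at w_star.

    Assumed estimator: lambda_hat = n * beta * (E[L(w)] - L(w_star)),
    where the expectation is over the localized, tempered posterior
        p(w) ∝ exp(-n * beta * L(w) - (gamma / 2) * (w - w_star)**2)
    and beta = 1 / log(n). All parameter names here are illustrative.
    """
    beta = 1.0 / np.log(n)
    # Uniform grid around w_star; the grid spacing cancels in the ratio below.
    w = np.linspace(w_star - half_width, w_star + half_width, num_points)
    losses = loss(w)
    log_p = -n * beta * losses - 0.5 * gamma * (w - w_star) ** 2
    weights = np.exp(log_p - log_p.max())  # subtract max for numerical stability
    expected_loss = np.sum(weights * losses) / np.sum(weights)
    return float(n * beta * (expected_loss - loss(w_star)))


# Two toy losses with the same minimizer (w = 0) but different degeneracy.
llc_regular = llc_estimate(lambda w: w ** 2)   # non-degenerate minimum
llc_singular = llc_estimate(lambda w: w ** 4)  # degenerate (flatter) minimum

print(f"LLC estimate for w^2: {llc_regular:.3f}")
print(f"LLC estimate for w^4: {llc_singular:.3f}")
```

The theoretical learning coefficients for these two losses are 1/2 and 1/4, and the estimates should land close to those values. Both losses reach the same minimum value at the same point, so the difference in the scalar comes entirely from how degenerate the minimum is. Recomputing a number like this at successive checkpoints is what turns the scalar into the function referred to above.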
Outlook. Our theory of change remains centered on development. But as the research has succeeded, it has become clear that the range of phenomena we can study with SLT is not limited to development. It has also become clearer that we can use these tools not just for passive interpretability but also for active control.