I have some objections to this, from the perspective of a doc I’m writing (and will possibly post in a few weeks). You seem to be using biology as a synonym for feature/circuit microscopy, but there are biologically motivated perspectives, like self-organised criticality or systems neuroscience, that use statistical-physics formalisms yet are primarily biological in nature. Likewise, physics is not only about smooth, universal regularities: phase transitions, renormalisation, and critical phenomena are central to modern physics, and they are violently non-smooth. That side of physics is almost completely absent from the piece, and overall I would say the rhetorical contrast isn’t as clear-cut as the article depicts.
I agree it would be nice if we could get a second law of thermodynamics for LLMs. But safety interventions are usually enacted locally (gradient nudges, RLHF reward shaping, inference-time steering), and a thermodynamic state variable à la “entropy of the latent field” is almost certainly too coarse to guarantee that the next token is non-harmful. I think you underplay a manipulability criterion: a variable is only valuable if you can steer it cheaply and predictably, which is why we might care about critical windows.
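To make the coarseness point concrete, here is a toy sketch (the distributions and the “harmful token” label are hypothetical, purely for illustration): Shannon entropy is permutation-invariant, so two next-token distributions can have exactly the same entropy while assigning wildly different probability to a harmful token.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Two hypothetical next-token distributions over 4 tokens,
# where index 0 stands for the "harmful" token.
safe   = [0.05, 0.85, 0.05, 0.05]  # harmful token unlikely
unsafe = [0.85, 0.05, 0.05, 0.05]  # harmful token dominant

# Same entropy (the distributions are permutations of each other),
# very different probability of harm:
print(entropy(safe), entropy(unsafe))  # identical values
print(safe[0], unsafe[0])              # 0.05 vs 0.85
```

Any macro-variable with this kind of symmetry cannot, by itself, certify the safety of the next token; you still need to know *which* token the mass sits on.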
Finally, I would add that the messiness is in some ways the point. I don’t picture misalignment as necessarily stemming from really neat simplicities; there’s a lot of risk in being insufficiently granular if we elevate only the cleanest “order parameters” and discard messy local details. I would guess alignment failures often don’t present as a scalar drifting past a threshold, but rather as narrow-band exploits or corner-case correlations at the same granularity as the messy feature and circuit probes you describe as a distraction. If you can jailbreak a frontier model with a one-sentence rhyme, then any interpretability story that averages over millions of parameters until it returns a single macro-variable is, by construction, blind to the event we need to prevent.
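The averaging argument can be sketched in a few lines (the numbers are invented for illustration, not drawn from any real model): a single narrow-band feature firing hard barely moves a mean over a million coordinates, while a feature-level probe sees it immediately.

```python
import statistics

# A toy "model state": a million micro-features, one spiky exploit direction.
n = 1_000_000
baseline = [0.0] * n
exploited = [0.0] * n
exploited[123_456] = 50.0  # a single narrow-band feature fires hard

# A macro-variable that averages over all features barely moves:
print(statistics.fmean(exploited) - statistics.fmean(baseline))  # 5e-05

# A feature-level probe detects the spike at full magnitude:
print(max(exploited) - max(baseline))  # 50.0
```

The exploit is six orders of magnitude larger at the feature level than at the macro level, which is exactly the regime where a threshold on the averaged variable never trips.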
Thank you for the comment; I appreciate this perspective. A couple of things to clarify:
As I quoted from “Emergence of brains”, I think the more traditional approach of biology is to embrace the complexity of living systems. In contrast, physics tends to seek more and more general and unifying principles (symmetries, order, curvature) that apply broadly across the universe, whether it’s matter or living systems, or even abstract phenomena like financial markets.
In the context of interpretability, I use this analogy to describe two (main) distinct approaches: one is self-proclaimed “biological”, and nowadays is mostly about breaking down models into features and circuits; the other is more emergentist, and seeks to find, for example, unifying principles in how models structure internal representations.
Again, it’s mostly an analogy, and both approaches are based on math (which is incidentally often similar to stat-mech math). And by “second law of thermodynamics” I did not mean it in a literal sense, but as an illustration of a principle which, without being very precise or practical, could provide a general guiding framework (e.g., perpetual motion is not possible).
To conclude, I’m not saying the messiness of features and circuits is not useful; I think it’s actually fascinating. But I gently push for more recognition of the alternative approaches, which I believe will become very illuminating.