Steerling-8B: The First Inherently Interpretable Language Model
This is probably worth a deeper discussion, but Guide Labs is claiming that their new model is “the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data”.
Reading the blog, Steerling is basically just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder before the LM head.
They also appear to apply a loss that aligns the SAE’s activations with labelled concepts (correct me if I’m wrong). However, this seems like an obvious example of The Most Forbidden Technique: it could make the model appear interpretable without the attributed concepts actually having a causal effect on the model’s decisions.
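For concreteness, here is a minimal sketch of how I read the architecture and the concept-alignment loss. Every name, shape, and design choice below is my guess from the blog post, not anything from their actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAEBottleneckHead(nn.Module):
    """My reading of Steerling's output path: final hidden states are
    re-expressed through a sparse autoencoder, and the LM head only
    sees the reconstruction. Hypothetical sketch, not their code."""

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)   # overcomplete concept dictionary
        self.decoder = nn.Linear(d_sae, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, h: torch.Tensor):
        z_pre = self.encoder(h)      # pre-activation concept logits
        z = F.relu(z_pre)            # sparse, nonnegative concept activations
        h_hat = self.decoder(z)      # reconstruction is all the LM head gets
        return self.lm_head(h_hat), z_pre


def concept_alignment_loss(z_pre: torch.Tensor,
                           concept_labels: torch.Tensor) -> torch.Tensor:
    """The part that worries me: supervise SAE latents against human
    concept labels (multi-hot, one latent per labelled concept, by
    assumption)."""
    return F.binary_cross_entropy_with_logits(z_pre, concept_labels)
```

If the training loss directly rewards latents for matching labels, a latent can learn to report a concept without the decoder path actually using it, which is exactly the appear-interpretable-without-causal-effect failure mode above.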
Can we get some input from interpretability folks? I’m obviously bearish.
Link to release post: https://www.guidelabs.ai/post/steerling-8b-base-model-release/
The Searchlight Institute recently released a survey of Americans’ views and usage of AI:
https://www.searchlightinstitute.org/research/americans-have-mixed-views-of-ai-and-an-appetite-for-regulation/
There is a lot of information, but the clearest takeaway is that the majority of those surveyed support AI regulation.
Another result that surprises (and concerns) me is this side note: a question that was interesting, but didn’t lead to a larger conclusion, asked what actually happens when you pose a question to a tool like ChatGPT. 45% of respondents think it looks up an exact answer in a database, and 21% think it follows a script of prewritten responses.
A recent paper probed LLMs and located both value features (representing the expected reward) and “dopamine” features (representing the reward prediction error). These features are embedded in sparse sets of neurons, and were found to be critical for reasoning performance.
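As a toy illustration of what probing for a value feature means (my own sketch, not the paper’s setup): fit a linear probe from hidden activations to a reward signal and check whether the weight mass concentrates on a sparse set of neurons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: (examples, neurons).
acts = rng.normal(size=(1000, 512))

# Plant the "expected reward" in a sparse set of value neurons.
value_neurons = [3, 41, 200]
reward = acts[:, value_neurons].sum(axis=1) + 0.1 * rng.normal(size=1000)

# Linear probe via least squares: which neurons predict reward?
coef, *_ = np.linalg.lstsq(acts, reward, rcond=None)
top = np.argsort(np.abs(coef))[::-1][:3]
print(sorted(top.tolist()))  # the planted value neurons dominate the probe
```

The paper’s claim, as I understand it, is the real-model analogue of this: the probe weights are sparse, and ablating those neurons hurts reasoning, which is what makes the features look causally load-bearing rather than just correlated.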
Could these findings have any implications for model welfare?
If a model had mechanisms for “feeling good and bad”, I imagine they would look similar to this.
The paper in question: https://arxiv.org/abs/2602.00986