Steerling-8B: The First Inherently Interpretable Language Model
This is probably worth a deeper discussion, but Guide Labs is claiming that their new model is “the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data”.
Reading the blog, Steerling is basically just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder before the LM head.
They also appear to apply a loss that aligns the SAE’s activations with labelled concepts (correct me if I’m wrong). However, this seems like an obvious example of The Most Forbidden Technique: it could make the model appear interpretable without the attributed concepts actually having a causal effect on the model’s decisions.
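For concreteness, here is a minimal sketch of how I read the architecture and the concept-alignment loss. Every name, shape, and design choice below is my guess from the blog post, not anything from their actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAEBottleneckHead(nn.Module):
    """My reading of Steerling's output path: final hidden states are
    re-expressed through a sparse autoencoder, and the LM head only
    sees the reconstruction. Hypothetical sketch, not their code."""

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)   # overcomplete concept dictionary
        self.decoder = nn.Linear(d_sae, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, h: torch.Tensor):
        z_pre = self.encoder(h)      # pre-activation concept logits
        z = F.relu(z_pre)            # sparse, nonnegative concept activations
        h_hat = self.decoder(z)      # reconstruction is all the LM head gets
        return self.lm_head(h_hat), z_pre


def concept_alignment_loss(z_pre: torch.Tensor,
                           concept_labels: torch.Tensor) -> torch.Tensor:
    """The part that worries me: supervise SAE latents against human
    concept labels (multi-hot, one latent per labelled concept, by
    assumption)."""
    return F.binary_cross_entropy_with_logits(z_pre, concept_labels)
```

If the training loss directly rewards latents for matching labels, a latent can learn to report a concept without the decoder path actually using it, which is exactly the appear-interpretable-without-causal-effect failure mode above.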
Can we get some input from interpretability folks? I’m obviously bearish.
Link to release post: https://www.guidelabs.ai/post/steerling-8b-base-model-release/
The Searchlight Institute recently released a survey of Americans’ views and usage of AI:
https://www.searchlightinstitute.org/research/americans-have-mixed-views-of-ai-and-an-appetite-for-regulation/
There is a lot of information, but the clearest takeaway is that the majority of those surveyed support AI regulation.
Another result that surprises (and concerns) me is this side note: a question that was interesting, but didn’t lead to a larger conclusion, asked what actually happens when you pose a question to a tool like ChatGPT. 45% of respondents think it looks up an exact answer in a database, and 21% think it follows a script of prewritten responses.
A recent paper probed LLMs and located both value features (representing the expected reward) and “dopamine” features (representing the reward prediction error). These features are embedded in sparse sets of neurons, and were found to be critical for reasoning performance.
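As a toy illustration of what probing for a value feature means (my own sketch, not the paper’s setup): fit a linear probe from hidden activations to a reward signal and check whether the weight mass concentrates on a sparse set of neurons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: (examples, neurons).
acts = rng.normal(size=(1000, 512))

# Plant the "expected reward" in a sparse set of value neurons.
value_neurons = [3, 41, 200]
reward = acts[:, value_neurons].sum(axis=1) + 0.1 * rng.normal(size=1000)

# Linear probe via least squares: which neurons predict reward?
coef, *_ = np.linalg.lstsq(acts, reward, rcond=None)
top = np.argsort(np.abs(coef))[::-1][:3]
print(sorted(top.tolist()))  # the planted value neurons dominate the probe
```

The paper’s claim, as I understand it, is the real-model analogue of this: the probe weights are sparse, and ablating those neurons hurts reasoning, which is what makes the features look causally load-bearing rather than just correlated.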
Could these findings have any implications for model welfare?
If a model had mechanisms for “feeling good and bad”, I imagine they would look similar to this.
The paper in question: https://arxiv.org/abs/2602.00986