ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 30 Oct 2025 16:38 UTC
20 points
0
In Improving the Welfare of AIs: A Nearcasted Proposal (from 2023), I proposed talking to AIs through their internals via things like ‘think about baseball to indicate YES and soccer to indicate NO’. Based on the recent paper from Anthropic on introspection, it seems like this level of cognitive control might now be possible:

Communicating to AIs via their internals could be useful for talking about welfare/deals because the internals weren’t ever trained against, potentially bypassing strong heuristics learned from training and also making it easier to convince AIs they are genuinely in a very different situation than training. (Via e.g. reading their answers back to them which a typical API user wouldn’t be able to do.) There might be other better ways to convince AIs they are actually in some kind of welfare/deals discussion TBC.

That said, LLMs may not be a single coherant entity (even within a single context) and talking via internals might be like talking with a different entity than the entity (or entities) that control outputs. Talking via internals would be still be worthwhile, but wouldn’t get at something that’s more “true”.

If AIs can coherantly answer questions by manipulating their internals, this is also just a bunch of evidence about AIs having some sort of interesting inner world.

To be clear, it seems pretty likely to me that current AIs are still below the level of sophistication of internal control where this works, but communicating to AIs in this way might soon work.

What I originally said about this in my blog post:

we’d try to communicate via asking the AI to manipulate its internals in a way we can observe. In particular, I often imagine ‘think about baseball to indicate YES and soccer to indicate NO’ and then we’d look to see which neurons fire. Rather than using neurons, it would probably be more robust to look at which experts activate for a mixture of experts model if this is possible. If the AI answers questions coherently and consistently in this sort of case, then we have a channel of communication that wasn’t directly shaped by training. Overall, I think that if this approach strongly succeeds it seems like a very compelling argument for some part of the AI having preferences, some sort of ‘inner world’, and understanding its position in the world. It might be evidence for consciousness under some stories of consciousness. It’s quite likely this approach fails because the AI system doesn’t have control over its thoughts like this (but I think not having control over its thoughts is some evidence against the AI having some sort of inner world like humans do). It’s also possible this fails because the preferences and morally relevant states of a given AI system are a product of it thinking over the generation of multiple tokens. Beyond being very clear evidence, another advantage of this approach is that an AI system might not have strong control over its outputs (much in the same way humans can’t control their heartbeat very easily).
- ryan_greenblatt 30 Oct 2025 17:41 UTC
  4 points
  0
  Parent
  Also, can models now be prompted to trick probes? (My understanding is this doesn’t work for relatively small open source models, but maybe SOTA models can now do this?)
  - J Bostock 30 Oct 2025 22:10 UTC
    2 points
    0
    Parent
    Has anyone done any experiments into whether a model can interfere with the training of a probe (like that bit in the most recent Yudtale) by manipulating its internals?