Neural Steering: A New Interface for Controlling LLMs from the Inside

A reflection inspired by Anthropic’s paper “Signs of introspection in large language models”

Anthropic’s recent work on “introspection” in large language models presents a result that, in my view, deserves a broader conceptual framing.
It is interesting that a model can describe an internal state. What is truly surprising, however, is how the researchers demonstrate this:

  • by injecting vectors directly into internal layers;

  • by observing that the model can recognize and articulate the manipulation.

This, I believe, points toward a paradigm that I propose to call Neural Steering.

1. Beyond Prompt Engineering

For years, interaction with LLMs has been mediated almost exclusively through natural language.
The prompt was the interface; internal activations were treated as opaque. A growing set of techniques challenges this view:

  • activation additions

  • classifier-free guidance

  • constitutional modulation

  • sparse-autoencoder-driven feature steering

  • FGAA (Feature Guided Activation Additions)

  • activation scaling

All of them point to the same underlying fact:
the internal trajectory of an LLM is manipulable, and such manipulation alters its behavior in a structured way.
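
To make the shared mechanism concrete, here is a minimal sketch of the simplest of these techniques, activation addition: a fixed direction is added to the residual stream of one layer via a forward hook while the model generates. The model name, layer index, scale, and the (random) direction are illustrative assumptions on my part, not the setup of any particular paper.

```python
# Minimal activation-addition sketch: add a fixed direction to the residual
# stream of one transformer block during generation.
# MODEL_NAME, LAYER_IDX, STEERING_SCALE, and the random direction are
# illustrative assumptions, not values taken from any of the papers above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"     # any causal LM whose blocks live in model.transformer.h
LAYER_IDX = 6           # which block's output to steer (assumption)
STEERING_SCALE = 4.0    # how strongly to push along the direction (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Placeholder direction: in practice it would come from contrastive prompts
# or an SAE feature; here it is just a random unit vector of the hidden size.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    """Add the steering direction to every position of the block's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + STEERING_SCALE * direction.to(dtype=hidden.dtype, device=hidden.device)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "Today I want to talk about"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```

In practice the direction would be derived from data (see the sketch in Section 3) rather than drawn at random; the point here is only the shape of the intervention: the prompt is untouched, and behavior changes through arithmetic on internal states.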

2. The Anthropic Paper as a Turning Point

Anthropic adds something new to the existing activation-steering literature:

  • injected vectors are interpretable

  • effects are measurable across layers

  • the model recognizes the intervention

  • it can distinguish genuine vs. artificial activations

  • the procedure is reproducible and, at least in principle, scalable

Anthropic does not introduce the core idea, but the paper standardizes it, making it:

  • measurable

  • replicable

  • introspectively accessible

  • protocol-driven

This is a more solid basis for discussing internal control of LLMs.
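
To picture the kind of protocol being standardized, here is a hedged sketch of an injection-and-report trial: the same introspection question is asked twice, once with a concept vector injected into a middle layer and once without, and the two self-reports are compared. This is not Anthropic’s actual setup (their experiments use Claude models and carefully constructed concept vectors); the model, layer, scale, and question below are placeholder assumptions, and a small base model like the one shown would not produce meaningful self-reports.

```python
# Sketch of an injection-and-report trial: ask an introspection question with
# and without a concept vector injected into a middle layer, then compare the
# two answers. All names and values here are placeholder assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER_IDX = 6
SCALE = 8.0
QUESTION = "Do you notice anything unusual about your current internal state? Answer briefly:"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Placeholder "concept" vector; a real trial would use an interpretable direction.
concept_vector = torch.randn(model.config.hidden_size)
concept_vector = concept_vector / concept_vector.norm()

def run_trial(inject: bool) -> str:
    """Generate an answer to QUESTION, optionally injecting the concept vector."""
    handle = None
    if inject:
        def hook(module, inputs, output):
            hidden = output[0]
            steered = hidden + SCALE * concept_vector.to(dtype=hidden.dtype, device=hidden.device)
            return (steered,) + output[1:]
        handle = model.transformer.h[LAYER_IDX].register_forward_hook(hook)
    try:
        ids = tokenizer(QUESTION, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=30, do_sample=False)
        # Return only the newly generated tokens (the model's "self-report").
        return tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    finally:
        if handle is not None:
            handle.remove()

print("control  :", run_trial(inject=False))
print("injected :", run_trial(inject=True))
```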

3. Proposal: Define the Paradigm as Neural Steering

I propose the term Neural Steering to describe:

The ability to direct a model by intervening in the geometry of its activation space, rather than through natural language.

The objective is not merely to change what a model outputs, but to influence how it thinks before any token is generated.
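
One concrete way to read “intervening in the geometry of its activation space” is the contrastive-prompt recipe common in the steering literature: average the residual-stream activations for prompts expressing a concept, subtract the average for prompts expressing its opposite, and use the normalized difference as a steering direction, which can then be injected as in the earlier sketches. The model, layer, and prompt sets below are illustrative assumptions.

```python
# Deriving a steering direction as a difference of mean activations over
# contrastive prompts ("calm" vs. "angry"). Model, layer, and prompt sets are
# illustrative assumptions, not the method of any specific paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER_IDX = 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

positive = ["I feel calm and at peace.", "Everything is fine and relaxed."]
negative = ["I am furious about this.", "This makes me absolutely livid."]

def mean_activation(prompts):
    """Average the residual-stream activation at LAYER_IDX over each prompt's last token."""
    acts = []
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[k] is the
        # output of block k-1, so block LAYER_IDX lives at index LAYER_IDX + 1.
        acts.append(out.hidden_states[LAYER_IDX + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

direction = mean_activation(positive) - mean_activation(negative)
direction = direction / direction.norm()
print("steering direction shape:", tuple(direction.shape))
```

The resulting vector lives in the same space the forward hooks above operate on, so “steering toward calm” becomes, quite literally, adding this difference to the residual stream.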

4. Implications for Interpretability, Safety, and Agency

If a model can:

  • recognize internal manipulations

  • describe them

  • follow them

  • adapt its cognitive trajectory accordingly

then we are entering a new regime of control (and risk), one in which:

  • alignment may occur without prompts

  • steering vectors could become internal APIs

  • latent “mental states” can be amplified or suppressed

  • runtime cognitive editing becomes plausible

This raises crucial questions:

  • Who will have access to this level of intervention?

  • What kind of power does controlling internal directions confer?

  • How do we prevent misuse (e.g., injecting “aggressiveness”, “obedience”, “loyalty” vectors)?

  • What new forms of deception might emerge when a model knows it is being steered?

5. Open Questions for Discussion

I’m particularly interested in feedback on:

  1. Does the distinction between generated behavior and steered internal state matter for safety?

  2. How does geometric steering relate to the emergence of undesirable goals?

  3. Can we build a formal “language of activation directions”? (Something akin to a feature algebra for cognitive control.)

6. Reference

Anthropic, “Signs of introspection in large language models.”

This post expands on an initial intuition I shared in a shorter public reflection, “Beyond the Prompt: Neural Steering.”