Introducing the AE Alignment Podcast (Ep. 1: Endogenous Steering Resistance with Alex McKenzie)

We’re launching the AE Alignment Podcast, a new series from AE Studio’s alignment research team where we talk with researchers about their work on AI safety and alignment.

In our first episode, host James Bowler sits down with Alex McKenzie to discuss Endogenous Steering Resistance (ESR), a phenomenon where large language models spontaneously resist activation steering during inference, sometimes recovering mid-generation to produce improved responses even while steering remains active.

What is ESR?

When you artificially perturb a language model’s internal activations using sparse autoencoder (SAE) latents to push it off-topic, you’d expect the model to just go along with it. Smaller models do, but Llama-3.3-70B does something unexpected: it sometimes catches itself mid-generation, says something like “Wait, that’s not right,” and course-corrects back to the original task.
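For readers who want a concrete picture of what this kind of intervention looks like in practice, here is a minimal sketch of activation steering with a single SAE decoder direction added to one layer's residual stream during generation. The model name, layer index, steering direction, and coefficient below are placeholders for illustration, not the paper's actual setup.

```python
# Minimal sketch of SAE-latent activation steering (illustrative only; the
# layer index, direction, and coefficient are assumptions, not the paper's setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # any causal LM works for trying the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Hypothetical steering direction: one SAE decoder row mapped back into the
# residual stream. In practice this would come from a trained SAE (sae.W_dec[latent_idx]).
d_model = model.config.hidden_size
steer_dir = torch.randn(d_model)
steer_dir = steer_dir / steer_dir.norm()
coeff = 8.0        # steering strength (illustrative)
layer_idx = 20     # which layer's residual stream to perturb (illustrative)

def steering_hook(module, inputs, output):
    # Add the scaled direction to every token position's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * steer_dir.to(device=hidden.device, dtype=hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
try:
    prompt = "Explain how photosynthesis works."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```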

The paper identifies 26 SAE latents that activate differentially during off-topic content and are causally linked to this self-correction behavior. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits.
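As a complementary sketch, zero-ablating SAE latents can be pictured as encoding the residual stream with the SAE, reading off the activations of the targeted latents, and subtracting their reconstructed contribution. The `TinySAE` class, its dimensions, and the latent indices below are placeholders; the 26 latents in the paper come from a trained SAE, which this standalone example does not include.

```python
# Sketch of zero-ablating chosen SAE latents (placeholder SAE and indices; in
# practice you would load a trained SAE matched to the model's d_model,
# on the same device and dtype as the model).
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Stand-in for a trained sparse autoencoder: structure only, weights untrained."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.02)
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU encoder, as in most residual-stream SAEs.
        return torch.relu(x @ self.W_enc + self.b_enc)

sae = TinySAE(d_model=512, n_latents=4096)   # toy sizes for the sketch
ablate_idx = torch.tensor([3, 17, 42])       # placeholders for the identified latents

def zero_ablation_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    latents = sae.encode(hidden)                      # [batch, seq, n_latents]
    removed = latents[..., ablate_idx]                # activations of the targeted latents
    contribution = removed @ sae.W_dec[ablate_idx]    # their residual-stream contribution
    hidden = hidden - contribution.to(hidden.dtype)   # ablate only that component
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# Attached exactly like the steering hook above:
# handle = model.model.layers[layer_idx].register_forward_hook(zero_ablation_hook)
```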

Key findings include:

  • ESR can be deliberately enhanced through meta-prompting (a 4x increase in self-correction rate) and fine-tuning; a toy illustration of such a prompt appears after this list

  • ESR has dual implications for safety: it could protect against adversarial manipulation, but it might also interfere with beneficial safety interventions that rely on activation steering

  • The phenomenon parallels endogenous attention control in biological systems, connecting to work on attention schema theory
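To make the meta-prompting item above slightly more concrete, here is a hypothetical illustration of where such an instruction would sit in a chat template. The wording is ours, not the prompt used in the paper, and `tok` is the tokenizer from the steering sketch earlier.

```python
# Hypothetical meta-prompt encouraging self-correction (wording is illustrative,
# not the paper's prompt).
messages = [
    {"role": "system",
     "content": ("If your response ever drifts away from the user's request, "
                 "notice it, acknowledge it, and return to the original task.")},
    {"role": "user", "content": "Explain how photosynthesis works."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```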

Why this matters

This work raises important open questions for the alignment community. If models develop internal mechanisms to resist externally imposed changes to their activations, that’s both potentially good news (robustness against adversarial attacks) and potentially bad news (resistance to safety interventions like representation engineering). Understanding and controlling these mechanisms seems important for developing transparent and controllable AI systems.

The research was funded by the AI Alignment Foundation (formerly Flourishing Future Foundation), and the continuation of this work is now supported by a grant from the UK AI Security Institute through the Alignment Project.

Listen and read more

We plan to release episodes regularly featuring conversations with alignment researchers about their work. If you have feedback on the episode or suggestions for future topics, we’d love to hear from you.

We’re also hiring alignment data scientists and technical PMs who want to work on alignment full-time.
