How To Go From Interpretability To Alignment: Just Retarget The Search
[EDIT: Many people who read this post were very confused about some things, which I later explained in What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? You might want to read that post first.]
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.
In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.
We’ll need to make two assumptions:
Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
Given these two assumptions, here’s how to use interpretability tools to align the AI:
Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
Identify the retargetable internal search process.
Retarget (i.e. directly rewire/set the input state of) the internal search process so that it points at the internal representation of our alignment target.
Just retarget the search. Bada-bing, bada-boom.
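To make the shape of the procedure concrete, here's a minimal toy sketch in Python. Everything in it is hypothetical: InterpretedModel, SearchProcess, retarget_search, and the concept vectors are made-up stand-ins for interpretability tooling and model internals that don't exist yet. It only illustrates the three steps: look up the target concept, locate the search process's goal slot, overwrite it.

```python
# Toy sketch of the three-step procedure, assuming (hypothetically) that
# interpretability tools have already given us:
#   (1) a dictionary mapping human-legible concepts to internal representations, and
#   (2) a handle on the model's internal search process and its goal slot.
# All names here (InterpretedModel, SearchProcess, retarget_search) are
# illustrative, not an existing API.
from dataclasses import dataclass, field


@dataclass
class SearchProcess:
    """Stand-in for a retargetable internal search: it optimizes whatever
    internal representation currently sits in its goal slot."""
    goal_representation: list[float] = field(default_factory=list)


@dataclass
class InterpretedModel:
    """Stand-in for a trained model plus interpretability results."""
    concept_library: dict[str, list[float]]  # step 1: identified internal concepts
    search: SearchProcess                    # step 2: identified search process


def retarget_search(model: InterpretedModel, alignment_target: str) -> None:
    """Step 3: point the internal search at our chosen target concept."""
    target_repr = model.concept_library[alignment_target]
    model.search.goal_representation = target_repr  # the actual "retargeting"


# Usage with made-up representations:
model = InterpretedModel(
    concept_library={
        "user_intention": [0.2, 0.9, -0.1],
        "human_mimicry": [0.7, 0.1, 0.4],
    },
    # whatever mesa-objective training happened to produce:
    search=SearchProcess(goal_representation=[0.0, 0.0, 1.0]),
)
retarget_search(model, "user_intention")
assert model.search.goal_representation == [0.2, 0.9, -0.1]
```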
Problems
Of course, as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or “human mimicry” are more likely than “human values”). But those details seem relatively straightforward.
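As a toy illustration of the on-the-fly retargeting caveat, here's a hedged sketch in the same spirit as the one above. The helpers find_search and express_target are placeholder names for hypothetical interpretability tools, not anything that currently exists; the loop just re-points whatever (possibly partly-formed) search exists at our target every so many steps.

```python
from typing import Callable


def train_with_periodic_retargeting(
    model,
    train_step: Callable,     # one ordinary training update for this system
    find_search: Callable,    # hypothetical tool: locate the (possibly partly-formed) search, or return None
    express_target: Callable, # hypothetical tool: write our target in the model's internal concept-language
    alignment_target: str = "user_intention",
    num_steps: int = 10_000,
    retarget_every: int = 100,
) -> None:
    """Retarget on-the-fly throughout training, so a misaligned search never
    gets a long window in which to act before we get around to retargeting."""
    for step in range(num_steps):
        train_step(model)
        if step % retarget_every != 0:
            continue
        search = find_search(model)
        if search is None:    # no internal search process has formed yet
            continue
        search.goal_representation = express_target(model, alignment_target)
```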
A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well, otherwise our target may fall apart.
Then, of course, there are the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).
Upsides
First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.
Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.
But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, moving it around and hiding it in different black boxes but never actually eliminating it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.
As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.