MATS Program

Tag. Last edit: 18 Mar 2026 19:49 UTC by Ryan Kidd

The Machine Alignment, Transparency, and Security (MATS) Program is an independent research and educational seminar program that provides emerging researchers with mentorship, talks & workshops, research support, and connections with the SF Bay Area and London AI safety research communities.

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

5 Feb 2023 22:02 UTC

679 points

208 comments12 min readLW link 1 review

SERI MATS Program—Winter 2022 Cohort

Ryan Kidd, Victor Warlop and Christian Smith

8 Oct 2022 19:09 UTC

72 points

12 comments4 min readLW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

336 points

28 comments23 min readLW link

Project proposal: Testing the IBP definition of agent

Jeremy Gillen, Thomas Larsen and JamesH

9 Aug 2022 1:09 UTC

21 points

4 comments2 min readLW link

How MATS addresses “mass movement building” concerns

Ryan Kidd4 May 2023 0:55 UTC

63 points

9 comments3 min readLW link

Soft optimization makes the value target bigger

Jeremy Gillen2 Jan 2023 16:06 UTC

123 points

20 comments12 min readLW link

SERI ML Alignment Theory Scholars Program 2022

Ryan Kidd, Victor Warlop and ozhang

27 Apr 2022 0:43 UTC

71 points

6 comments3 min readLW link

SERI MATS—Summer 2023 Cohort

Aris, Ryan Kidd and Christian Smith

8 Apr 2023 15:32 UTC

71 points

25 comments4 min readLW link

Finite Factored Sets in Pictures

Magdalena Wache11 Dec 2022 18:49 UTC

189 points

35 comments12 min readLW link

Talk: AI safety fieldbuilding at MATS

Ryan Kidd23 Jun 2024 23:06 UTC

26 points

2 comments10 min readLW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker26 Aug 2022 18:26 UTC

120 points

48 comments1 min readLW link

Sycophancy Towards Researchers Drives Performative Misalignment

Taywon Min, Rustem, David Vella Zarb and Shi

18 Mar 2026 4:59 UTC

10 points

2 comments21 min readLW link

Synthetic Persona Pretraining: Alignment from Token Zero

Julian Minder, Raghav Singhal, Viktor Moskvoretskii, Stefan Krsteski, ashtonanderson, rolandaydin and Robert West

20 May 2026 14:16 UTC

118 points

27 comments17 min readLW link

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

1 Mar 2023 5:16 UTC

105 points

10 comments5 min readLW link

MATS Spring 2024 Extension Retrospective

HenningB, Matthew Wearden, Cameron Holmes and Ryan Kidd

12 Feb 2025 22:43 UTC

27 points

1 comment15 min readLW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery9 Aug 2023 7:06 UTC

69 points

20 comments12 min readLW link

Neural Tangent Kernel Distillation

Thomas Larsen and Jeremy Gillen

5 Oct 2022 18:11 UTC

79 points

20 comments8 min readLW link

My MATS Summer 2023 experience

James Chua20 Mar 2024 11:26 UTC

30 points

0 comments3 min readLW link

(jameschua.net)

In-context learning alone can induce weird generalisation

Cozmin Ududec, Benji Berczi and Kyuhee Kim

25 Feb 2026 2:46 UTC

71 points

3 comments8 min readLW link

Applying to MATS: What the Program Is Like, and Who It’s For

Raj Thimmiah, Elise Racine and Ryan Kidd

17 Jan 2026 0:25 UTC

24 points

1 comment5 min readLW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide22 Jul 2024 18:45 UTC

118 points

20 comments12 min readLW link

Normative vs Descriptive Models of Agency

mattmacdermott2 Feb 2023 20:28 UTC

26 points

5 comments4 min readLW link

Infra-Bayesian haggling

hannagabor20 May 2024 12:23 UTC

31 points

1 comment20 min readLW link 1 review

Models have linear representations of what tasks they like

OscarGilg5 Mar 2026 18:44 UTC

55 points

16 comments11 min readLW link

I found >800 orthogonal “write code” steering vectors

Jacob G-W and TurnTrout

15 Jul 2024 19:06 UTC

114 points

20 comments7 min readLW link

(jacobgw.com)

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal, Victor Gillioz, TurnTrout and cloud

14 Oct 2025 0:53 UTC

145 points

15 comments10 min readLW link

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

5 Aug 2024 22:20 UTC

62 points

13 comments10 min readLW link

What’s the Point of the Math?

Ashe Vazquez Nuñez5 Feb 2026 11:30 UTC

46 points

3 comments5 min readLW link

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask, Twm Stone, Josh Hills, Ida Caspary, Shubhorup Biswas and Julian Stastny

10 Jun 2026 17:58 UTC

275 points

23 comments4 min readLW link

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

2 Jan 2024 0:47 UTC

125 points

29 comments8 min readLW link

(arxiv.org)

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

21 Sep 2023 15:30 UTC

161 points

8 comments5 min readLW link

Petri: An open-source auditing tool to accelerate AI safety research

Sam Marks7 Oct 2025 20:39 UTC

77 points

0 comments1 min readLW link

(alignment.anthropic.com)

Balancing Security Mindset with Collaborative Research: A Proposal

MadHatter1 Nov 2023 0:46 UTC

9 points

3 comments4 min readLW link

Qualities that alignment mentors value in junior researchers

Orpheus1614 Feb 2023 23:27 UTC

88 points

14 comments3 min readLW link

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

27 Sep 2024 17:49 UTC

62 points

10 comments4 min readLW link

MATS is hiring!

Ryan Kidd and VVN

8 Apr 2025 20:45 UTC

8 points

0 comments6 min readLW link

Towards surfacing model algorithms with meta-tokens in the J-Space

agam_bhatia, camilablank and Neel Nanda

20 Jul 2026 19:45 UTC

47 points

1 comment10 min readLW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

177 points

37 comments2 min readLW link

Talent Needs of Technical AI Safety Teams

yams, Carson Jones, deus_ex_maki and Ryan Kidd

24 May 2024 0:36 UTC

133 points

65 comments14 min readLW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

226 points

44 comments45 min readLW link 1 review

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

TurnTrout and cloud

26 Dec 2025 17:20 UTC

42 points

0 comments2 min readLW link

(turntrout.com)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

89 points

14 comments9 min readLW link

(arxiv.org)

A (Romanticised) Taxonomy of Thinkers

Ashe Vazquez Nuñez14 Jul 2026 0:34 UTC

11 points

0 comments1 min readLW link

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

13 May 2023 18:42 UTC

442 points

98 comments50 min readLW link 1 review

Apply for MATS Winter 2023-24!

utilistrutil, Ryan Kidd and LauraVaughan

21 Oct 2023 2:27 UTC

104 points

6 comments5 min readLW link

Apply to MATS 9.0!

Ryan Kidd10 Sep 2025 18:04 UTC

47 points

0 comments1 min readLW link

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

13 Jun 2025 13:45 UTC

239 points

43 comments8 min readLW link

(arxiv.org)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai16 Apr 2024 21:16 UTC

442 points

103 comments12 min readLW link 1 review

Model Organisms for Emergent Misalignment

Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan and Neel Nanda

16 Jun 2025 15:46 UTC

120 points

19 comments5 min readLW link

MATS AI Safety Strategy Curriculum

Ronny Fernandez and Ryan Kidd

7 Mar 2024 19:59 UTC

74 points

2 comments16 min readLW link

Introduction to inaccessible information

Ryan Kidd9 Dec 2021 1:28 UTC

27 points

6 comments8 min readLW link

[Paper] How does information access affect LLM monitors’ ability to detect sabotage?

Rauno Arike, Raja Moreno, RohanS, Shubhorup Biswas and Francis Rhys Ward

11 Feb 2026 21:25 UTC

26 points

0 comments6 min readLW link

MATS Summer 2023 Retrospective

utilistrutil, Juan Gil, Ryan Kidd, Christian Smith, deus_ex_maki and LauraVaughan

1 Dec 2023 23:29 UTC

78 points

34 comments26 min readLW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

24 Aug 2024 0:56 UTC

73 points

10 comments20 min readLW link

Can Frontier Models Autocomplete Safety Research?

dani roytburg and Shi

12 Jul 2026 10:28 UTC

20 points

4 comments22 min readLW link

(djroytburg.github.io)

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

20 Apr 2023 22:26 UTC

46 points

11 comments10 min readLW link

Current activation oracles are hard to use

aryaj, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2026 19:33 UTC

83 points

4 comments16 min readLW link

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

164 points

2 comments2 min readLW link

(www.goodfire.ai)

Apply to MATS 7.0!

Ryan Kidd and K Richards

21 Sep 2024 0:23 UTC

32 points

0 comments5 min readLW link

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

Jasmine Li and Alex Turner

27 May 2026 19:33 UTC

76 points

5 comments10 min readLW link

(turntrout.com)

Defending Against Model Weight Exfiltration Through Inference Verification

Roy Rinberg, Adam Karvonen, dreuter and Keri Warr

15 Dec 2025 15:26 UTC

120 points

15 comments8 min readLW link

Concept Poisoning: Probing LLMs without probes

Jan Betley, Jorio Cocola, Dylan Feng and Owain_Evans

5 Aug 2025 17:00 UTC

60 points

5 comments13 min readLW link

Clarifying mesa-optimization

Marius Hobbhahn and Pierre Peigné

21 Mar 2023 15:53 UTC

38 points

6 comments10 min readLW link

Alignment faking CTFs: Apply to my MATS stream

joshc4 Apr 2025 16:29 UTC

61 points

0 comments4 min readLW link

Broad Basins and Data Compression

Jeremy Gillen, Stephen Fowler and Thomas Larsen

8 Aug 2022 20:33 UTC

33 points

6 comments7 min readLW link

MATS mentor selection

DanielFilan and Ryan Kidd

10 Jan 2025 3:12 UTC

44 points

12 comments6 min readLW link

Training a Reward Hacker Despite Perfect Labels

ariana_azarbal, Victor Gillioz and TurnTrout

14 Aug 2025 23:57 UTC

142 points

47 comments4 min readLW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

87 points

7 comments12 min readLW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

260 points

96 comments10 min readLW link 1 review

Trends in Economic Inputs to AI

Jeffrey Heninger11 Sep 2025 21:51 UTC

87 points

6 comments12 min readLW link

Game Theory without Argmax [Part 2]

Cleo Nardo11 Nov 2023 16:02 UTC

31 points

14 comments13 min readLW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC

9 points

2 comments1 min readLW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda6 Feb 2025 11:03 UTC

73 points

7 comments8 min readLW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan and Neel Nanda

14 Jan 2024 2:06 UTC

24 points

0 comments42 min readLW link

Auditing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC

33 points

1 comment7 min readLW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP, Peter S. Park and Stephen Fowler

16 Aug 2022 2:09 UTC

21 points

2 comments16 min readLW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

3 Feb 2024 6:50 UTC

78 points

4 comments8 min readLW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC

29 points

7 comments12 min readLW link

MATS Alumni Impact Analysis

utilistrutil, Juan Gil, yams, LauraVaughan, K Richards and Ryan Kidd

30 Sep 2024 2:35 UTC

62 points

7 comments11 min readLW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_W18 Jul 2022 3:38 UTC

15 points

1 comment10 min readLW link

Can We Align a Self-Improving AGI?

Peter S. Park30 Aug 2022 0:14 UTC

8 points

5 comments11 min readLW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Patrick Leask, Bart Bussmann and Neel Nanda

17 Aug 2024 1:16 UTC

54 points

0 comments5 min readLW link

Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck, Adam Kaufman and Tyler Tracy

16 Apr 2025 16:21 UTC

128 points

0 comments20 min readLW link

Stitching SAEs of different sizes

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges and Neel Nanda

13 Jul 2024 17:19 UTC

39 points

12 comments12 min readLW link

Race Along Rashomon Ridge

Stephen Fowler, Peter S. Park and MichaelEinhorn

7 Jul 2022 3:20 UTC

52 points

16 comments9 min readLW link

My Advice for Incoming SERI MATS Scholars

Johannes C. Mayer3 Jan 2023 19:25 UTC

59 points

6 comments4 min readLW link

MATS AI Safety Strategy Curriculum v2

DanielFilan and Ryan Kidd

7 Oct 2024 22:44 UTC

44 points

6 comments13 min readLW link

Uncertainty in all its flavours

Cleo Nardo9 Jan 2024 16:21 UTC

34 points

6 comments35 min readLW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC

27 points

2 comments2 min readLW link

Intervening in the Residual Stream

MadHatter22 Feb 2023 6:29 UTC

30 points

1 comment9 min readLW link

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw, Josh Engels and Neel Nanda

29 Dec 2025 21:55 UTC

77 points

11 comments19 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC

17 points

3 comments1 min readLW link

Best-of-N Jailbreaking

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez and mrinank_sharma

14 Dec 2024 4:58 UTC

79 points

5 comments2 min readLW link

(arxiv.org)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley and Owain_Evans

25 Feb 2025 17:39 UTC

335 points

92 comments4 min readLW link

Paper: Prompt Optimization Makes Misalignment Legible

Caleb Biddulph and micahcarroll

12 Feb 2026 19:45 UTC

63 points

8 comments8 min readLW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett Baker24 Jul 2022 0:42 UTC

7 points

0 comments10 min readLW link

What sorts of systems can be deceptive?

Andrei Alexandru31 Oct 2022 22:00 UTC

17 points

0 comments7 min readLW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez12 Jan 2026 19:32 UTC

25 points

2 comments3 min readLW link

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC

60 points

8 comments20 min readLW link

Consequentialists: One-Way Pattern Traps

David Udell16 Jan 2023 20:48 UTC

67 points

3 comments14 min readLW link

More findings on maximal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC

27 points

1 comment11 min readLW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout

6 Dec 2024 22:19 UTC

180 points

16 comments11 min readLW link 1 review

(arxiv.org)

Reasoning Models Struggle to Control Their Chains of Thought

Yueh Han "John" Chen, robert mccarthy, Bruce W. Lee and Tomek Korbak

5 Mar 2026 22:37 UTC

76 points

9 comments3 min readLW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

202 points

23 comments6 min readLW link

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS24 Dec 2022 4:17 UTC

28 points

3 comments12 min readLW link

Forecasting Frontier Language Model Agent Capabilities

fidgetsinner, Axel Højmark, Jérémy Scheurer and Marius Hobbhahn

24 Feb 2025 16:51 UTC

35 points

0 comments5 min readLW link

(www.apolloresearch.ai)

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

85 points

9 comments18 min readLW link

OthelloGPT learned a bag of heuristics

Jennifer Lin, JackS, Adam Karvonen and Can

2 Jul 2024 9:12 UTC

111 points

10 comments9 min readLW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Evan Anders and Adrià Garriga-alonso

23 Aug 2024 22:03 UTC

17 points

0 comments25 min readLW link

Apply to MATS 8.0!

Ryan Kidd and K Richards

20 Mar 2025 2:17 UTC

64 points

5 comments4 min readLW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

23 Aug 2024 18:52 UTC

43 points

8 comments16 min readLW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewy Slocum and Neel Nanda

5 Sep 2025 12:11 UTC

54 points

2 comments7 min readLW link

Takeaways From Our Recent Work on SAE Probing

Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2025 19:50 UTC

30 points

4 comments5 min readLW link

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

lennie, joanv, Shi and Jacob Pfau

2 Jul 2026 17:42 UTC

77 points

4 comments21 min readLW link

[Closed] Agent Foundations track in MATS

Vanessa Kosoy31 Oct 2023 8:12 UTC

54 points

1 comment1 min readLW link

(www.matsprogram.org)

A case for LLMs as Self-predictors

Ashe Vazquez Nuñez5 Jul 2026 0:25 UTC

32 points

6 comments10 min readLW link

More findings on Memorization and double descent

Marius Hobbhahn1 Feb 2023 18:26 UTC

53 points

2 comments19 min readLW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

13 Jun 2024 10:04 UTC

84 points

10 comments2 min readLW link

(arxiv.org)

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

92 points

3 comments5 min readLW link

(arxiv.org)

[ASoT] Reflectivity in Narrow AI

Ulisse Mini21 Nov 2022 0:51 UTC

6 points

1 comment1 min readLW link

MATS Autumn 2026 Fellowship Applications Now Open—Apply by June 7

Elise Racine, Raj Thimmiah and Ryan Kidd

13 May 2026 21:40 UTC

21 points

0 comments2 min readLW link

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

86 points

5 comments21 min readLW link

Text Compression Can Help Secure Model Weights

Roy Rinberg4 Mar 2026 23:30 UTC

45 points

12 comments10 min readLW link

How well do models follow their constitutions?

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Mar 2026 0:07 UTC

100 points

5 comments26 min readLW link

MATS 8.0 Research Projects

Jonathan Michala, DanielFilan and Ryan Kidd

9 Sep 2025 1:29 UTC

22 points

0 comments1 min readLW link

(substack.com)

models have some pretty funny attractor states

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Feb 2026 21:14 UTC

277 points

38 comments18 min readLW link

Automating LLM Auditing with Developmental Interpretability

htlou and evhub

4 Sep 2024 15:50 UTC

19 points

0 comments3 min readLW link

A Framework for Eval Awareness

LAThomson23 Jan 2026 10:16 UTC

39 points

6 comments8 min readLW link

How robust are natural language autoencoders to initialization?

michaelzhang and TurnTrout

10 Jul 2026 0:40 UTC

82 points

3 comments13 min readLW link

(turntrout.com)

Can Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz, Greg Kocher and Tim Hua

8 Nov 2025 18:26 UTC

26 points

10 comments8 min readLW link

[Question] How is ARC planning to use ELK?

jacquesthibs15 Dec 2022 20:11 UTC

24 points

5 comments1 min readLW link

Training goals for large language models

Johannes Treutlein18 Jul 2022 7:09 UTC

28 points

5 comments19 min readLW link

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

21 Jun 2024 12:56 UTC

33 points

3 comments19 min readLW link

Exploration hacking: can reasoning models subvert RL?

Damon Falck, Joschka Braun and Eyon Jang

30 Jul 2025 22:02 UTC

25 points

4 comments9 min readLW link

Can LLMs learn Steganographic Reasoning via RL?

robert mccarthy, Vasil Georgiev, Steven Basart and David Lindner

11 Apr 2025 16:33 UTC

30 points

3 comments6 min readLW link

Personascope: Measuring how deeply LLMs adopt personas

Benji Berczi, Kyuhee Kim, Sid Black and Cozmin Ududec

7 Jul 2026 18:38 UTC

38 points

7 comments19 min readLW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala and Fabien Roger

27 Feb 2026 17:25 UTC

27 points

0 comments10 min readLW link

Performance guarantees in classical learning theory and infra-Bayesianism

David Matolcsi28 Feb 2023 18:37 UTC

9 points

4 comments31 min readLW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

19 Nov 2022 21:04 UTC

45 points

0 comments3 min readLW link

[Research log] The board of Alphabet would stop DeepMind to save the world

Lucie Philippon16 Jul 2024 4:59 UTC

6 points

0 comments4 min readLW link

Why are counterfactuals elusive?

Martín Soto3 Mar 2023 20:13 UTC

14 points

6 comments2 min readLW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short29 Jul 2023 6:10 UTC

27 points

0 comments12 min readLW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson19 Jul 2022 6:56 UTC

11 points

4 comments18 min readLW link

Results from a survey on tool use and workflows in alignment research

jacquesthibs, Jan, janus and Logan Riggs

19 Dec 2022 15:19 UTC

79 points

2 comments19 min readLW link

Research agenda: Supervising AIs improving AIs

Quintin Pope, Owen D, Roman Engeler and jacquesthibs

29 Apr 2023 17:09 UTC

76 points

5 comments19 min readLW link

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

59 points

2 comments19 min readLW link

Modelling Deception

Garrett Baker18 Jul 2022 21:21 UTC

15 points

0 comments7 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC

140 points

23 comments6 min readLW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC

44 points

9 comments19 min readLW link

Game Theory without Argmax [Part 1]

Cleo Nardo11 Nov 2023 15:59 UTC

78 points

18 comments19 min readLW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

10 Aug 2022 18:14 UTC

28 points

30 comments11 min readLW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

LawrenceC, Erik Jenner and Leon Lang

16 Mar 2023 16:38 UTC

48 points

0 comments13 min readLW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver Sourbut9 May 2022 21:38 UTC

73 points

19 comments8 min readLW link 1 review

(www.oliversourbut.net)

A mostly critical review of infra-Bayesianism

David Matolcsi28 Feb 2023 18:37 UTC

110 points

9 comments29 min readLW link

Sources of evidence in Alignment

Martín Soto2 Jul 2023 20:38 UTC

22 points

0 comments11 min readLW link

Agents Can Get Stuck in Self-distrusting Equilibria

Ashe Vazquez Nuñez24 Mar 2026 22:05 UTC

32 points

2 comments12 min readLW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

wesg and Neel Nanda

3 May 2023 13:30 UTC

33 points

6 comments2 min readLW link 1 review

(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery28 Jul 2023 2:46 UTC

122 points

18 comments9 min readLW link 1 review

MATS 9 Retrospective & Advice

beyarkay (Boyd Kane)15 May 2026 12:30 UTC

203 points

13 comments18 min readLW link

(boydkane.com)

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC

37 points

8 comments4 min readLW link

Does distilling Claude carry the persona with it?

Benji Berczi and Kyuhee Kim

24 Jul 2026 12:31 UTC

40 points

4 comments10 min readLW link

How complex are myopic imitators?

Vivek Hebbar8 Feb 2022 12:00 UTC

26 points

1 comment15 min readLW link

Notes on Learning the Prior

carboniferous_umbraculum 15 Jul 2022 17:28 UTC

25 points

2 comments25 min readLW link

A Bunch of Matryoshka SAEs

chanind, TomasD and Adrià Garriga-alonso

4 Apr 2025 14:53 UTC

29 points

0 comments8 min readLW link

Searching for a model’s concepts by their shape – a theoretical framework

Kaarel, Georgios Kaklamanos, Walter Laurito, Kay Kozaronek, AlexMennen and June Ku

23 Feb 2023 20:14 UTC

51 points

0 comments19 min readLW link

Large Language Models will be Great for Censorship

Ethan Edwards21 Aug 2023 19:03 UTC

185 points

14 comments8 min readLW link

(ethanedwards.substack.com)

[Job Ad] MATS is hiring!

Jana, LauraVaughan, yams, Christian Smith and Ryan Kidd

9 Oct 2024 2:17 UTC

10 points

0 comments5 min readLW link

Polysemantic Attention Head in a 4-Layer Transformer

Jett Janiak, cmathw and StefanHex

9 Nov 2023 16:16 UTC

51 points

0 comments6 min readLW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB, Owain_Evans and SoerenMind

28 Sep 2023 18:53 UTC

187 points

39 comments3 min readLW link 1 review

Steering Evaluation-Aware Models to Act Like They Are Deployed

Tim Hua, andrq, Sam Marks and Neel Nanda

30 Oct 2025 15:03 UTC

62 points

12 comments18 min readLW link

Interview: Applications w/ Alice Rigg

jacobhaimes19 Dec 2023 19:03 UTC

12 points

0 comments1 min readLW link

(into-ai-safety.github.io)

The many paths to permanent disempowerment even with shutdownable AIs (MATS project summary for feedback)

GideonF29 Jul 2025 23:20 UTC

64 points

8 comments9 min readLW link

Evaluating Prediction in Acausal Mixed-Motive Settings

Tim Chan31 Aug 2025 22:58 UTC

14 points

0 comments6 min readLW link

Eliciting secret knowledge from language models

Bartosz Cywiński, Arthur Conmy and Sam Marks

2 Oct 2025 20:57 UTC

69 points

3 comments2 min readLW link

(arxiv.org)

Red-teaming language models via activation engineering

Nina Panickssery26 Aug 2023 5:52 UTC

69 points

6 comments9 min readLW link

Power Laws Are Not Enough

CarolusRenniusVitellius19 Feb 2026 4:31 UTC

10 points

3 comments4 min readLW link

(charlesr-w.github.io)

Trying to find the underlying structure of computational systems

Matthias G. Mayer13 Sep 2022 21:16 UTC

21 points

9 comments4 min readLW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs30 Dec 2022 2:40 UTC

105 points

2 comments18 min readLW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta14 Aug 2024 23:24 UTC

13 points

0 comments2 min readLW link

(ronakrm.github.io)

Identification of Natural Modularity

Stephen Fowler25 Jun 2022 15:05 UTC

15 points

3 comments7 min readLW link

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk

Lucie Philippon25 Jul 2024 1:12 UTC

18 points

7 comments2 min readLW link

How Interpretability can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC

19 points

0 comments37 min readLW link

MATS Winter 2023-24 Retrospective

utilistrutil, LauraVaughan, deus_ex_maki, Christian Smith, Juan Gil, Henry Sleight, Matthew Wearden and Ryan Kidd

11 May 2024 0:09 UTC

92 points

28 comments49 min readLW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC

27 points

9 comments2 min readLW link

Bitter Lessons from Distillation Robustifies Unlearning

Bruce W. Lee28 Nov 2025 1:31 UTC

27 points

3 comments7 min readLW link

(www.lesswrong.com)

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy17 Jul 2023 1:41 UTC

56 points

1 comment7 min readLW link

A Conceptual Framework for Exploration Hacking

Joschka Braun, Eyon Jang and Damon Falck

12 Feb 2026 16:33 UTC

26 points

2 comments9 min readLW link

Translating between Latent Spaces

JamesH, Jeremy Gillen and NickyP

30 Jul 2022 3:25 UTC

27 points

2 comments8 min readLW link

How Go Players Disempower Themselves to AI

Ashe Vazquez Nuñez1 May 2026 23:24 UTC

737 points

79 comments8 min readLW link

An Informal Definition of Goals for Embedded Agents

Ashe Vazquez Nuñez24 Mar 2026 18:36 UTC

14 points

0 comments1 min readLW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatter26 Feb 2023 1:10 UTC

61 points

14 comments6 min readLW link

Prism: Automating Science-of-Evals Research

LAThomson13 Jul 2026 16:30 UTC

46 points

0 comments12 min readLW link

Post-hoc reasoning in chain of thought

Kyle Cox5 Feb 2025 18:58 UTC

20 points

0 comments11 min readLW link

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

7 Dec 2024 18:10 UTC

73 points

2 comments12 min readLW link

End-to-end hacking with language models

tchauvin5 Apr 2024 15:06 UTC

29 points

0 comments8 min readLW link

Unfaithful chain-of-thought as nudged reasoning

Paul B, Uzay Macar, Arthur Conmy and Neel Nanda

22 Jul 2025 22:35 UTC

54 points

3 comments10 min readLW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland22 Feb 2023 4:16 UTC

36 points

11 comments3 min readLW link

(www.jessehoogland.com)

Inner Alignment via Superpowers

JamesH, Thomas Larsen and Jeremy Gillen

30 Aug 2022 20:01 UTC

37 points

13 comments4 min readLW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde30 Oct 2024 22:50 UTC

27 points

0 comments12 min readLW link

Some Notes on the mathematics of Toy Autoencoding Problems

carboniferous_umbraculum 22 Dec 2022 17:21 UTC

18 points

1 comment12 min readLW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson19 Jul 2022 6:57 UTC

19 points

8 comments16 min readLW link

Some Summaries of Agent Foundations Work

mattmacdermott15 May 2023 16:09 UTC

68 points

1 comment13 min readLW link

Decomposing independent generalizations in neural networks via Hessian analysis

Dmitry Vaintrob and Nina Panickssery

14 Aug 2023 17:04 UTC

87 points

4 comments1 min readLW link

Finding Skeletons on Rashomon Ridge

David Udell, Peter S. Park and NickyP

24 Jul 2022 22:31 UTC

30 points

2 comments7 min readLW link

Guardian AI (Misaligned systems are all around us.)

Jessica Rumbelow25 Nov 2022 15:55 UTC

16 points

6 comments2 min readLW link

Towards data-centric interpretability with sparse autoencoders

Nick Jiang, lilysun004, lewis smith and Neel Nanda

15 Aug 2025 20:10 UTC

57 points

2 comments18 min readLW link

Natural Abstractions: Key Claims, Theorems, and Critiques

LawrenceC, Leon Lang and Erik Jenner

16 Mar 2023 16:37 UTC

250 points

26 comments45 min readLW link 3 reviews

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

Dmitry Vaintrob and Nina Panickssery

23 Aug 2023 21:12 UTC

84 points

1 comment13 min readLW link

MATS Models

johnswentworth9 Jul 2022 0:14 UTC

98 points

5 comments16 min readLW link

Suggestions for improving debate protocols in AI safety

tr5tn29 May 2026 0:23 UTC

13 points

7 comments5 min readLW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy

7 Nov 2024 15:39 UTC

51 points

7 comments11 min readLW link

Feature Hedging: Another way correlated features break SAEs

chanind, TomasD and Adrià Garriga-alonso

25 Mar 2025 14:33 UTC

23 points

0 comments18 min readLW link

My SERI MATS Application

Daniel Paleka30 May 2022 2:04 UTC

16 points

0 comments8 min readLW link

On Interpretability’s Robustness

Léo Dana18 Oct 2023 13:18 UTC

11 points

0 comments4 min readLW link

Updates on performative misalignment

David Vella Zarb, Rustem, Taywon Min and Shi

12 Jun 2026 20:15 UTC

29 points

0 comments12 min readLW link

On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing

Oliver Daniels10 Feb 2026 17:06 UTC

27 points

5 comments3 min readLW link

Thought Anchors: Which LLM Reasoning Steps Matter?

Uzay Macar, Paul B, Neel Nanda and Arthur Conmy

2 Jul 2025 20:16 UTC

36 points

6 comments6 min readLW link

(www.thought-anchors.com)

Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus

Oliver Daniels25 Feb 2026 19:43 UTC

82 points

5 comments8 min readLW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland22 Mar 2023 16:58 UTC

32 points

8 comments1 min readLW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum 19 Feb 2023 22:08 UTC

22 points

1 comment19 min readLW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs5 Dec 2022 13:36 UTC

20 points

11 comments2 min readLW link

When fine-tuning fails to elicit GPT-3.5′s chess abilities

Theodore Chapman14 Jun 2024 18:50 UTC

42 points

3 comments9 min readLW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang16 Jan 2023 22:46 UTC

31 points

7 comments17 min readLW link

(docs.google.com)

Studying Mechanistic of Alignment Faking in Llama-3.1-405B

Amina Keldibek25 Nov 2025 11:21 UTC

10 points

0 comments11 min readLW link

SAE Probing: What is it good for?

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan and Neel Nanda

1 Nov 2024 19:23 UTC

34 points

0 comments11 min readLW link

How important is AI hacking as LLMs advance?

Artem Karpov29 Jan 2024 18:41 UTC

1 point

0 comments6 min readLW link

Training Agents to Self-Report Misbehavior

Bruce W. Lee, Yueh Han "John" Chen and Tomek Korbak

25 Feb 2026 17:50 UTC

26 points

0 comments8 min readLW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

Georgios Kaklamanos, Walter Laurito, Kaarel and Kay Kozaronek

25 Jan 2023 19:03 UTC

48 points

6 comments12 min readLW link

Do models know when they are being evaluated?

fidgetsinner, Giles, Joe Needham and Marius Hobbhahn

17 Feb 2025 23:13 UTC

57 points

9 comments12 min readLW link

Getting up to Speed on the Speed Prior in 2022

robertzk28 Dec 2022 7:49 UTC

36 points

5 comments65 min readLW link

Revealing alignment faking with a single prompt

Florian_Dietz29 Jan 2025 21:01 UTC

9 points

5 comments4 min readLW link

Boomerang—protocol to dissolve some commitment races

Filip Sondej30 May 2023 16:21 UTC

37 points

10 comments8 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC

86 points

6 comments18 min readLW link

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

5 Apr 2025 6:24 UTC

114 points

7 comments7 min readLW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

20 Jul 2024 2:20 UTC

62 points

0 comments4 min readLW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné23 Sep 2023 16:21 UTC

30 points

8 comments5 min readLW link

Test your interpretability techniques by de-censoring Chinese models

Khoi Tran, aryaj, Senthooran Rajamanoharan and Neel Nanda

15 Jan 2026 16:33 UTC

92 points

14 comments20 min readLW link

Understanding SAE Features with the Logit Lens

Joseph Bloom and Johnny Lin

11 Mar 2024 0:16 UTC

71 points

2 comments14 min readLW link

Can We Change the Goals of a Toy RL Agent?

tuphs and Adrià Garriga-alonso

15 Jun 2025 20:34 UTC

20 points

0 comments9 min readLW link

1 Layer Induction Heads and Some Research

Goutham Nalagatla and Carlos Guerrero Alvarez

16 Jun 2026 18:09 UTC

10 points

2 comments14 min readLW link

Thoughts on interviewing candidates for AI safety fellowships

beyarkay (Boyd Kane)18 May 2026 15:28 UTC

36 points

4 comments7 min readLW link

(boydkane.com)

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

29 Aug 2023 1:04 UTC

77 points

4 comments1 min readLW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei Alexandru17 Jul 2022 20:31 UTC

24 points

0 comments10 min readLW link

Team Shard Status Report

David Udell9 Aug 2022 5:33 UTC

38 points

8 comments3 min readLW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

Clément Dumas, Julian Minder and Neel Nanda

30 Jun 2025 17:17 UTC

106 points

2 comments7 min readLW link

Gradient surfing: the hidden role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC

38 points

9 comments14 min readLW link

(www.jessehoogland.com)

Finding Goals in the World Model

Jeremy Gillen, JamesH and Thomas Larsen

22 Aug 2022 18:06 UTC

59 points

8 comments13 min readLW link

My experience applying to MATS 6.0

mic18 Jul 2024 19:02 UTC

19 points

3 comments5 min readLW link

Some real examples of gradient hacking

Oliver Sourbut22 Nov 2021 0:11 UTC

17 points

8 comments2 min readLW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Diego Caples, Jatin Nainani, CallumMcDougall and rrenaud

10 Jan 2025 11:08 UTC

88 points

11 comments17 min readLW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

18 Jul 2024 18:19 UTC

40 points

4 comments11 min readLW link

Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao and Adrià Garriga-alonso

9 Apr 2024 19:31 UTC

67 points

10 comments10 min readLW link

SolidGoldMagikarp II: technical details and more recent findings

mwatkins and Jessica Rumbelow

6 Feb 2023 19:09 UTC

114 points

45 comments13 min readLW link

Where do AI Safety Fellows go? Analyzing a dataset of 600+ alumni

Christopher_Clay2 Jan 2026 18:14 UTC

20 points

2 comments5 min readLW link

(forum.effectivealtruism.org)

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Bartosz Cywiński, Helena Casademunt, Khoi Tran, aryaj, Sam Marks and Neel Nanda

9 Mar 2026 18:50 UTC

39 points

3 comments5 min readLW link

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXI9 Feb 2023 4:30 UTC

26 points

0 comments9 min readLW link

What I like about MATS and Research Management

TheManxLoiner5 Apr 2026 16:14 UTC

8 points

0 comments4 min readLW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz10 Mar 2025 16:07 UTC

49 points

7 comments9 min readLW link

Preliminary Explorations on Latent Side Task Uplift

Bruce W. Lee2 Apr 2026 2:23 UTC

13 points

0 comments4 min readLW link

Implementing activation steering

Annah5 Feb 2024 17:51 UTC

76 points

8 comments7 min readLW link

Infra-Bayesian Logic

harfe and Yegreg

5 Jul 2023 19:16 UTC

16 points

2 comments1 min readLW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry29 Apr 2024 20:57 UTC

94 points

9 comments11 min readLW link

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Eyon Jang, Joschka Braun, Damon Falck and David Lindner

1 May 2026 20:54 UTC

24 points

0 comments8 min readLW link

Fixed points in mortal population games

ViktoriaMalyasova14 Mar 2023 7:10 UTC

31 points

0 comments12 min readLW link

(www.lesswrong.com)

Tips On Empirical Research Slides

James Chua, John Hughes, Ethan Perez and Owain_Evans

8 Jan 2025 5:06 UTC

117 points

4 comments6 min readLW link

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein, Rubi J. Hudson and Caspar Oesterheld

16 Dec 2022 18:22 UTC

80 points

8 comments21 min readLW link

Domain-specific SAEs

jacob_drori7 Oct 2024 20:15 UTC

28 points

2 comments5 min readLW link

Classifying representations of sparse autoencoders (SAEs)

Annah17 Nov 2023 13:54 UTC

15 points

6 comments2 min readLW link

Shard Theory: An Overview

David Udell11 Aug 2022 5:44 UTC

168 points

34 comments10 min readLW link

Neural networks generalize because of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC

215 points

35 comments15 min readLW link 1 review

(www.jessehoogland.com)

OpenAI finetuning metrics: What is going on with the loss curves?

Jorio Cocola and James Chua

24 Nov 2025 18:29 UTC

41 points

5 comments2 min readLW link

Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT

Jiachen Zhao23 Feb 2026 11:51 UTC

24 points

0 comments10 min readLW link

(arxiv.org)

The Core of the Alignment Problem is...

Thomas Larsen, Jeremy Gillen and JamesH

17 Aug 2022 20:07 UTC

76 points

10 comments9 min readLW link

Searching for Modularity in Large Language Models

NickyP and Stephen Fowler

8 Sep 2022 2:25 UTC

44 points

3 comments14 min readLW link

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

1 Jan 2026 16:37 UTC

25 points

0 comments23 min readLW link

Beliefs are Chosen to Serve Goals

Ashe Vazquez Nuñez7 Apr 2026 16:43 UTC

28 points

14 comments4 min readLW link

(tuesdaybornwhale.substack.com)

Understanding and visualizing sycophancy datasets

Nina Panickssery16 Aug 2023 5:34 UTC

47 points

0 comments6 min readLW link

Discovering Backdoor Triggers

andrq, Tim Hua, Sam Marks, Arthur Conmy and Neel Nanda

19 Aug 2025 6:24 UTC

57 points

4 comments13 min readLW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

63 points

0 comments12 min readLW link

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

27 Feb 2026 3:20 UTC

60 points

12 comments25 min readLW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC

16 points

0 comments42 min readLW link

The slingshot helps with learning

Wilson Wu31 Oct 2024 23:18 UTC

33 points

0 comments8 min readLW link

Advice for budding research managers/coaches after 6 months at MATS

TheManxLoiner28 May 2026 16:25 UTC

13 points

0 comments3 min readLW link

(lovkush.substack.com)

How (not) to choose a research project

Garrett Baker, CatGoddess and Johannes C. Mayer

9 Aug 2022 0:26 UTC

80 points

11 comments7 min readLW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

carboniferous_umbraculum 3 Sep 2022 0:15 UTC

35 points

5 comments6 min readLW link

Addressing Decision Theory’s Simulation Problem

Ashe Vazquez Nuñez3 Feb 2026 7:02 UTC

11 points

0 comments3 min readLW link

Approximation is expensive, but the lunch is cheap

Jesse Hoogland and Zach Furman

19 Apr 2023 14:19 UTC

77 points

3 comments16 min readLW link

Towards Sub-agent Dynamics and Conflict

Ashe Vazquez Nuñez25 Jan 2026 5:27 UTC

13 points

1 comment3 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

Wilson Wu and Louis Jaburi

13 Feb 2025 18:24 UTC

84 points

6 comments11 min readLW link

How to Design Environments for Understanding Model Motives

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

2 Mar 2026 7:14 UTC

51 points

0 comments10 min readLW link

Quantitative cruxes in Alignment

Martín Soto2 Jul 2023 20:38 UTC

19 points

0 comments23 min readLW link

Statistical suggestions for mech interp research and beyond

Paul B6 Aug 2025 12:45 UTC

65 points

4 comments15 min readLW link

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski and Wen Xing

24 Oct 2025 17:21 UTC

23 points

1 comment5 min readLW link

Stop-gradients lead to fixed point predictions

Johannes Treutlein, Caspar Oesterheld, Rubi J. Hudson and Emery Cooper

28 Jan 2023 22:47 UTC

37 points

2 comments24 min readLW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Axel Højmark, fidgetsinner, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer

22 Jul 2024 16:17 UTC

69 points

0 comments16 min readLW link

[Short version] Information Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:59 UTC

12 points

0 comments1 min readLW link

A brief note on Simplicity Bias

carboniferous_umbraculum 14 Aug 2022 2:05 UTC

20 points

0 comments4 min readLW link

Invulnerable Incomplete Preferences: A Formal Statement

SCP30 Aug 2023 21:59 UTC

139 points

39 comments24 min readLW link

Activation adding experiments with FLAN-T5

Nina Panickssery13 Jul 2023 23:32 UTC

21 points

5 comments7 min readLW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments18 min readLW link

Some Interesting Papers on RLVR

CarolusRenniusVitellius9 Jun 2026 19:00 UTC

24 points

5 comments4 min readLW link

Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black and Oliver Sourbut

13 Jul 2025 19:54 UTC

53 points

5 comments18 min readLW link

Reward Hacking Without Egregious Misalignment in an RL-Only Setting

Joey Yudelson, Vladimir Ivanov and ryan_greenblatt

24 Jun 2026 18:58 UTC

62 points

9 comments10 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

30 Aug 2023 17:36 UTC

17 points

0 comments8 min readLW link

(arxiv.org)

AI Safety Talent Needs in 2026: Insights for Field-Building Organizations

John Teichman24 Mar 2026 18:27 UTC

1 point

0 comments6 min readLW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

30 Apr 2024 17:58 UTC

76 points

14 comments17 min readLW link

Edge Cases in AI Alignment

Florian_Dietz24 Mar 2025 9:27 UTC

19 points

3 comments4 min readLW link

Understanding and controlling auto-induced distributional shift

L Rudolf L13 Dec 2021 14:59 UTC

33 points

4 comments16 min readLW link

Language Models Model Us

eggsyntax17 May 2024 21:00 UTC

159 points

56 comments7 min readLW link 1 review

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

25 Sep 2023 17:19 UTC

25 points

3 comments7 min readLW link

Irrationality as a Defense Mechanism for Reward-hacking

Ashe Vazquez Nuñez18 Jan 2026 3:57 UTC

49 points

8 comments4 min readLW link

Activation adding experiments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC

51 points

1 comment3 min readLW link

Early Experiments in Human Auditing for AI Control

Joey Yudelson and Buck

23 Jan 2025 1:34 UTC

28 points

1 comment7 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC

20 points

0 comments4 min readLW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett Janiak

20 Feb 2023 19:35 UTC

96 points

8 comments21 min readLW link

Eliciting base models with simple unsupervised techniques

Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala and Fabien Roger

23 Jan 2026 18:06 UTC

34 points

2 comments8 min readLW link

Information Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:58 UTC

62 points

31 comments7 min readLW link

Decoding intermediate activations in llama-2-7b

Nina Panickssery21 Jul 2023 5:35 UTC

39 points

3 comments4 min readLW link

Working towards AI alignment is better

Johannes C. Mayer9 Dec 2022 15:39 UTC

8 points

2 comments2 min readLW link

Building Black-box Scheming Monitors

CorrigibleAgent, richbc, Simon Storf and Marius Hobbhahn

29 Jul 2025 17:41 UTC

46 points

18 comments11 min readLW link

The Shard Theory Alignment Scheme

David Udell25 Aug 2022 4:52 UTC

47 points

32 comments2 min readLW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC

66 points

18 comments23 min readLW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC

6 points

2 comments4 min readLW link

How transparency changed over time

ViktoriaMalyasova30 Jul 2022 4:36 UTC

21 points

0 comments6 min readLW link

Early Signs of Steganographic Capabilities in Frontier LLMs

Kei Nishimura-Gasparian, Artur Zolkowski, robert mccarthy and David Lindner

4 Jul 2025 16:36 UTC

33 points

5 comments2 min readLW link

An open letter to SERI MATS program organisers

Roman Leventov20 Apr 2023 16:34 UTC

26 points

26 comments4 min readLW link

Spooky action at a distance in the loss landscape

Jesse Hoogland and Filip Sondej

28 Jan 2023 0:22 UTC

59 points

4 comments7 min readLW link

(www.jessehoogland.com)

Deception?! I ain’t got time for that!

Paul Colognese18 Jul 2022 0:06 UTC

55 points

5 comments13 min readLW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

15 Oct 2024 18:25 UTC

30 points

0 comments18 min readLW link

Coalitional Darwinism and the Instrumental Utility of Individuality

CarolusRenniusVitellius6 Jun 2026 12:53 UTC

28 points

4 comments17 min readLW link

(charlesr-w.github.io)

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs5 Dec 2022 16:09 UTC

28 points

6 comments8 min readLW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

CallumMcDougall, Arthur Conmy, Tom McGrath and Neel Nanda

13 Oct 2023 18:32 UTC

82 points

4 comments8 min readLW link

Bridging the VLM and mech interp communities for multimodal interpretability

Sonia Joseph28 Oct 2024 14:41 UTC

19 points

5 comments15 min readLW link

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Oliver Daniels, Perusha Moodley and David Lindner

10 Feb 2026 17:29 UTC

16 points

0 comments1 min readLW link

(arxiv.org)

Apply to MATS Summer 2026!

Raj Thimmiah, Ryan Kidd and Elise Racine

18 Dec 2025 1:51 UTC

31 points

0 comments1 min readLW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

Yegreg16 Feb 2023 19:31 UTC

27 points

0 comments7 min readLW link
