Apollo Research (org)

Tag. Last edit: 19 Jul 2024 10:42 UTC by Lee Sharkey

Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey

25 Jan 2025 13:12 UTC

109 points

21 comments, 4 min read

(publications.apolloresearch.ai)

[Interim research report] Activation plateaus & sensitive directions in GPT2

StefanHex and jake_mendel

5 Jul 2024 17:05 UTC

66 points

2 comments, 5 min read

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

17 May 2024 16:25 UTC

57 points

20 comments, 4 min read

(arxiv.org)

Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun

25 Sep 2023 16:27 UTC

64 points

16 comments, 7 min read

(www.apolloresearch.ai)

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and Alex Meinke

8 Jan 2024 18:24 UTC

58 points

2 comments, 12 min read

(www.apolloresearch.ai)

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, Alex Meinke and rusheb

29 May 2024 17:44 UTC

93 points

0 comments, 7 min read

Interpretability: Integrated Gradients is a decent attribution method

Lucius Bushnaq, jake_mendel, StefanHex and Kaarel

20 May 2024 17:55 UTC

24 points

7 comments, 6 min read

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn

4 Aug 2023 10:54 UTC

25 points

0 comments, 2 min read

The case for AGI safety products

Marius Hobbhahn

21 Jan 2026 17:23 UTC

68 points

7 comments, 12 min read

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, Alex Meinke, Owain_Evans and Marius Hobbhahn

8 Jul 2024 22:24 UTC

109 points

40 comments, 5 min read, 1 review

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

226 points

11 comments, 8 min read

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

126 points

18 comments, 18 min read

Try training token-level probes

StefanHex

14 Apr 2025 11:56 UTC

47 points

6 comments, 8 min read

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn

20 May 2024 17:53 UTC

108 points

4 comments, 3 min read

We need a Science of Evals

Marius Hobbhahn and Jérémy Scheurer

22 Jan 2024 20:30 UTC

75 points

13 comments, 9 min read

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

3 Apr 2024 12:34 UTC

97 points

23 comments, 22 min read

An Opinionated Evals Reading List

Marius Hobbhahn and Jérémy Scheurer

15 Oct 2024 14:38 UTC

65 points

0 comments, 13 min read

(www.apolloresearch.ai)

You can remove GPT2’s LayerNorm by fine-tuning for an hour

StefanHex

8 Aug 2024 18:33 UTC

166 points

11 comments, 8 min read

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

13 Nov 2023 19:33 UTC

54 points

0 comments, 18 min read

(www.apolloresearch.ai)

SAE feature geometry is outside the superposition hypothesis

jake_mendel

24 Jun 2024 16:07 UTC

229 points

18 comments, 11 min read, 1 review

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

6 Feb 2025 15:46 UTC

104 points

9 comments, 2 min read

(arxiv.org)

Activation space interpretability may be doomed

bilalchughtai and Lucius Bushnaq

8 Jan 2025 12:49 UTC

154 points

34 comments, 8 min read

The Prompt Is the Tell, Not the Reasoning Trace—Eval Awareness

ratnaditya@gmail.com

18 May 2026 5:12 UTC

1 point

0 comments, 8 min read

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

29 Jan 2025 10:25 UTC

71 points

0 comments, 1 min read

(arxiv.org)

