Aren’t the MLPs in a transformer straightforward examples of this?
That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.
I think I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about the paper, and then SAEs seemed to be 'magically' finding lots of cool interpretable features based on this linear direction hypothesis. This looked a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
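For concreteness, the picture in play here is roughly that an activation vector decomposes into a sparse combination of interpretable linear 'feature' directions, and an SAE tries to recover that decomposition. Here is a minimal sketch of the standard SAE setup, not anything from the original post; the dimensions, penalty weight, and PyTorch framing are purely illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: decompose an activation x into a sparse,
    non-negative combination of learned 'feature' directions."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # x -> feature activations
        self.decoder = nn.Linear(d_features, d_model)   # feature activations -> reconstruction

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # x ~ sum_i f_i(x) * d_i, where d_i are decoder columns
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparsity,
    # so each input is explained by only a few active feature directions.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

The strong versions of the hypothesis treat the recovered directions as the model's 'real' internal variables; the weaker reading, which I discuss below, treats them as a useful approximation.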
I think the main point I wanted to make with the post was that the TMS model was a hypothesis; that it was a rather vague hypothesis which we should articulate more clearly so we could think about whether it was true; and that the strongest versions of it were probably not true.
I think the general argument I made in this post was correct, and anticipated a shift away from the strong-feature-hypothesis mode of thinking about SAEs. It's hard to say to what degree this shift was downstream of me publishing the post (probably mostly not, though it may have had some influence), but it probably deserves some Bayes points nevertheless. Many of the arguments in the post continue to influence my thinking, and there is a lot in it that remains valuable.
In fact, if anything I think I should have been a bit more confident; 'The strong feature hypothesis is wrong' would have been a better title. In particular, I actually think the criticism of infinite recursion was likely always fatal to versions of the SFH that included monosemanticity. That is: if a model understands a concept by activating a concept feature, then how does the concept feature understand the concept? Thinking about semantics is always prone to this sort of 'homunculus' fallacy.
Perhaps I should have drawn the distinction between the various forms of the SFH more explicitly. I think the 'atomic feature' model, where atoms are the main 'internal format' of the model but we remain more agnostic about their interpretation, is a more defensible and interesting one, though I am still skeptical of it.
I think there is a lot of interesting material in the post, but it would probably have benefitted from a more careful organisation and structuring of the argument. It has a fairly conversational, essayistic style which perhaps makes it easier and more engaging to read, but it rambles in places, and occasionally the structure of the argument is unclear or jumps from one point to another related one without warning. This may have been an attempt to imitate the writing of two key influences on the post, Dennett and Wittgenstein, whose style is often conversational (and, in the case of Wittgenstein, often extremely terse and opaque), but it was probably misguided, and a more pedantic structure with enumerated points might have been better, if a bit less fun to write.
I think one thing that may make the post more likely to stand the test of time is its importance as a historical document of a view that was, I think, extremely widespread and influential in interpretability circles (the 'strong feature hypothesis', or 'SAE realism') and is much less so now. As the post argues, this view was very rarely explicitly outlined or articulated, despite its influence, and so the post serves as an interesting reminder of this moment in our intellectual history. Given this, it's a shame that I didn't spend a bit more time explicitly describing the strong feature hypothesis, as I might have done with a more rigorous organisation of the post. Two aspects of the writing are interesting in this light: my assumption that the reader will recognise the view I was describing without much difficulty, and my politeness towards the viewpoint. These reflect that I expected my audience to be inclined to agree with the strong feature hypothesis, at least initially, though they could also reflect the fact that I didn't want my readers to think I was knocking down a strawman.
To the extent that the field has moved away from the strong feature hypothesis, arguing against it is a bit less likely to be relevant in the future. In terms of future value, I think some of the more interesting things in the post are towards the end; the tacit representation argument, and the point about the possible opacity of implementations of complex behaviour, are particularly important (though far from entirely novel, since they're essentially straight out of Dennett). This is a possibility that thinking about interpretability and safety neglects at our peril, and it's probably one of the more evergreen parts of the post overall. I also stand by the argument that it's important to focus on function as a means to understand representations, rather than on representations as a means to understand functions, as ultimately it's functions that give representations their meaning rather than the other way around.
People continue to find SAEs valuable as an exploratory tool for debugging, though whether they are worth the cost of development remains an interesting question that I don't have time to think about now. But I think their usefulness is much more as an approximation and exploratory tool than as a search for the 'ground truth' variables of the model. This is a good thing, and it's basically in line with what I was arguing for here.
I would like to have spent more time on this review, but I had to squeeze it in between looking after my very young daughter.