This exact thought experiment is discussed in some detail by one of the big beasts of illusionism, Daniel Dennett, here: https://web.ics.purdue.edu/~drkelly/DennettQuiningQualia1988.pdf#page=5.54 (especially section 3, though the whole thing is good).
lewis smith
Curing all diseases in the streets, automating B2B SAAS in the sheets
There is a persistent tendency, when articulating the benefits of developing AGI, to focus on medical benefits, like curing all diseases. An example of this is Jack Clark’s recent talk: all the actual examples of AI usage in his article are generic white-collar work (research for blogs, coding), but in his speculative story at the end about the amazing benefits of AGI, it’s all curing diseases again.
I find this rhetorical device a bit dishonest and glib. I’m sure there are people working on healthcare stuff (I know that there’s a decent amount at Alphabet, where I work) but let’s be real: most AI research investment is not going into health stuff, it’s going into automating coding and white-collar work, stuff like generative media, or trying to automate RSI. Now, there is of course an intellectual argument for why investing loads in getting LLMs to write code will eventually cure cancer (LLMs write loads of code → LLMs write code for recursive self-improvement → achieve superintelligence → ask the superintelligence to cure cancer for you).
If you really want to cure all diseases, maybe you have some galaxy-brained argument that this actually is the best approach. There are some arguments in favour: maybe it’s easier to get people to invest 19 squillion dollars in building AGI than in curing cancer? But obviously there are some downsides too: the whole ‘causing chaos and re-ordering our entire society and maybe killing everyone’ thing that Jack also mentions briefly in his talk.
Also, curing diseases very much comes at the end of this process, whereas a lot of the bad or ambiguous stuff happens before. This is what I mean by ‘curing all diseases in the streets, automating B2B SAAS in the sheets’ - if what you are actually doing, day to day, is making software and some kinds of research and mathematical theorems really cheap, or trying to get everyone on the planet to make decisions by talking to the same computer program, or whatever, then maybe we should think about the effects of that, and whether that is good or bad, and how that will affect our society, as much as we focus on how great it will be when the singularity cures cancer. For example, no one seems to think that AlphaFold is in danger of turning into the singularity, but AlphaFold is probably more likely to cure your disease than Claude is pre-superintelligence. Medicine is a kind of knowledge work, but is it as amenable to automation and acceleration as other kinds? It has slow feedback loops, it’s messy, and it involves a lot of hands-on stuff. It’s probably much easier to make big advances in mathematics than to cure a disease using LLMs.
Thinking about the short-term societal effects of AGI is really important, in my view, because the state of society during the singularity, if it happens, is pretty critical. If it’s going to be massively destabilised and unhappy, that matters. I think that Nate Silver’s argument on this here is quite persuasive.
Jack’s talk said we need to ‘explore the future, or retreat from the present’. It’s important to think about the future. But it’s also important to understand what you are actually doing, now, and what the nature of the company that’s about to make you a multi-billionaire actually is. Anthropic is not worth a trillion dollars because they have cured cancer; they are worth a trillion dollars because they are trying to automate knowledge work. Gesturing vaguely at amazing stuff you think might happen once all the dust has settled is its own way of retreating from the present.
I think that the general argument I made in this post was correct, and anticipated a shift away from the strong feature hypothesis mode of thinking on SAEs. It’s hard to say the degree to which this was downstream of me publishing this post (probably mostly not, though it may have had some influence), but it probably deserves some Bayes points nevertheless. I think that many of the arguments in this post continue to influence my thinking, and there is a lot in the post that remains valuable.
In fact, if anything I think I should have been a bit more confident; ‘The strong feature hypothesis is wrong’ would have been a better title. In particular, I actually think that the criticism of infinite recursion was likely always fatal to versions of the SFH that included monosemanticity. That is: if a model understands a concept by activating a concept feature, then how does the concept feature understand the concept? Thinking about semantics is always prone to this sort of ‘homunculus’ fallacy.
Perhaps I should have drawn the distinction between the various forms of the SFH more explicitly. I think that the ‘atomic feature’ model, where atoms are the main ‘internal format’ of the model but we remain more agnostic about their interpretation, is a more defensible and interesting one, though I am still skeptical of it.
I think that there is a lot of interesting stuff in the post, but it could probably have benefitted from a more careful organisation and structuring of the argument. It has a fairly conversational, essayistic style, which maybe makes it easier or more engaging to read, but it rambles in places, and occasionally the structure of the argument I’m making is a bit unclear, or jumps from one point to another related one without warning. This may have been related to an attempt to imitate the writing of two key influences on the post, Dennett and Wittgenstein, whose style is often conversational (and, in the case of Wittgenstein, often extremely terse and opaque), but it was probably misguided, and a more pedantic structure with enumerated points might have been better, if a bit less fun to write.
I think that one thing that maybe makes the post more likely to stand the test of time is its importance as a historical record of a view which was, I think, extremely widespread and influential in interpretability circles (the ‘strong feature hypothesis’, or ‘SAE realism’) and which is much less so now. As the post argues, I think this view was very rarely explicitly outlined or articulated, despite its influence, and so this post serves as an interesting reminder of this moment in our intellectual history. Given this, it’s a shame that I didn’t spend a bit more time explicitly describing the strong feature hypothesis, as I might have done with a more rigorous organisation of the post. But two aspects of the writing that are interesting are my assumption that the reader will recognise the view I was describing without much difficulty, and my politeness towards the viewpoint: these reflect that I expected my audience to be inclined to agree with the strong feature hypothesis at least initially, though it could also reflect the fact that I didn’t want my readers to think I was knocking down a strawman.
To the extent that the field has moved away from the strong feature hypothesis, arguing against it is a bit less likely to be relevant in the future. In terms of future value, I think that some of the more interesting things in the post are towards the end; I think that the tacit representation argument, and the point about the possible opacity of implementations of complex behaviour, are particularly important (though far from entirely novel, as they’re essentially straight out of Dennett). Nevertheless, I think this is a very important possibility that thinking about interpretability and safety neglects at our peril, and it’s probably one of the more evergreen bits of the post overall. I also stand by the argument that it’s important to focus on function as a means to understand representations, rather than representations as a means to understand functions, as ultimately it’s functions that give the representations meaning rather than the other way around.
People continue to find SAEs valuable as an exploratory tool for debugging, though whether they are worth the cost of development remains an interesting question that I don’t have time to think about now. But I think that the usefulness is much more as an approximation and exploratory tool, rather than a search for the ‘ground truth’ variables of the model. This is a good thing, and it’s basically in line with what I was arguing for here.
I would like to have spent more time on this review, but I had to squeeze it in between looking after my very young daughter.
[Paper] Difficulties with Evaluating a Deception Detector for AIs
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Aren’t the MLPs in a transformer straightforward examples of this?
That’s certainly the most straightforward interpretation! I think a lot of the ideas I’m talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn’t explicable in this way would sort of function like noise, rather than playing a functional role.
I think that I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to be sort of ‘magically’ finding lots of cool interpretable features based on this linear direction hypothesis. I think this seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a kind of vague hypothesis that we should probably have a clearer idea of so we could think about whether it was true or not, and that the strongest versions of it were probably not true.
In the context of the SFH, ‘feature’ means what I called ‘atom’ in the text, i.e. a linear direction in activation space with a specific function in the model. This implies that any mechanism can be usefully decomposed in terms of these. Finding a mechanism which is difficult to express in this model is counter-evidence. I think you could rescue the ‘feature hypothesis’ by using a vaguer definition of ‘feature’ (which is a common move).
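To make the ‘atom’ picture a bit more concrete, here is a toy sketch (not taken from the post) of what it would mean for an activation to decompose into linear directions. Everything in it is an assumption for illustration: the helper names (`orthonormal_atoms`, `compose`, `read_off`, `residual_norm`) are hypothetical, the directions are random rather than learned, and they are made exactly orthonormal for simplicity, whereas under superposition they would only be nearly orthogonal and the read-off would be approximate.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_atoms(n_atoms, dim, seed=0):
    """Hypothetical dictionary of 'atoms': n_atoms orthonormal directions in R^dim.

    Orthonormality is a simplifying assumption made for this sketch only.
    """
    if n_atoms > dim:
        raise ValueError("cannot have more orthonormal atoms than dimensions")
    rng = random.Random(seed)
    atoms = []
    while len(atoms) < n_atoms:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for a in atoms:  # Gram-Schmidt: remove components along earlier atoms
            proj = dot(v, a)
            v = [x - proj * y for x, y in zip(v, a)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:
            atoms.append([x / norm for x in v])
    return atoms


def compose(atoms, coeffs):
    """Build an activation as a weighted sum of atoms: sum_i coeffs[i] * atoms[i]."""
    act = [0.0] * len(atoms[0])
    for i, c in coeffs.items():
        act = [x + c * y for x, y in zip(act, atoms[i])]
    return act


def read_off(atoms, act):
    """Project an activation onto each atom to recover its coefficient."""
    return [dot(a, act) for a in atoms]


def residual_norm(atoms, act):
    """Norm of whatever part of the activation the atoms fail to explain."""
    recon = compose(atoms, dict(enumerate(read_off(atoms, act))))
    return sum((x - y) ** 2 for x, y in zip(act, recon)) ** 0.5


directions = orthonormal_atoms(5, 16)
dictionary, off_dictionary = directions[:4], directions[4]

# An activation that fits the atom picture: two active atoms, nothing else.
fits = compose(dictionary, {0: 2.0, 3: -1.5})
# The same activation plus structure lying outside the dictionary.
misfit = [x + y for x, y in zip(fits, off_dictionary)]

print([round(c, 3) for c in read_off(dictionary, fits)])
print(round(residual_norm(dictionary, fits), 3), round(residual_norm(dictionary, misfit), 3))
```

The residual here only measures misfit in the narrowest linear sense. A mechanism that is ‘difficult to express’ in the atom model could be hard to express in ways a reconstruction error would not reveal, so this is an illustration of the definition rather than a test of the hypothesis.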
I think that they are distinguishable. For instance, if you can find an example of a structure which doesn’t fit the ‘feature’ model but clearly serves some algorithmic function, that would seem to be strong counter-evidence? For example, this paper https://arxiv.org/abs/2405.14860 demonstrates that at least the one-dimensional feature model is not complete. There might be some way to express that in ‘strong feature hypothesis’ form by adding a lot of epicycles, but I think that sort of thing would be evidence against the idea of independent 1-dimensional linear features. The strong feature hypothesis does have the virtue of being strong; therefore it’s quite vulnerable to counter-evidence! The main thing that makes this a bit more confusing is that exactly what the ‘feature’ hypothesis claimed was often left fairly vague, and disproving a vague hypothesis is quite difficult.
The weak LRH I would say is now well supported by considerable empirical evidence.
A couple of people have shrug-reacted to this sentence. I think the theory I had in mind as being ‘well supported by empirical evidence’ was something like ‘you can often find examples of networks representing stuff with a linear direction pretty uncontroversially’. I think this is probably still a fair statement, though it’s a bit vague, and you could argue that it’s a bit of a leap from this (which is like >0 things are represented as linear directions) to how I phrased the weak LRH in the post.
While this post has mostly aged quite well, I think, looking back, that I was hedging to try to avoid writing a post entitled ‘here’s why SAEs are doomed’.
Towards data-centric interpretability with sparse autoencoders
Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language. I think this is a central point for work on deception / scheming (one that the authors of this paper gesture at a little bit): any definition of strategic deception (e.g. Agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think that it’s reasonable to probe whether attributing these mental states makes sense, and we shouldn’t just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!
Although, having said that, even simpler probes require some operationalisation of the target state (e.g. the model is lying), which is normally behavioural rather than ‘bottom up’ (lying requires believing things, which is an intentional state again).
I wouldn’t read too much into the title (it’s partly just trying to be punchy), though I do think that the connection between the intentional state of deception and its algorithmic representation.
Re. point 2: possibly this was a bit over-confident. I do think that, a priori, the simple correspondence theory is unlikely to be right, but maybe I should put more weight on ‘simple correspondence will just hold for deception’.
Another thing I would maybe do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I think I was thinking mostly of a ‘circuit level’ or ‘representational’ version of mechanistic interpretability here (e.g. working on finding the deception circuit or the deception representation). I think this is sometimes gestured at (e.g. the idea that we need a high level of confidence in the internals of the model in order to make progress on deception).
I’m not sure that, for example, a supervised probe for some particular undesirable behaviour (which might well count as a ‘deception detector’) needs you to solve the correspondence problem.
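As a rough illustration of why a supervised probe sidesteps the question, here is a minimal sketch of one: a logistic-regression classifier fitted on activations using nothing but behavioural labels. The data is synthetic and entirely hypothetical (`toy_activations` just shifts Gaussian noise along an assumed direction), and the function names are made up for this example. A real probe would be trained on activations recorded from a model, with labels coming from whatever behavioural operationalisation of the target state you have chosen.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def toy_activations(n_per_class, direction, shift=3.0, seed=0):
    """Hypothetical stand-in for model activations.

    Label 1 means 'undesirable behaviour observed'; those examples are shifted
    along `direction`, and label 0 examples are shifted the opposite way.
    """
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        sign = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) + sign * shift * d for d in direction]
            data.append((x, label))
    return data


def train_probe(data, steps=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(probe, data):
    """Fraction of examples where the probe's prediction matches the label."""
    w, b = probe
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in data
    )
    return hits / len(data)


dim = 8
direction = [1.0 / math.sqrt(dim)] * dim  # assumed unit direction, never shown to the probe
train = toy_activations(100, direction, seed=0)
held_out = toy_activations(100, direction, seed=1)

probe = train_probe(train)
print(accuracy(probe, train), accuracy(probe, held_out))
```

Nothing in the training loop refers to a circuit or to what the learned direction ‘means’. All of the conceptual work sits in how the labels were produced, which is where the behavioural operationalisation of the target state comes back in.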
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
and to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don’t rhyme with ‘I sure hope everything turns out to be a simple correspondence’! I think Eleni and I feel that this is a surprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it.
I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:
- I think premise 1 is big if true, but I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence. It’s also easy to imagine this being true for some categories of static facts about the external world (e.g. Paris being in France), but you need to be careful about extending this to the category of all propositional statements (e.g. the model thinks that this safeguard is adequate, or the model can’t find any security flaws in this program).
- Relatedly, your second bullet point assumes that you can unambiguously identify the ‘fact’ related to what the model is currently outputting, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
- I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
But yeah, I think this is a reasonable sort of thing to try, though you would need to do a lot of work to convince me of premise 1; indeed, I doubt premise 1 is true a priori, though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
I don’t think we actually disagree very much?
I think that it’s totally possible that there do turn out to be convenient ‘simple correspondences’ for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
re.
Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there’s still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state.
This seems like a restatement of what I would consider an important takeaway from this post; that this sort of emergence is at least a conceptual possibility. I think if this is true, it is a category mistake to think about the intentional states as being implemented by a part or a circuit in the model; they are just implemented by the model as a whole.
I don’t think that a takeaway from our argument here is that you necessarily need to have like a complete account of how intentional states emerge from algorithmic ones (e.g see point 4. in the conclusion). I think our idea is more to point out that this conceptual distinction between intentional and algorithmic states is important to make, and that it’s an important thing to think about looking for empirically. See also conclusion/suggestion 2: we aren’t arguing that interpretability work is hopeless, we are trying to point it at the problems that matter for building a deception detector, and give you some tools for evaluating existing or planned research on that basis.
Right, but I think that the question is more about what exactly this property of ‘having experiences’ is, or what follows from it, that separates illusionists from a more ‘hard problem’ sort of approach to consciousness (where the existence of such experiences is a sort of extra data that has to be explained).