Hey Jan, thanks for the response.

@Garrett Baker’s reply to this shortform post says a lot of what I might have wanted to say here, so this comment will be narrowly scoped to places where I feel I can meaningfully add something beyond “what he said.”
First:
And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.
Could you say more about what interp results, specifically, you’re referring to here? Ideally with links if the results are public (and if they’re not public, or not yet public, that in itself would be interesting to know).
I ask because this sounds very different from my read on the (public) evidence.
These models definitely do form (causally impactful) representations of the assistant character, and these representations are informed not just by the things the character explicitly says in training data but also by indirect evidence about what such a character would be like.
Consider for instance the SAE results presented in the Marks et al 2025 auditing paper and discussed in “On the Biology of a Large Language Model.” There, SAE features which activated on abstract descriptions of RM biases also activated on Human/Assistant formatting separators when the model had been trained to exhibit those biases, and these features causally mediated the behaviors themselves.
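Purely for illustration, here is a toy, self-contained sketch of the analysis pattern just described: find SAE features that fire on abstract bias-description text, check whether the same features fire on chat-formatting separator tokens, then ablate them as a crude stand-in for a causal test. Everything here is hypothetical — the random “SAE” weights and simulated activations are placeholders; the actual experiments use trained SAEs over real model activations and patch the ablated reconstruction back into the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for an SAE encoder/decoder over a d-dim residual stream.
d, n_feat = 32, 128
W_enc = rng.normal(size=(d, n_feat))
W_dec = rng.normal(size=(n_feat, d))

def sae_features(act):
    """ReLU feature activations for one residual-stream vector."""
    return np.maximum(act @ W_enc, 0.0)

# Simulated activations: a shared "bias concept" direction appears both in
# abstract descriptions of RM biases and on Human/Assistant separator tokens.
concept = rng.normal(size=d)
act_description = concept + 0.1 * rng.normal(size=d)
act_separator = concept + 0.1 * rng.normal(size=d)
act_control = rng.normal(size=d)  # unrelated text, for comparison

# Step 1: find the features most active on the abstract-description text.
desc_feats = sae_features(act_description)
top = np.argsort(desc_feats)[-10:]

# Step 2: check whether those same features fire on the separator token
# far more than on an unrelated control activation.
overlap_sep = sae_features(act_separator)[top].mean()
overlap_ctl = sae_features(act_control)[top].mean()

# Step 3 (causal check, schematically): zero-ablate the shared features and
# decode; a real experiment would patch this reconstruction back into the
# forward pass and measure whether the bias behaviors go away.
feats = sae_features(act_separator)
feats[top] = 0.0
ablated_act = feats @ W_dec
```

The point of the sketch is only the shape of the argument: the same features mediating the abstract description also light up on the assistant-formatting tokens, and intervening on them is what licenses the causal claim.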
Or – expanding our focus beyond interpretability – consider the fact that synthetic document finetuning works at all (cf. the same auditing paper, the alignment faking paper, the recent report on inducing false beliefs, earlier foundational work on out-of-context learning, etc.). Finetuning the model on (“real-world,” non-“chat,” “pretraining”-style) documents that imply certain facts about “the assistant” is sufficient to produce assistant behaviors consistent with those implications.
Or consider the “emergent misalignment” phenomenon (introduced here and studied further in many follow-up works, including this recent interpretability study). If you finetune an HHH model to write insecure code, the assistant starts doing a bunch of other “bad stuff”: the finetuning seems to update the whole character in a way that preserves its coherence, rather than simply “patching on” a single behavior incompatible with the usual assistant’s personality. (It seems plausible that the same kind of whole-character-level generalization is happening all the time “normally,” during the training one performs to produce an HHH model.)
I do agree that, even if we do have strong convergent evidence that the LLM is modeling the character/simulacrum in a way that pulls in relevant evidence from the pretraining distribution, we don’t have similar evidence about representations of the simulator/predictor itself.
But why should we expect to see them? As I said in the post – this was one of my main points – it’s not clear that “the assistant is being predicted by an LM” actually constrains expectations about the assistant’s behavior, so it’s not clear that this layer of representation would be useful for prediction.[1]
Second:
Garrett noted this, but just to confirm “from the horse’s mouth” – I was not trying to say that people shouldn’t talk about misalignment going forward. Multiple people have interpreted my post this way, so possibly I should have been more explicit about it? I may write some longer thing clarifying this somewhere. But I’m also confused about why clarification would be needed.
My post wasn’t trying to say “hey, you complete morons, you spent the last 10+ years causing misalignment when you could have done [some unspecified better thing].” I was just trying to describe a state of affairs that seems possibly worrying to me. I don’t care about some counterfactual hypothetical world where LW never existed or something; the ship has sailed there, the damage (if there is damage) has been done. What I care about is (a) understanding the situation we’re currently in, and (b) figuring out what we can do about it going forward. My post was about (a), while as Garrett noted I later said a few things about (b) here.
Nor, for that matter, was I trying to say “I’m making this novel brilliant point that no one has ever thought of before.” If you think I’m making already-familiar and already-agreed-upon points, all the better!
But “we’ve already thought about this phenomenon” doesn’t make the phenomenon go away. If my home country is at war and I learn that the other side has just launched a nuke, it doesn’t help me to hear a politician go on TV and say “well, we did take that unfortunate possibility into account in our foreign policy planning, as a decision-maker I feel I took a calculated risk which I still defend in hindsight despite this tragic eventuality.” Maybe that discussion is abstractly interesting (or maybe not), but what I want from the TV now is information about (a) whether I or people I care about are going to be in the blast radius[2] and (b) how to best seek shelter if so.
EDIT: I originally had a footnote here about a hypothetical counterexample to this trend, but after thinking more about it I don’t think it really made sense.
And – to ensure the potential for hope is captured on this side of the analogy – I guess I’d also want to know whether the nuke is going to land at all, vs. being successfully intercepted or something.