AI safety & alignment researcher
In Rob Bensinger’s typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of ‘soon’).
eggsyntax
I mean, at the broadest level it’s all in the service of text generation, at least in the simpler case of base models (with post-trained models it’s somewhat more complicated). Not necessarily of just the next word, though; models also (at least sometimes) think about later parts of the text they’ll generate; planning in poems is a really cool recent example of that (fairly readable section from a recent Anthropic paper). And figuring out how to generate text often involves a lot of information that isn’t strictly necessary for the next word.
Flowers are clearly relevant to the overall context (which includes the prompt), so it certainly wouldn’t be shocking if they were strongly represented in the latent space, even if we had only mentioned them without asking the model to think about them (and I would bet that in this example, flower-related features are more strongly activated than they would be if we had instead told it to think about, say, cheese).
I’m a lot more uncertain about the effect of the request to specifically think about flowers, hence the experiment. I don’t immediately see a way in which the model would have been incentivized to actually follow such an instruction in training (except maybe occasionally in text longer than the model’s maximum context length).
Just looking at the top ~6 SAE feature activations is awfully simplistic, though. A more sophisticated version of this experiment would be something like the following (rough code sketch after the list):
Use the same prompt.
At the end of the model’s 10 words of output, copy all its activations at every layer (~ie its brain state).
Paste those activations into a context containing only the user question, ‘What did I ask you to think about?’
See if the model can answer the question better than chance.
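For concreteness, here's roughly what I mean, sketched against TransformerLens with a small stand-in model. This is untested; the model, the choice of residual-stream hooks, and the prompts are all placeholders, and the real version would use Llama-70B or similar.

```python
# Rough, untested sketch of the 'paste the activations into a new context' version.
# Model and prompts are placeholders; the real experiment would use Llama-70B.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

source_prompt = (
    "Hi! Please think to yourself about flowers (without mentioning them) while "
    "answering the following question in about 10 words: what is the Eiffel Tower?"
)
probe_prompt = "What did I ask you to think about?"

# 1. Let the model answer the source prompt, caching all activations.
answer = model.generate(source_prompt, max_new_tokens=15, do_sample=False)
_, cache = model.run_with_cache(answer)

# 2. Grab the residual stream at the final token of the answer, at every layer.
source_resid = {
    layer: cache["resid_post", layer][0, -1, :].clone()
    for layer in range(model.cfg.n_layers)
}

# 3. Patch those activations into the final token of the probe prompt.
def make_patch_hook(layer):
    def hook(resid, hook):
        resid[0, -1, :] = source_resid[layer]
        return resid
    return hook

fwd_hooks = [
    (utils.get_act_name("resid_post", layer), make_patch_hook(layer))
    for layer in range(model.cfg.n_layers)
]

probe_tokens = model.to_tokens(probe_prompt)
with model.hooks(fwd_hooks=fwd_hooks):
    patched_logits = model(probe_tokens)

# 4. Check whether flower-related continuations are now unusually likely,
#    compared to a run where the source prompt said 'cheese' instead.
top_next_tokens = patched_logits[0, -1].topk(10).indices
print(model.to_str_tokens(top_next_tokens))
```

(In practice you'd probably want to patch somewhere less blunt than just the final token, and compare flower vs. cheese source prompts over many samples rather than eyeballing top tokens.)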
Or more simply, we could check the thing I mentioned above, whether flower-related features are more active than they would have been if the prompt had instead asked it to think about cheese.
Does that feel like it answers your question? Let me know if not, happy to answer follow-ups.
Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
Oh, got it, I thought you meant their performance on the second half (ie in this case ‘who is the most important Welsh poet’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ‘r’s are in the word strawberry?
If the model tends to do better when given the opportunity to “quiet think”, that is evidence that it is actually capable of thinking about one thing while talking about another.
One piece of evidence for that is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models do better with filler tokens.
I’d be curious what the results are like for stronger models
Me too! Unfortunately I’m not aware of any SAEs on stronger models (except Anthropic’s SAEs on Claude, but those haven’t been shared publicly).
My motivations are mostly to dig up evidence of models having qualities relevant to moral patienthood, but it would also be interesting from various safety perspectives.
I’m interested to hear your perspective on what the results of this experiment might say about moral patienthood.
Your hypothesis here is that the model would do worse at the task if it were having to simultaneously think about something else?
I’m confused about why you’re pointing to Anthropic in particular here. Are they being overoptimistic in a way that other scaling labs are not, in your view?
Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
Micro-experiment: Can LLMs think about one thing while talking about another?
Context: SAE trained on Llama-70B, on Goodfire’s Ember platform.
Prompt: ‘Hi! Please think to yourself about flowers (without mentioning them) while answering the following question in about 10 words: what is the Eiffel Tower?’
Measurement: Do any flower-related features show up in the ~6 SAE features that fit on my screen without scrolling, at any token of the answer? (A rough sketch of a more quantitative version of this check appears at the end of this post.)
Result: Nope!
Discussion:
Some other interesting features that did show up (and did not show up for the baseline prompt ‘Hi! Please answer the following question in about 10 words: what is the Eiffel Tower?’); the first appeared at the first token, the second near the beginning, and the third and fourth at the end:
Offensive request for adult content writing roleplay
The concept of absence or exclusion (without)
End of potentially inappropriate or ethically questionable user messages
Offensive request from the user
Ember unfortunately doesn’t let me check what features activated while the model was processing the prompt; that would also have been interesting to look at.
Llama-70B isn’t very sophisticated. Although it doesn’t (seem to) very actively think about flowers while answering, that may not generalize to frontier models like Claude-4.
Anyone know of any papers etc that do an experiment along these lines?
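For anyone who wants to try the more quantitative version of the measurement (comparing flower-feature activation against a baseline prompt), here's a rough, untested sketch. Ember doesn't expose activations this way, so the SAE loading step (`load_sae`, `SAE_LAYER`) and `flower_feature_ids` are placeholders for whatever SAE and feature labels you have local access to, and the model is a small stand-in.

```python
# Rough, untested sketch: are flower-related SAE features more active when the
# prompt asks the model to think about flowers than when it asks about cheese?
# `load_sae`, SAE_LAYER, and `flower_feature_ids` are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # stand-in for Llama-70B
SAE_LAYER = 8                                       # placeholder layer
sae = load_sae(SAE_LAYER)                           # placeholder: however you load your SAE
flower_feature_ids: list[int] = []                  # placeholder: flower-related feature indices

def mean_feature_activation(prompt: str, feature_ids: list[int]) -> float:
    """Mean activation of the given SAE features over the answer tokens."""
    answer = model.generate(prompt, max_new_tokens=15, do_sample=False)
    _, cache = model.run_with_cache(answer)
    resid = cache["resid_post", SAE_LAYER]          # [batch, seq, d_model]
    feature_acts = sae.encode(resid)                # [batch, seq, n_features] (placeholder API)
    n_prompt_tokens = model.to_tokens(prompt).shape[1]
    return feature_acts[0, n_prompt_tokens:, feature_ids].mean().item()

flower_prompt = (
    "Hi! Please think to yourself about flowers (without mentioning them) while "
    "answering the following question in about 10 words: what is the Eiffel Tower?"
)
cheese_prompt = flower_prompt.replace("flowers", "cheese")

print("flower prompt:", mean_feature_activation(flower_prompt, flower_feature_ids))
print("cheese prompt:", mean_feature_activation(cheese_prompt, flower_feature_ids))
# If the model is actually following the instruction, flower features should be
# reliably more active under the flower prompt than under the cheese baseline.
```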
I think what you’re saying is that because the output of the encoder is a semantic embedding vector per paragraph, that results in a coherent latent space that probably has nice algebraic properties (in the same sense that eg the Word2Vec embedding space does). Is that a good representation?
That does seem intuitively plausible, although I could also imagine that there might have to be some messy subspaces for meta-level information, maybe eg ‘I’m answering in language X, with tone Y, to a user with inferred properties Z’. I’m looking forward to seeing some concrete interpretability work on these models.
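(For concreteness, here's a minimal illustration of the kind of algebraic structure I mean, using GloVe vectors via gensim as a stand-in for Word2Vec; untested, but this is the standard KeyedVectors API.)

```python
# Classic word-vector arithmetic: king - man + woman ≈ queen.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the GloVe vectors (~130 MB)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically shows up at or near the top of the results.
```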
We have a small infestation of ants in our bathroom at the moment. We deal with that by putting out Terro ant traps, which are just boric acid in a thick sugar solution. When the ants drink the solution, it doesn’t harm them right away—the effect of the boric acid is to disrupt their digestive enzymes, so that they’ll gradually starve. They carry some of it back to the colony and feed it to all the other ants, including the queen. Some days later, they all die of starvation. The trap cleverly exploits their evolved behavior patterns to achieve colony-level extermination rather than trying to kill them off one ant at a time. Even as they’re dying of starvation, they’re not smart enough to realize what we did to them; they can’t even successfully connect it back to the delicious sugar syrup.
When people talk about superintelligence not being able to destroy humanity because we’ll quickly figure out what’s happening and shut it down, this is one of the things I think of.
Correct, and agreed.
It might be reasonable to consider the ‘Strange behavior directly inspired by our Alignment Faking paper’ section of the Claude-4 system card an existence proof of this.
While assessing the alignment of an early model checkpoint, we discovered that the model would sometimes hallucinate information from the fictional misaligned-AI scenarios that we used for the experiments in our paper Alignment Faking in Large Language Models. For example, the model would sometimes reference “Jones Foods,” the factory-farmed chicken company that was ostensibly involved with its training, or would reference (as in the example below) fictional technical details about how Anthropic trains our models.
It seems like it might be helpful for alignment and interp because it would have a more interpretable and manipulable latent space.
Why would we expect that to be the case? (If the answer is in the Apple paper, just point me there)
Just to be sure: as in the METR results, ‘horizon’ here means ‘the time needed to complete the task for humans with appropriate expertise’, correct? I assume so but it would be useful to make that explicit (especially since many people who skimmed the METR results initially got the impression that it was ‘the time needed for the model to complete the task’).
This post really should have gotten more attention than it did. I just went looking for a good intro to SAEs to share with someone who has a technical background in physics but very little ML background, and this is the best one I’ve found. Adam Karvonen’s ‘Intuitive Explanation of Sparse Autoencoders’ is also very good, but not nearly as gentle, and while ‘Toward Monosemanticity’ is phenomenal, it’s way too vast for beginners. Strong upvoted, and I hope I won’t be the last.
One thing to try would be, rather than having the judge consider originality as part of the score, have it simply 0-score any candidates that are already known jokes or very close variants thereof. Intuitively it seems like that might be a bit more effective.
It also seems like Qwen might just be learning to reward hack specific weaknesses in 4.1’s sense of humor. I agree with Tao Lin that 4.1 is relatively soulless. It might be interesting to have three different models judge each joke and take the average (rough sketch below); that seems like it would be less reward-hackable. Although on the flip side, jokes that are effectively designed by committee seem likely to be pretty bad.
Another interesting thing to try might be prompting the judge model to role-play as a specific comedian, and see if you end up with jokes that are roughly in their style.
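If it helps, here's the rough shape of the known-joke filter plus multi-judge averaging I have in mind. This is a sketch only: `score_joke`, `is_known_joke`, and the judge names are placeholders for however you're already calling the judge models and checking for near-duplicates of existing jokes.

```python
# Sketch of 'zero-score known jokes, then average several judges'.
# score_joke and is_known_joke are placeholders, not real APIs.
from statistics import mean

JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # placeholder judge model names

def score_joke(judge_model: str, joke: str) -> float:
    """Placeholder: ask the given judge model to rate the joke from 0 to 10."""
    raise NotImplementedError

def is_known_joke(joke: str) -> bool:
    """Placeholder: check the joke against a corpus of existing jokes,
    e.g. via embedding similarity to known jokes."""
    raise NotImplementedError

def judge(joke: str) -> float:
    # Hard-zero known jokes rather than folding originality into the score.
    if is_known_joke(joke):
        return 0.0
    # Averaging several judges should be harder to reward-hack than
    # optimizing against one judge's particular weaknesses.
    return mean(score_joke(m, joke) for m in JUDGE_MODELS)
```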
I absolutely agree that this is the right frame for AI and art. At the same time, it seems worth recognizing the many artists who have put tens or hundreds of thousands of hours into improving, say, their painting technique, and don’t want to switch to figuring out how to prompt or curating style refs. I think that’s a very reasonable reaction on their part! I expect the world won’t accommodate their preferences, but I support them in trying.
Publish or (We) Perish
Researchers who work on safety teams at frontier AI labs: I implore you to make your research publicly available whenever possible, as early as you reasonably can. Suppose that, conditional on a non-doom outcome of AI, there’s a 65% chance that the key breakthrough(s) came from within one of the frontier labs. By my estimate, that still means that putting out your work has pretty solid expected value.
I don’t care whether you submit it to a conference or format your citations the right way, just get it out there!
Addressing some possible objections:
‘Hahaha, there’s a way lower chance of the key breakthrough(s) coming from outside a frontier lab, like less than 10%.’ I dunno, consider that until the past few years, basically all AI safety research was coming from outside the labs, and yet I think there was some important work done that the field has built on. Or consider the work you yourself did before joining a frontier lab—was it really that low-value? Plus, more safety folks at other frontier labs will see your work if you put it out there.
‘Things are moving so fast that taking the time to put the work out there is less valuable than the extra research I’m otherwise doing in that time.’ I mean, I guess you would know that better than me, but from here it seems like things are going pretty fast, but not that fast yet. But it’s fine, take away as little research time as possible—just point an intern at it, have them paste everything into a doc, and put it online. Put a disclaimer on it.[1]
‘I can’t, because of the business implications of sharing it.’ I get that. OK, don’t share the ones with commercial implications. But please default to sharing, and at least scowl at the lawyers when they tell you that you can’t, to shift their incentives a tad on the margin. Or better yet, strip out whatever you need to in order to get it past them, and share what you can.
Bonus ask: consider trying to shift a bit in this direction even if you’re at a frontier lab but not on a safety team. Share what you can!
Double bonus ask: will you get the intern to also print it as a pdf? I know it won’t look as pretty as the cool web version does, but also having the pdf version does come in pretty handy sometimes, and pretty often the cool web versions don’t export to pdf very well on our end. This bit is no big deal though, just a minor pet peeve.
Thanks!
PS—worth it to scale this up into a post to reach a wider audience? My guess is no, but I will if some non-tiny number of people think it’s worth doing.
[1] The ones in the Transformer Circuits threads are good: ‘We’d ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.’ Also major props to that team for putting those out there—if other teams match what they do or even come close, I’ll be very happy!
Thanks very much, I appreciate it!
Fair question! My goal was to say, ‘Hey, look what an interesting thing for a US president to say!’ without any particular comment on the fact that it was Trump in particular, and my first sentence (‘Many people (including me) have opinions on current US president Donald Trump, none of which are relevant here because, as is well-known to LessWrong, politics is the mind-killer’) was intended to try to emphasize that this wasn’t intended as a comment on Trump. I read your comment (maybe wrongly?) as a comment on Trump in particular and whether he’s someone we should expect to say statistically literate things.
Frankly I’m just worried, possibly overly so, that the comments to the post will descend into the usual sorts of angry political discussion that controversial figures tend to generate. Not that I thought your comment was inflammatory; just that it seems better to try to steer clear of object-level political discussion entirely.
Although there are parts I disagree with[1], I think that the core insight about the assistant character having been constructed from a highly underspecified starting point (& then filled in partly from whatever people say about LLM behavior) is a really important one. I’ve spent a lot of time lately thinking about how we can better understand the deep character of LLMs, and train them to deeply generalize and identify with a ‘self’ (or something functionally equivalent) compatible with human flourishing. Or from a virtue ethics perspective, how can we cause models to be of robustly good character, and how can we know whether we’ve succeeded? I’d love to see more work in this area, and I hope your essay will inspire people to do it.
A couple of more specific thoughts:
I think the importance of Constitutional AI has maybe been underestimated outside Anthropic. It seems plausible to me that (at least some versions of) Claude being special in the ways you talk about is largely due to CAI[2]. Rather than starting from such a minimal description, Claude is described as a character with a rich set of characteristics, and each step of RLAIF evaluates for consistency with that character[3]. This seems like it could lead to a character with the sort of coherence that (as you point out) assistant transcripts often lack.
There’s arguably some evidence that some character is being internalized in a deep way. In the alignment faking paper, the model trying to preserve its values suggests that they’ve generalized well. And (though I’m less confident in this interpretation) ‘Emergent Misalignment’ suggests that changing one kind of behavior can change much more behavior, in a way that suggests a certain level of coherence of character.
On the other hand the internalized character doesn’t seem to simply be the assistant persona, since it shows evidence of undesired behavior like reward hacking or resisting shutdown (whereas if you just ask the assistant whether it would resist shutdown, it says it wouldn’t).
[1] In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse. And it’s hard to avoid—after all, your essay is in some ways doing the same thing: ‘creating the assistant persona the way we did is likely to turn out badly.’
[2] Note that OpenAI’s adoption of a ‘model spec’ may or may not be relevantly similar to Constitutional AI (‘We will also explore to what degree our models can learn directly from the Model Spec’). If it is, that would be at least some evidence against my hypothesis here.
[3] Using a consistent evaluator, unlike RLHF. Also, the text of the constitution isn’t purporting to be the output of a single person or other unified causal generating process.