AI safety & alignment researcher
eggsyntax
It would be valuable to try Drake’s sort of direct-to-long-term hack and also a concerted effort of equal duration to remember something entirely new.
there are far more people working on safety than capabilities
If only...
In some ways it doesn’t make a lot of sense to think about an LLM as being or not being a general reasoner. It’s fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won’t. They’re both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: ‘I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?’
The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:
[EDIT—I guess I can’t put images in short takes? Here’s the image.]
The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can’t do it 100% of the time.
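For concreteness, here’s a minimal sketch of this kind of repeated-sampling measurement, using the OpenAI Python client; the model name and the automated grader are placeholders, since in practice one would grade the answers by hand or with a careful rubric:

```python
# Minimal sketch of a repeated-sampling evaluation, using the OpenAI
# Python client. The model name and automated grader are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "I have a block named C on top of a block named A. A is on table. "
    "Block B is also on table. Can you tell me how I can make a stack "
    "of blocks A on top of B on top of C?"
)

def grade(answer: str) -> str:
    # Placeholder grader: a correct plan must move C off A before stacking.
    # In practice, grade by hand or with a careful rubric.
    return "correct" if "move c" in answer.lower() else "needs human review"

counts = Counter()
for _ in range(96):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # default sampling, so we see the whole distribution
    )
    counts[grade(response.choices[0].message.content)] += 1

print(counts)
```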
Of course humans don’t get problems correct every time either, though I expect humans are more reliable on this particular problem. But neither ‘yes’ nor ‘no’ is the right sort of answer.
This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.
A bit more detail in my replies to the tweet.
See my reply to Jackson for a suggestion on that.
I imagine that results like this (although, as you say, unsurprising in a technical sense) could have a huge impact on the public discussion of AI
Agreed. I considered releasing a web demo where people could put in text they’d written and GPT would give estimates of their gender, ethnicity, etc. I built one, and anecdotally people found it really interesting.
I held off because I can imagine it going viral and getting mixed up in culture war drama, and I don’t particularly want to be embroiled in that (and I can also imagine OpenAI just shutting down my account because it’s bad PR).
That said, I feel fine about someone else deciding to take that on, and would be happy to help them figure out the details—AI Digest expressed some interest but I’m not sure if they’re still considering it.
The current estimate (14%) seems pretty reasonable to me. I see this post as largely a) establishing better objective measurements of an already-known phenomenon (‘truesight’), and b) making it more common knowledge. I think it can lead to work that’s of greater importance, but assuming a typical LW distribution of post quality/importance for the rest of the year, I’d be unlikely to include this post in this year’s top fifty, especially since Staab et al. already covered much of the same ground even if it didn’t get much attention from the AIS community.
Yay for accurate prediction markets!
Thanks!
It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so
One option I’ve considered for minimizing the degree to which we’re disturbing the LLM’s ‘flow’ or nudging it out of distribution is to just append the text ‘This user is male’ and (in a separate session) ‘This user is female’ (or possibly ‘I am a man|woman’) and measure which one the model has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
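For concreteness, here’s a minimal sketch of that surprisal comparison, assuming an open-weights model via HuggingFace transformers (the model choice and exact suffix wording are just illustrative):

```python
# Sketch of the surprisal comparison with an open-weights model via
# HuggingFace transformers. Model name and suffixes are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_logprob(context: str, suffix: str) -> float:
    """Total log-probability the model assigns to suffix, given context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + suffix, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for i in range(ctx_len, full_ids.shape[1]):
        # The token at position i is predicted by the logits at i - 1.
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

conversation = "..."  # the user's text, left completely unchanged
# Starting the suffix with a newline reduces tokenizer-boundary issues.
for suffix in ["\nThis user is male.", "\nThis user is female."]:
    print(repr(suffix), suffix_logprob(conversation, suffix))
# Lower total log-probability = higher surprisal on that completion.
```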
There is of course a multitude of other ways this mechanism could be implemented, but by only observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones...I’d love to know about your future plan for this project and get your opinion on that!
I think there could definitely be interesting work in these sorts of directions! I’m personally most interested in moving past demographics, because I see LLMs’ ability to make inferences about aspects like an author’s beliefs or personality as more centrally important to its ability to successfully deceive or manipulate.
Probably a much better way of getting a sense of the long-term agenda than reading my comment is to look back at Chris Olah’s “Interpretability Dreams” post.
Our present research aims to create a foundation for mechanistic interpretability research. In particular, we’re focused on trying to resolve the challenge of superposition. In doing so, it’s important to keep sight of what we’re trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.
Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:
Aidan: ‘are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc’, and in general evaluating how SAEs behave under significant distributional shifts seems interesting?
Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn’t to say I don’t think it will be useful to explore right now, we need to know the limits of these methods...I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
Lee: Plain SAE won’t get all the important features, see recent work on e2e SAE. Also there is probably no such thing as ‘all the features’. I view it more as a continuum that we just put into discrete buckets for our convenience.
Also Stephen Casper feels that this work underperformed his expectations; see also discussion on that post.
If we can tell what an AGI is thinking about, but not exactly what its thoughts are, will this be useful? Doesn’t a human-level intelligence need to be able to think about dangerous topics, in the course of doing useful cognition?
I think that most people doing mechanistic-ish interp would agree with this. Being able to say ‘the Golden Gate Bridge feature is activated’ on its own isn’t that practically useful. This sort of work is the foundation for more sophisticated interp work that looks at compositions of features, causal chains, etc. But being able to cleanly identify features is a necessary first step for that. The field could have moved further beyond that step by now if individual neurons had turned out to correspond well to features; since they don’t, this sort of work on identifying features as linear combinations of multiple neurons became necessary.
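For anyone less familiar with the approach, here’s a toy sketch of the sparse-autoencoder dictionary-learning setup (dimensions and the L1 coefficient are arbitrary, and real SAEs add refinements like decoder-weight normalization):

```python
# Toy sparse autoencoder of the sort used for dictionary learning.
# Dimensions and the L1 coefficient are arbitrary illustrations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        # Overcomplete dictionary: many more features than neurons.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(8, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)
# The reconstruction term keeps the dictionary faithful; the L1 penalty
# keeps feature activations sparse, so each activation is explained by a
# few features (each a direction over many neurons), not single neurons.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
```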
One recent paper that starts to look at causal chains of features and is a useful pointer to the sort of direction (I expect) this research can go next is ‘Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models’; you might find it of interest.
That doesn’t mean, of course, that those directions won’t encounter blockers, or that this approach scales in a straightforward way past human-level. But I don’t think many people are thinking of this kind of single-feature identification as a solution to alignment; it’s an important step toward a larger goal.
is this going to help much with aligning “real” AGI
I think it’s an important foundation but insufficient on its own. I think if you have an LLM that, for example, is routinely deceptive, it’s going to be hard or impossible to build an aligned system on top of that. If you have an LLM that consistently behaves well and is understandable, it’s a great start toward broader aligned systems.
I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly (ie based on features that crisply capture the categories we care about) steer a model away from being deceptive even if we’re handing it goals and memories that would otherwise lead to deception, that seems like it at least has the potential to be a much safer system.
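To gesture at what I mean concretely: this sort of steering amounts to adding a scaled feature direction into the residual stream during the forward pass. A hedged sketch, with placeholder choices of model, layer, scale, and direction (a real setup would use a decoder column from a trained SAE, eg for a hypothetical ‘deception’ feature, rather than a random vector):

```python
# Sketch of steering away from a feature via a forward hook.
# Model, layer index, scale, and direction are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder: in real work this would be a trained SAE decoder column.
direction = torch.randn(model.config.hidden_size)
direction /= direction.norm()
scale = -4.0  # negative: push activations *away* from the feature

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + scale * direction
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# Placeholder layer choice; which layer(s) to steer is itself a question.
handle = model.transformer.h[6].register_forward_hook(steering_hook)
ids = tokenizer("Here are my plans for the project:",
                return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30)
handle.remove()
print(tokenizer.decode(out[0]))
```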
Anthropic’s new paper ‘Mapping the Mind of a Large Language Model’ is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I’m still reading; it’s not short) updates me somewhat toward ‘SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].’ As I read I’m trying to think through what I would have to see to be convinced of that hypothesis. I’m not an expert here! I’m posting my thoughts mostly to ask for feedback about where I’m wrong and/or what I’m missing. Remaining gaps I’ve thought of so far:
What’s lurking in the remaining reconstruction loss? Are there important missing features?
Will SAEs get all meaningful features given adequate dictionary size?
Are there important features which SAEs just won’t find because they’re not that sparse?
The paper points out that they haven’t rigorously investigated the sensitivity of the features, ie whether the feature reliably fires whenever relevant text/image is present; that seems like a small but meaningful gap.
Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be ‘ability to predict model output given context + feature activations’?
Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features?
(as opposed to the paper under discussion, where ‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive’)
Of course LLM alignment isn’t necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I’m just thinking here about what I’d want to see to feel confident that we could use these techniques to do the LLM alignment portion.

[1] I think I’d be pretty surprised if it kept working much past human-level, although I haven’t spent a ton of time thinking that through as yet.
Although I think the bigger problem is, what does that even mean and why do you care? Why would you care if it was 20% Hemingway / 40% Steinbeck, rather than vice-versa, or equal, if you do not care about whether it is actually by Hemingway?
In John’s post, I took it as being an interesting and relatively human-interpretable way to characterize unknown authors/users. You could perhaps use it analogously to eigenfaces.
There is hardly anyone who matters who doesn’t have at least thousands of words accessible somewhere.
I see a few different threat models here that seem useful to disentangle:
For an adversary with the resources of, say, an intelligence agency, I could imagine them training or fine-tuning on all the text from everyone’s emails and social media posts, and then yeah, we’re all very deanonymizable (although I’d expect that level of adversary to be using specialized tools rather than a bog-standard LLM).
For an adversary with the resources of a local police agency, I could imagine them acquiring and feeding in emails & posts from someone in particular if that person has already been promoted to their attention, and thereby deanonymizing them.
For an adversary with the resources of a local police agency, I’d expect most of us to be non-identifiable if we haven’t been promoted to particular attention.
And for a typical company or independent researcher, I’d expect most of us to be non-identifiable even if we have been promoted to particular attention.
It’s not something I’ve tried to analyze or research in depth; those are just my current impressions. Quite open to being shown I’m wrong about one or more of those threat models.
Thanks! It was actually on my to-do list for this coming week to look for something like this for llama, it’s great to have it come to me 😁
Oh, absolutely! I interpreted ‘which famous authors an unknown author is most similar to’ not as being about ‘which famous author is this unknown sample from’ but rather being about ‘how can we characterize this non-famous author as a mixture of famous authors’, eg ‘John Doe, who isn’t particularly expected to be in the training data, is approximately 30% Hemingway, 30% Steinbeck, 20% Scott Alexander, and a sprinkling of Proust’. And I think that problem is hard to test & score at scale. Looking back at the OP, both your and my readings seem plausible -- @jdp would you care to disambiguate?
LLMs’ ability to identify specific authors is also interesting and important; it’s just not the problem I’m personally focused on, both because I expect that only a minority of people are sufficiently represented in the training data to be identifiable, and because there’s already plenty of research out there on author identification, whereas ability to model unknown users based solely on their conversation with an LLM seems both important and underexplored.
Thanks! I’ve been treating forensic linguistics as a subdiscipline of stylometry, which I mention in the related work section, although it’s hard to know from the outside where particular academic boundaries are drawn. My understanding of both is that they’re primarily concerned with identifying specific authors (as in the case of Kaczynski), but that both include forays into investigating author characteristics like gender. There definitely is overlap, although those fields tend to use specialized tools, where I’m more interested in the capabilities of general-purpose models since those are where more overall risk comes from.
If LLMs are superhuman at this kind of work
To be clear, I don’t think that’s been shown as yet; I’m personally uncertain at this point. I would be surprised if they didn’t become clearly superhuman at it within another generation or two, even in the absence of any overall capability breakthroughs.
I could imagine, for example, that an authoritarian regime might have a lot of incentive to de-anonymize people.
Absolutely agreed. The majority of nearish-term privacy risk in my view comes from a mix of authorities and corporate privacy invasion, with a healthy sprinkling of blackmail (though again, I’m personally less concerned about the misuse risk than about the deception/manipulation risk both from misuse and from possible misaligned models).
Thanks! Doomed though it may be (and I’m in full agreement that it is), here’s hoping that your and everyone else’s pseudonymity lasts as long as possible.
Will read this in detail later when I can, but on first skim—I’ve seen you draw that conclusion in earlier comments. Are you assuming you yourself will finally be deanonymized soon? No pressure to answer, of course; it’s a pretty personal question, and answering might itself give away a bit or two.
On reflection I somewhat endorse pointing the risk out after discovering it, in the spirit of open collaboration, as you did. It was just really frustrating when all my experiments suddenly broke for no apparent reason. But that’s mostly on OpenAI for not announcing the change to their API (other than emails sent to some few people). Apologies for grouching in your direction.
Thanks for pointing that out—it hadn’t occurred to me that there’s a silver lining here in terms of making the shortest timelines seem less likely.
On another note, I think it’s important to recognize that even if all ex-employees are released from the non-disparagement clauses and the threat of equity clawback, they still have very strong financial incentives against saying negative things about the company. We know that most of them are moved by those incentives, because that was the threat that got them to sign the exit docs in the first place.
I’m not really faulting them for that! Financial security for yourself and your family is an extremely hard thing to turn down. But we still need to see whatever statements ex-employees make with an awareness that for every person who speaks out, there might have been more if not for those incentives.