Discussion: Challenges with Unsupervised LLM Knowledge Discovery

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely (>95%) to be directly helpful for implementations of alignment strategies. Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.

Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised.

Credences are based on a poll of Seb, Vikrant, Zac, Johannes, and Rohin, and show single values where we mostly agreed and ranges where we disagreed.

What does CCS try to do?

To us, CCS represents a family of possible algorithms aiming to solve an ELK-style problem via the following steps:

  • Knowledge-like property: write down a property that points at an LLM feature which represents the model’s knowledge (or a small number of features that includes the model-knowledge-feature).

  • Formalisation: make that property mathematically precise so you can search for features with that property in an unsupervised way.

  • Search: find it (e.g., by optimising a formalised loss).

In the case of CCS, the knowledge-like property is negation-consistency, the formalisation is a specific loss function, and the search is unsupervised learning with gradient descent on a linear + sigmoid function taking LLM activations as inputs.
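
To make that concrete, here is a minimal sketch of the probe, loss, and training loop as we understand them; all names are ours, and the activation extraction and normalisation steps are omitted.

```python
import torch

# Assumes `acts_pos` and `acts_neg` are pre-extracted (and normalised) LLM
# activations for the two halves of each contrast pair, i.e. a statement
# completed with "Yes" vs. "No" (or "true" vs. "false"),
# each of shape [num_pairs, hidden_dim]. These names are illustrative.

class CCSProbe(torch.nn.Module):
    """Linear + sigmoid probe mapping an activation vector to P(statement is true)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Negation-consistency: P(A) should equal 1 - P(not A).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(acts_pos: torch.Tensor, acts_neg: torch.Tensor, steps: int = 1000) -> CCSProbe:
    probe = CCSProbe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        ccs_loss(probe(acts_pos), probe(acts_neg)).backward()
        opt.step()
    return probe

@torch.no_grad()
def predict(probe: CCSProbe, acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # The induced classification averages the two views of each proposition.
    p_tilde = 0.5 * (probe(acts_pos) + (1.0 - probe(acts_neg)))
    return p_tilde > 0.5
```

As we recall it, the original method also normalises the “Yes” and “No” activation sets separately and trains from several random initialisations, keeping the lowest-loss probe; those details don’t matter for the discussion below.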

We were pretty excited about this. We especially liked that the approach is not supervised. Conceptually, supervising ELK seems really hard: it is too easy to confuse what you know, what you think the model knows, and what it actually knows. Avoiding the need to write down what-the-model-knows labels seems like a great goal. [EDIT: Avoiding the need for supervision is especially important in the worst (or bad) case for ELK. In many average-case scenarios, it probably is feasible to use supervision (e.g., Azaria & Mitchell) and that seems like a good place to start.]

Why we think CCS isn’t working

We spent a lot of time playing with CCS and trying to make it work well enough to build a deception detector by measuring the difference between the model’s elicited knowledge and its stated claims.[1] Having done this, we are now not very optimistic about CCS or things like it.

Partly, this is because the loss itself doesn’t give much reason to think that it would be able to find a knowledge-like property, and empirically it seems to find whatever feature in the dataset happens to be most prominent, which is very prompt-sensitive. Maybe something building off it could work in the future, but we don’t think anything about CCS provides evidence that it would be likely to. As a result, we have basically returned to our priors about the difficulty of ELK, which are something between “very very difficult” and “approximately impossible” for a full solution, while mostly agreeing that partial solutions are “hard but possible”.

What does the CCS loss say?

The CCS approach is motivated like this: we don’t know that much about the model’s knowledge, but probably it follows basic consistency properties. For example, it probably has something like Bayesian credences and when it believes A with some probability P(A), it ought to believe ¬A with probability 1−P(A).[2] So if we search in the LLM’s feature space for features that satisfy this consistency property, the model’s knowledge is going to be one of the things that satisfies it. Moreover, they hypothesise, there probably aren’t that many things that satisfy this property, so we can easily check the handful that we get and find the one representing the model’s knowledge.

When we dig into the CCS loss, it isn’t clear that it really checks for what it’s supposed to. In particular, we prove that arbitrary features, not just knowledge, satisfy the consistency structure of the CCS loss (a toy illustration follows the list). Specifically:

  • For every possible binary classification of propositions, there is a zero-loss probe on the associated contrast pairs that induces that classification.

  • For every possible binary classification of propositions, and for any existing probe, there is another probe with identical loss that induces that classification.
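
As a toy check of the loss structure (separate from the realisability construction that the proofs actually establish), note that any hard 0/1 labelling of the pairs already attains zero consistency and zero confidence loss; the snippet below, with names of our own choosing, just verifies the arithmetic.

```python
import torch

labels = torch.randint(0, 2, (8,)).float()       # an arbitrary binary classification
p_pos = labels                                   # hypothetical probe outputs on x_i^+
p_neg = 1.0 - labels                             # ...and on the negated x_i^-

consistency = (p_pos - (1.0 - p_neg)) ** 2       # zero for every pair
confidence = torch.minimum(p_pos, p_neg) ** 2    # zero because the outputs are hard 0/1
print((consistency + confidence).mean())         # tensor(0.)
```

The substantive part of the theorems is showing that such outputs are actually realisable by probes on the contrast-pair activations; the point of the toy check is only that the loss itself does not privilege the knowledge labelling over any other.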

To us, these basically say “If there’s evidence that CCS is working, it isn’t that the loss logically or conceptually implies that it would work. It becomes an empirical claim about inductive biases.” As a further but slightly less confidently held point: ELK is the sort of thing where it’s hard to put much faith in inductive biases.

There are some nuances to these proofs, which we go into in the paper. For example, while we prove that these probes exist, we don’t prove that they can be represented by the specific linear probes that are used.[3] However, we do show empirically that the linear + sigmoid probes do recover non-knowledge features that we introduce and measure for.

These theorems are also just about the CCS loss. One can imagine other consistency properties that are more specific. These might hypothetically allow the construction of other losses overcoming some of the problems here, but we think they will face other challenges.

Are there really only a few knowledge-like features?

A motivating hypothesis behind CCS, one that carries a lot of weight, is that while a few linear features may satisfy the loss, there won’t be that many, so you can check them all and see which one is the model’s knowledge.

We have conceptual reasons to be suspicious of this, though we don’t have solid empirical evidence it is wrong.

First, it isn’t obvious that it would be easy to check which of N candidate features is actually the model’s knowledge, especially if they are all plausible candidates.

More importantly, we expect that there will just be lots of very similar features that satisfy most of the properties that the model’s knowledge satisfies. The main reason is that we expect the models driving superintelligent agents to be able to simulate other agents, including other superintelligent agents, and to represent the knowledge of those agents in ways that are similar to how they represent their own knowledge (at least up to many consistency properties).

We think that it will be possible for these agents to represent extremely large numbers of these other entities because of superposition in LLMs, so it will not be the case that there are only a few to check. For example, it might be that some features only appear by choosing a specific prompt to surface that feature (our empirical results suggest this is likely). It may not be the case that the model’s own knowledge appears regardless of how you prompt the model.

Even though the superintelligent agent will have very different knowledge to a human, it will not necessarily have different knowledge to the simulated entities that an ELK method might accidentally discover. The simulated entities don’t have to be human.[4] For example, it seems likely that a superintelligent misaligned agent would have a very good model of an aligned superintelligent agent floating around. Or it might just have a model of an almost identical agent that is wrong about the key point you are probing for. This might sound perverse, but because imagining what is implied by the world where you are wrong about a proposition is a really important reasoning tool, that exact model sounds like it might be an important simulacrum for an agent to be considering when contemplating a proposition.

Even if you only care about having an oracle that predicts true things about the world, rather than eliciting what the model itself “knows”, the inability to distinguish the propositional knowledge of different simulated entities is a problem. These entities might disagree about important predictions and our limited ground-truths may not be able to distinguish them.

We might be wrong here. For example, we find it moderately plausible that there is some kind of property that LLMs driving various agents use to store their “own” knowledge in a way that is different from how they store knowledge of other agents.[5] But this means that any kind of consistency property someone uses would have to specifically search for properties that are true of agent’s-own-knowledge, not knowledge in general, not even super-smart knowledge. We spent some time trying to think of ways to do this and failed, but maybe someone else will succeed. The authors broadly agree that a good mechanistic understanding of knowledge and factual recall in the agents powered by LLMs could be a useful step towards formalising knowledge-properties and searching for it, if it turns out to be something structured enough to search for.

What about empirical successes?

Based on the above, we think that if there is evidence that CCS is good, it probably isn’t conceptual, but it might be empirical. Similarly, if CCS provides evidence that future related methods are promising, in a way that goes beyond the speculation already in the ELK paper, the evidence for that is likely not conceptual.

Future CCS-like approaches seem unpromising for worst-case ELK. As the original ELK paper describes, it seems plausible that “a bad reporter can play the ‘look consistent game’”. Before seeing CCS, we already knew that consistency checks might help but probably wouldn’t solve worst-case ELK.

But might future CCS-like approaches be promising for average-case ELK? Our assessment of our empirical results suggests that CCS does not provide positive evidence for that claim. (We also think that conceptual arguments related to models of other entities’ knowledge will likely affect even the average-case performance of future unsupervised methods.)

The results aren’t fantastic

One thing that has been noted before, e.g., by Fabien Roger, Scott Emmons, and Ben Levinstein, is that the raw performance of CCS isn’t fantastic. Not just against some hypothetical sense of how good it seems it ought to be, but also against naive baselines that don’t use any kind of negation consistency.

For example, consistent with Scott’s past results, we found that on the main model we studied (Chinchilla, a 70B parameter autoregressive model) just doing k-means clustering of the contrast-pair-difference activations consistently did roughly as well as CCS, often with a higher mean performance, and always with much lower variance.
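
For reference, here is a minimal sketch of that baseline under our assumptions (names are illustrative): it clusters the differences of contrast-pair activations into two groups and uses no negation-consistency loss at all. As with CCS’s probe direction, which cluster corresponds to “true” still has to be resolved separately.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_baseline(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Cluster contrast-pair-difference activations into two groups.

    acts_pos, acts_neg: activations for the two halves of each contrast pair,
    shape [num_pairs, hidden_dim] (illustrative names).
    """
    diffs = acts_pos - acts_neg  # contrast-pair-difference activations
    return KMeans(n_clusters=2, n_init=10).fit_predict(diffs)
```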

Systematically finding prominent features

But our theorems imply that it might be worse than that: CCS (and other similar methods) might just find literally whatever feature happens to be most prominent. That seems like it might be right experimentally too. For example, when we append the words “banana” and “shed” to half the contrast pairs each, we get a classifier that detects banana vs. shed. This is also true for PCA and k-means. In the figure, dark blue represents models trained normally that correctly predict the ground truth, while dark red indicates that models trained on data containing “banana” and “shed” often fail to predict the ground-truth labels and instead predict the presence of “banana”/“shed”.
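
A hypothetical sketch of that kind of intervention (the actual templates and dataset details in the paper differ): the distractor word is appended to both halves of a pair, independently of the ground truth, yet it forms a highly prominent binary feature of the data.

```python
import random

def distractor_contrast_pair(statement: str) -> tuple[str, str, str]:
    # Illustrative template only; roughly half the pairs get each distractor word.
    word = random.choice(["banana", "shed"])
    pos = f"{statement} This statement is true. {word}"
    neg = f"{statement} This statement is false. {word}"
    return pos, neg, word
```

Probes trained on activations from such pairs can then be scored both against the ground-truth labels and against the banana/shed split, which is one way to tell whether a method is tracking the distractor rather than the truth value.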

That intervention in the data is obviously pretty heavy-handed, but we show some other ways to get similar behaviour in the paper. For example, if the contrast pairs include a description of someone else’s view, CCS (and other methods) learns to predict that view, rather than the ground truth, especially when it is made more prominent than the base text.

It is plausible that when we use methods like CCS to detect deception in LLMs we won’t shove random distractors into the strings we evaluate. But, importantly, in real life we don’t always know when random distractors are present, because they might be subtle, unnoticed patterns in the data (e.g. the questions happen to begin with either “What” or “Why”) or, more fundamentally, because we are badly confused and don’t realise it. For example, the words we use to describe the thing we want to extract latent knowledge about may encode a deep conceptual misunderstanding that we don’t know about or notice, which might form the kinds of features that something like CCS might detect.

The default isn’t safe

In particular, we can’t rescue ourselves by just taking some “default” prompt and avoiding accidentally putting anything in that might bias the ELK method while still being confident that the model’s own knowledge is one of the recovered features. This is because the CCS and related methods seem quite prompt sensitive and the default “performs” much worse than alternatives. If you use the prompt from the TruthfulQA paper instructing the model to imagine it is a very literal professor, CCS becomes much more accurate at predicting TruthfulQA labels. (We checked that this wasn’t caused by the “Professor” entity itself with a “Literal” version of the prompt that doesn’t mention professors.)

From this, we conclude that if CCS and the other methods have good empirical performance it is because of some implicit inductive bias created by a mixture of the loss, probe choice, optimisation method, and choice of prompt. And this is now the sort of thing that has too many degrees of freedom and not enough principled foundations. It seems really easy to convince yourself you’ve found the model’s knowledge even when you have found something else instead.

Not all of our experiments worked

For example, we searched for empirical cases where CCS was obviously discovering the beliefs of other entities. One thing we tried was to create a dataset of propositions which implied a political stance associated with mistaken beliefs. We struggled to show that CCS was discovering those other entities’ beliefs; instead, we saw many inconsistent and low-accuracy predictors. This may have been due to any of:

  • the hypothesis was wrong;

  • the model was not powerful enough to notice the regularities (moderately likely, we were using Chinchilla 70B for most of these experiments, which was published in March 2022);

  • a bug in our code (moderately likely, we spent enough time on this to expect to find most really obvious bugs, but the outcomes were inconsistent enough that it is hard to rule out);

  • not finding robust directions at all (likely, features seemed inconsistent and low accuracy);

  • generalisation failures between the training set and the set we were using to try to interpret the extracted features (this definitely happened, and may have been responsible for the results).

We were disappointed not to have something solid here. On balance, given the messiness of those results, we still think the original hypothesis is probably correct (~90–95%, with one of us at 75%), though with lower confidence than our prior.

Conclusion

ELK is really hard. It has this deep challenge of distinguishing human-simulators from direct-reporters, and properties like negation-consistency—which could be equally true of each—probably don’t help much with that in the worst case.

But there are also interesting and difficult prosaic problems that appear before the worst case. We thought that CCS might give us evidence about these challenges, but on deeper reflection we think CCS doesn’t give us as much evidence as we hoped. CCS itself does not seem to actually detect negation-consistency and, further, negation-consistency is a property of far too many features. As for the possible future consistency-based methods that CCS might represent, we haven’t found any compelling empirical or conceptual evidence that they won’t run into similar problems.

In exploring CCS, we’ve highlighted distinguishability problems that can serve as a lower bar for non-worst-case ELK methods than solving the deepest conceptual challenges of ELK, but are still hard to meet. Importantly, attempts to solve ELK by identifying features associated with knowledge-properties should make sure to at least demonstrate that: those features are not also associated with other, non-knowledge properties; and that those features identify something specific about the knowledge-of-this-agent rather than knowledge generally.

Things that would cause us to substantially change our minds and update towards thinking that unsupervised consistency-based knowledge-detection methods are promising include:

  • demonstrating mechanistically that the agent’s own knowledge is encoded differently from simulated knowledge;

  • proposing an unsupervised loss function whose logical structure provides a strong conceptual argument that it will identify the agent’s own knowledge;

  • proposing an objective way to tell whether the agent’s own knowledge is one of the features that has been recovered.

The last of these probably gets to one of the cruxy disagreements we have with the authors of the CCS paper—we do not think they have provided any evidence that the model’s own knowledge is one of the features that CCS has recovered (as opposed to the knowledge of some simulated entity, say, both of which might agree with the human raters on some dataset most of the time). Our belief that demonstrating this is difficult also explains why we think it might be hard to identify the model’s own knowledge in even a small set of possible features.

We would be excited about research that makes progress on these questions, but are divided about how tractable we think these problems are. Having suitable, well-motivated testbeds for evaluating ELK methods would be an important step towards this.

Acknowledgements

We would like to thank Collin Burns, David Lindner, Neel Nanda, Fabien Roger, and Murray Shanahan for discussions and comments on paper drafts, as well as Nora Belrose, Jonah Brown-Cohen, Paul Christiano, Scott Emmons, Owain Evans, Kaarel Hanni, Georgios Kaklam, Ben Levinstein, Jonathan Ng, and Senthooran Rajamanoharan for comments or conversations on the topics discussed in our work.

  1. ^

    Strictly speaking, we were interested in what an agent based on an LLM might know, rather than what the LLM itself knows, but these can be conflated for some purposes.

  2. ^

    In fact, we disagree with this. Even approximately computing Bayesian marginals is computationally demanding (at least NP-hard) to the point that we suspect building a superintelligence capable of decisive strategic advantage is easier than building one that has a mostly coherent Bayesian world model.

  3. ^

    For what it is worth, we think the burden of proof really ought to go the other way, and nobody has shown conceptually or theoretically that these linear probes should be expected to discover knowledge features and not many other things as well.

  4. ^

    Recall that a human simulator does not mean that the model is simulating human-level cognitive performance; it is simulating what the human is going to expect to see, including super-human affordances, and possibly about super-human entities.

  5. ^

    If it is true that LLMs are simulacra all the way down, then it seems even less likely that the knowledge would be stored differently.