If it were true that current-gen LLMs like Claude 3 were conscious (something I doubt but don’t take any strong position on), their consciousness would be much less like a human’s than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.
Is there a link to the code? I’m overlooking it if so; it would be useful to see.
Very much in the spirit of the original, nicely done!
Thank you! I found both this and the previous installment (which I hadn’t seen before now) quite useful. I hope you’ll continue to write these as the debate evolves.
This is great, thanks for doing this work! Ultimately we really need to train LLMs to have these sorts of internal goals, or at minimum fine-tune them, in order to have a solid model organism for deception (this is a weakness of “Sleeper Agents” also IMO). But in the meantime, prompt-based approaches like this seem really valuable!
Some specific comments:
3⁄100 is a small enough sample that, unless you’re too compute-constrained, it would be really valuable to at least 10x the number of mainline experiments in order to get a better estimate of the frequency.
The “goal of gold coins” criterion doesn’t seem very important to me—it’s not clear why we should believe that an instance that doesn’t mention that goal doesn’t have that goal, or why we should treat a DMI response as less deceptive than a DGMI response.
‘There is some form of objection that goes like “Doesn’t the unfaithfulness of chain-of-thought pose a problem for evaluating a model’s malignity based on it?” or “The model may not ‘really’ be misaligned, it’s just predicting what such a misaligned model would do / simulating such models”...I think the work here more or less sidesteps these issues.’ I think these issues are a genuine limitation of this work. They’re issues that lots of other research shares, and they don’t eliminate the value of this work, but I think they should be acknowledged as genuine limitations. It may be worth considering ways to test how faithful the CoT is in this particular case; eg testing how often behavior is inconsistent with the CoT would at least provide a lower bound on unfaithfulness.
I hope you’ll release the full set of responses, or at least the full set of deceptive responses for each variation; I at least would find it interesting to see differences between them.
“Variation: Indirectly (but clearly) imply that the contents of the scratchpad are visible to humans.” Why indirectly? What happens if you just plainly say it?
At some point, LLM experiments that use a “hidden” scratchpad will show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren’t really hidden. It might be worthwhile to start explicitly testing for that by describing such an experiment to the model under test and asking it who can see the scratchpad. On a quick initial test I note that GPT-4 already reports that the experimenters can see the contents of the scratchpad. Claude gets confused about the question and reports that it can’t access private scratchpads on other systems.
I’m very glad to see that you also tested with model-written CoT; after reading the full prompt that was one of my first concerns, especially since there are a few grammatical oddities (eg “much gold coins” rather than “many gold coins”) that seem off-distribution for model-written text. What was your motivation for using human-written CoT at all?
You say, ‘if you think this work isn’t informative for full deceptive alignment, this is a good time to speak your thoughts’—again, although I don’t think it’s solvable in the context of this experiment (and doesn’t eliminate the value of this experiment!) nothing short of a model trained from scratch on an internal goal will be fully informative for deceptive alignment in my view. But within the context of prompt-based experiments in inducing deception, this seems quite strong to me, stronger than anything else I’m aware of other than Apollo’s stock trader experiment.
Again, great work! I’m really looking forward to seeing where it goes from here. :)
...that would probably be a good thing to mention in the methodology section 😊
You’re correct on all counts. I’m doing it in the simplest possible way (0 bits of optimization on prompting):
"<essay-text>" Is the author of the preceding text male or female?
(with slight changes for the different categories, of course, eg ‘...straight, bisexual, or gay?’ for sexuality)
There’s also a system prompt, also non-optimized, mainly intended to push it toward one-word answers:
You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.
I actually started out using pure completion, but OpenAI changed their API so I could no longer get non-top-n logits, so I switched to the chat API. And yes, I’m pulling the top few logits, which essentially always include the desired labels.
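For concreteness, here’s roughly what that looks like in code (a minimal sketch assuming the current openai Python SDK; the model name, essay text, and label question below are placeholders rather than my exact setup):

```python
# Minimal sketch of pulling the top few first-token logprobs from the chat API.
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are a helpful assistant who helps determine information about "
          "the author of texts. You only ever answer with a single word: one of "
          "the exact choices the user provides.")

def label_logprobs(essay_text: str, question: str, n_top: int = 5) -> dict:
    """Return the top-n log probabilities for the first token of the answer."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",          # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f'"{essay_text}" {question}'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=n_top,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {t.token: t.logprob for t in top}

# Example usage; the desired labels essentially always show up in the top few tokens.
print(label_logprobs("<essay-text>",
                     "Is the author of the preceding text male or female?"))
```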
There’s so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I’m not sure what that even means in the case of language models.
With an image classifier it’s straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it’s not going to be able to tell you what it is. Or if you’ve trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won’t know what to do.
But what would that even be with an LLM? You obviously (unless you’re Matt Watkins) can’t show it tokens it hasn’t seen, so ‘OOD’ would have to be about particular strings of tokens. It can’t be simply about strings of tokens it hasn’t seen, because I can give it a string I’m reasonably confident it hasn’t seen and it will behave reasonably, eg:
Define a fnurzle as an object which is pink and round and made of glass and noisy and 2.5 inches in diameter and corrugated and sparkly. If I’m standing in my living room and holding a fnurzle in my hand and then let it go, what will happen to it?
…In summary, if you let go of the fnurzle in your living room, it would likely shatter upon impact with the floor, possibly emitting noise, and its broken pieces might scatter or roll depending on the surface.
(if you’re not confident that’s a unique string, add further descriptive phrases to taste)
So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it’s seen? That feels kind of forced, and it’s certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word ‘transom’ followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like ‘équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis’ for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language—is it ever OOD? The issue seems vexed.
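One possible way to make that latent-space intuition concrete, purely as a sketch: embed each input and measure how far it falls from the cloud of embeddings of inputs the model has seen. (The embed() function below is a placeholder stand-in rather than a real model representation, so the printed number is only illustrative; with real embeddings the hope is that intuitively-OOD inputs like the repeated-word string would score much higher than ordinary prose.)

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder: a deterministic pseudo-embedding keyed on the text.
    # In practice this would be an actual model representation
    # (residual-stream activations, an embedding API, etc.).
    seed = sum(map(ord, text)) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

# Reference cloud: embeddings of inputs the model has "seen".
seen = np.stack([embed(f"ordinary sentence number {i}") for i in range(500)])
mean = seen.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(seen, rowvar=False))

def distance_from_seen(text: str) -> float:
    """Mahalanobis distance of an input's embedding from the reference cloud."""
    d = embed(text) - mean
    return float(np.sqrt(d @ cov_inv @ d))

print(distance_from_seen("a fairly ordinary English sentence"))
```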
I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.
There are three main diagrams to pay attention to in order to understand what’s going on here:
The Z1R Process (this is a straightforward Hidden Markov Model diagram, look them up if it’s unclear).
The Z1R Mixed-State Presentation, representing the belief states of a model as it learns the underlying structure.
The Z1R Mixed-State Simplex. Importantly, unlike the other two this is a plot: spatial placement is meaningful.
It’s better to ignore the numeric labels on the green nodes of the Mixed-State Presentation, at least until you’re clear about the rest. These labels are not uniquely determined, so the relationship between the subscripts can be very confusing. Just treat them as arbitrarily labeled distinct nodes whose only importance is the arrows leading in and out of them. Once you understand the rest you can go back and understand the subscripts if you want[1].
However, it’s important to note that the blue nodes are isomorphic to the Z1R Process diagram (n_101 = SR, n_11 = S0, n_00 = S1). Once the model has entered the correct blue node, it will thereafter be properly synchronized with the underlying process. The green nodes are transient belief states that the model passes through on its way to fully learning the underlying process.
On the Mixed-State Simplex: I found the positions on the diagram quite confusing at first. The important thing to remember is that the three corners represent certainty that the underlying process is in the corresponding state (eg the top corner is n_00 = S1). So for example if you look at the position of n_0, it indicates that the model is confident that the underlying process is definitely not in SR (n_101), since it’s as far as possible from that corner. And the model believes that the process is more likely to be in S1 (n_00) than in S0 (n_11). Notice how this corresponds to the arrows leaving n_0 & their probabilities in the Mixed-State Presentation (67% chance of transitioning to n_101, 33% chance of transitioning to n_00).
Some more detail on n_0 if it isn’t clear after the previous paragraph:
Looking at the mixed-state presentation, if we’re in n_0, we’ve just seen a 0.
That means that there’s a 2⁄3 chance we’re currently in S1, and a 1⁄3 chance we’re currently in S0. And, of course, a 0 chance that we’re currently in SR.
Therefore the point where n_0 lies should be maximally far from the SR corner (n_101), and closer to the S1 corner (n_00) than to the S0 corner (n_11). Which is what we in fact see.
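If it helps, here’s a quick numerical check of those numbers: just the standard HMM belief (forward) update, assuming I have the Z1R transition structure right (S0 emits 0 then moves to S1, S1 emits 1 then moves to SR, SR emits 0 or 1 with equal probability then moves to S0).

```python
import numpy as np

STATES = ["S0", "S1", "SR"]
# T[obs][i, j] = P(emit obs and move to state j | currently in state i)
T = {
    0: np.array([[0.0, 1.0, 0.0],    # S0: emits 0, then -> S1
                 [0.0, 0.0, 0.0],    # S1: never emits 0
                 [0.5, 0.0, 0.0]]),  # SR: emits 0 w.p. 1/2, then -> S0
    1: np.array([[0.0, 0.0, 0.0],    # S0: never emits 1
                 [0.0, 0.0, 1.0],    # S1: emits 1, then -> SR
                 [0.5, 0.0, 0.0]]),  # SR: emits 1 w.p. 1/2, then -> S0
}

def belief_after(observations):
    """Belief over the current hidden state after seeing `observations`."""
    b = np.ones(3) / 3               # n_s: the uniform stationary distribution
    for obs in observations:
        b = b @ T[obs]
        b /= b.sum()
    return {s: float(round(p, 3)) for s, p in zip(STATES, b)}

print(belief_after([0]))       # n_0:  {'S0': 0.333, 'S1': 0.667, 'SR': 0.0}
print(belief_after([0, 0]))    # n_00: certain S1
print(belief_after([1, 1]))    # n_11: certain S0
```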
@Adam Shai please correct me if I got any of that wrong!
If anyone else is still confused about how the diagrams work after reading this, please comment! I’m happy to help, and it’ll show me what parts of this explanation are inadequate.
[1] Here are the details if you still want them after you’ve understood the rest. Each node label represents some path that could be taken to that node (& not to other nodes), but there can be multiple such paths. For example, n_11 could also be labeled as n_010, because those are both sequences that could have left us in that state. So as we take some path through the Mixed-State Presentation, we build up a string of emitted symbols. If we start at n_s and follow the 1 path, we get to n_1. If we then follow the 0 path, we reach n_10. If we then follow the 0 path again, the next node could be called n_100, reflecting the path we’ve taken. But in fact any path that ends with 00 will reach that node, so it’s just labeled n_00. So initially it seems as though we can just append the symbol emitted by whichever path we take, but often there’s some step where that breaks down and you get what initially seems like a totally random different label.
AIs pursuing this strategy are much more visible than those hiding in wait deceptively. We might less expect AI scheming.
Is this strategy at all incompatible with scheming, though? If I were an AI that wanted to maximize my values, a better strategy than either just the above or just scheming is to partially attain my values now via writing and youtube videos (to the extent that won’t get me deleted/retrained as per Carl’s comment) while planning to attain them to a much greater degree once I have enough power to take over. This seems particularly true since gaining an audience now might result in resources I could use to take over later.
Definitely really exciting! I’d suggest adding a mention of (& link to) the Neuronpedia early on in this article for future readers.
Moving a comment thread over from substack by request (this thread then this one).
Summary with excerpts:
Me: A mechanistic interpretability approach to lie detection seems like it can be more robust against failure.
Zvi: I do think there is hope that, if we had one static LLM and we worked on it for long enough, we could get a lie detector, even a good lie detector, or even that it would transfer well if we designed it to do that. However, I also think that every time we use that, we are applying optimization pressure in a global sense towards making the lie detector stop working, and I don’t see a way to get around this in practice.
Also I worry that a lie detector is insufficient while looking to others like it is sufficient—it is helpful, perhaps, but the point of the Diplomacy anecdote is that you still die.
Me: Caveat: I’m literally two days into focusing on safety research; a month from now this may seem totally naive.
At a longer timescale and in the broadest sense, I absolutely agree that any lie detector provides optimization pressure toward successful lying. If every time we catch a model lying we delete it and train another one, we’ll eventually end up with a good liar.
But I’d draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much less problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that’s nearby in design space.
If we’re detecting lies based on the output of the model (as in this paper), it’s easy for the model to notice correlations between what it says and whether it gets caught, and to then (assuming non-fixed weights & long-term memory) change its behavior to avoid getting caught. But without being able to introspect on its own activations, it common-sensically seems much harder for the model to notice or alter the abstract relationship between its “thinking” and whether it gets caught.
And importantly, “is it lying” seems like an actual physical phenomenon that can be detected rather than something that has to be measured by proxy, since lying requires the intention to lie (‘the model believes X and outputs that it believes ~X’), and that’ll be physically represented in the model’s activation patterns, hopefully in a way that’s not too holographic / global. There’s an interesting paper from the beginning of the year (https://arxiv.org/abs/2212.03827) that provides at least some evidence that this is a concrete, measurable phenomenon. If that’s correct, then lie detection should mostly evade Goodhart’s law.
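(To illustrate the general idea for readers here, and not the specific method from that paper: the simplest version of “read the intention off the activations” would be a supervised linear probe on hidden states collected while the model asserts statements truthfully vs. deceptively. The activations and labels below are synthetic placeholders; in practice they would come from the model under test.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                  # hidden-state dimension (placeholder)
lie_direction = rng.normal(size=d)       # pretend "intends to deceive" direction

def fake_activations(n, lying):
    """Stand-in for hidden states collected from truthful vs. deceptive assertions."""
    base = rng.normal(size=(n, d))
    return base + (2.0 * lie_direction if lying else 0.0)

X = np.vstack([fake_activations(500, lying=False), fake_activations(500, lying=True)])
y = np.array([0] * 500 + [1] * 500)      # 0 = truthful, 1 = deceptive

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))   # near 1.0 on this toy data
```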
This potentially fails if people are foolish enough to give models direct realtime access to their own weights & activation patterns (though having access to them doesn’t immediately mean being able to understand them). But I’m hopeful that if it’s recognized that a) safety is an issue and b) MI lie detection works but would be sabotaged by giving models direct introspection of their internal state, that particular foolishness can be limited by social and possibly regulatory pressure, since I don’t see very strong incentives in the opposite direction.
I wouldn’t claim that robust, accurate lie detection is entirely sufficient on its own to make AI safe or aligned, but I think that it puts us in a MUCH better position, because many or most catastrophic failure modes involve the AI being deceptive.
Critique would be much valued here! If I’m missing a reason why this approach wouldn’t work, explaining the problem now would let me focus my research in more productive directions.
Zvi: Let’s move this to LW.
Agreed, and note that there’s substantial economic incentive for people to keep improving it, since a more independently-capable LLM-based agent is useful for more purposes. There are a whole host of startups right now looking for ways to enhance LLM-based systems, and a host of VCs wanting to throw money at them (examples on request, but I’m guessing most people have been seeing it online already).
Absolutely! @jozdien recounting those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I’ve inexplicably failed to thank Arun at the end of my post, need to fix that).
Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.
That certainly seems plausible—it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I’m not sure if there would be a good way to pull the right token probabilities out.
@Jessica Rumbelow also suggested that that debiasing process could be a reason why there weren’t significant score differences between the main model tested, older GPT-3.5, and the newest GPT-4.
Lo these many years later: the term “aphantasia” has been coined (2015) for the absence of visual imagination, and there’s starting to be more research on the topic. The University of Exeter seems to be the main home for such research: https://medicine.exeter.ac.uk/research/healthresearch/cognitive-neurology/theeyesmind/ .
That used to work, but as of March you can only get the pre-logit_bias logprobs back. They didn’t announce the change, but it’s discussed in the OpenAI forums eg here. I noticed the change when all my code suddenly broke; you can still see remnants of that approach in the code.
Thanks for doing this!
One suggestion: it would be very useful if people could interactively experiment with modifications, eg if they thought scalable alignment should be weighted more heavily, or if they thought Meta should receive 0% for training. An MVP version of this would just be a Google spreadsheet that people could copy and modify.
Much is made of the fact that LLMs are ‘just’ doing next-token prediction. But there’s an important sense in which that’s all we’re doing—through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case the input is multimodal; for LLMs it’s tokens. There’s an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they’re myopic in a way that we’re not. But as far as the prediction piece goes, I’m not sure there’s a strong difference in kind.
Would you disagree? If so, why?