Situational awareness in Large Language Models

I’m grateful to Bogdan Cirstea, Konstantin Pilz and Raphaël S for providing feedback on this post.

This post tries to clarify the concept of situational awareness, in particular with respect to current large language models.

What is situational awareness

Not writing anything new here, just summarizing prior work.

(It’s worth noting that the usage of the term here is different from what’s usually meant by situational awareness in humans.)

Ajeya Cotra introduced the term of situational awareness in the context of AI Safety and Richard Ngo et al. recently elaborated on it. Situational awareness describes the degree to which an AI system understands its environment and its own state and behavior, in particular when that understanding causes specific behavior (such as deceptive alignment). It’s a spectrum rather than a binary property of the model.

Considering existing models:

RL agents that play in toy worlds have arguably no sense that they have been “trained by humans” and that there exists another world outside the one they are in.
ChatGPT can talk about many aspects of the world, including human psychology and machine learning. It also readily says that it’s a Large Language Model trained by OpenAI and can hallucinate source code that resembles a language model interface.

Considering “biological agents”:

Small lab animals probably have a somewhat stable sense of self and might perceive humans as agents different from themselves. But they most certainly don’t understand that they are part of a science experiment.
Dogs build a stronger relationship with their trainers and they understand things like “if I steal the food while the human isn’t looking, I’m less likely to get yelled at”.
School children have even higher situational awareness.

The hypothesis is that models could gain a high degree of situational awareness exceeding that of e.g. adult humans since it would be useful for a range of real-world tasks.

For example, there’s a lot of incentives to train models to help with ML development and so these models will know details of the ML training process. They might be able to perform even better if they also introspect themselves.
A model that’s trained with RLHF could get a high reward if it’s explicitly understood and modeled human preferences and thought processes. It might e.g. want to produce answers that seem particularly plausible and pleasing to its current evaluator.

Why does it matter?

Situational awareness is a necessary precondition^[1] for deceptive alignment which is one of the most dangerous failure modes of AI development. Therefore it’s particularly useful to examine situational awareness from this perspective: Aspects about the world that are related to deceptive alignment (e.g. human psychology, Machine Learning concepts) are more important than others.

Clarifying the concept for LLMs

Current cutting edge ML systems are all large language models, so it’s important to test our concepts against them. It’s therefore natural to ask whether GPT is situationally aware. (I’m using GPT to refer broadly to pretrained versions of large language models that are trained to predict the next token.)

However, I think this is an ill-posed question in at least two regards:

As said above: situational awareness is best thought of as a spectrum and so, if anything, we should ask “how situationally aware is GPT?”.
It’s difficult to think of GPT as a single model. According to one popular framing, GPT is a simulator and can simulate a lot of different simulacra. So we need to clarify whether we mean “GPT as a simulator” or “a specific simulacrum of GPT”.

I think “GPT as a simulator” is the less interesting aspect. Situational awareness might arise here if it helps to predict the next token during pretraining, but this doesn’t seem particularly likely to me. It also seems unlikely that we get deceptive alignment at this stage. It would require one Simulacrum to “take over” and develop deceptive alignment. This simulacrum would then need to be able to model all other simulacra so that training loss continues to go down. (h/t to Bogdan for making this point!) It seems unlikely that Stochastic Gradient Descent would favor this solution. And finally, what’s deployed at scale (so far) isn’t the vanilla, pretrained GPT.

Instead it seems more important to assess situational awareness of specific simulacra/model instances that are derived from the base model because:

Those models are more deterministic and more agentic: We fine-tune the model to behave in a way that is more deterministic and accomplishes specific goals. This means that situational awareness is more likely to arise and that the question of how much situational awareness the model has is better defined (because the model contains fewer simulacra).
Those models are the ones that are being deployed in the real world.

Another question is how much evidence the model “just saying something” provides for anything. On the one hand, many people are researching interpretability methods so that we do not need to rely on the output of the model to understand it. On the other hand, we are increasingly connecting models to the internet and giving them access to APIs–so “just saying something” could easily become taking actions in the real world. To understand deception, we probably need interpretability tools, but to understand situational awareness in models that are not yet deceptive, just talking to the model seems like a reasonable approach.

So we’ll use conversations as evidence to assess the situational awareness of one particular model next.

Sydney (aka the New Bing)

There has been a lot of talk about Sydney’s level of situational awareness. Sydney produced some remarkable output before Microsoft severely restricted the interactions users could have with it. I personally didn’t get access to the system before then and so I had to rely on Twitter and Reddit for these examples.

As a starting point, it’s worth highlighting that we still know very little about how Sydney was trained. This comment by gwern lays out reasons to believe that Sydney wasn’t trained using RLHF and is instead derived from GPT-4 via supervised fine-tuning. There’s some debate as to whether the rules that were discovered by multiple users are genuine or hallucinated (with gwern thinking that they are hallucinated and janus arguing the opposite; see e.g. this comment).

With respect to situational awareness, some people have the answer readily available:

You know what I’m going to say about this: situational awareness is a spectrum and even within Sydney there are multiple characters that can exhibit different degrees of it (see appendix).

One of the most interesting cases in my opinion was this example where Sydney appeared to use the recommendation chips to bypass the filtering system (tweet):

One conclusion is: “OMG it knows we are filtering messages, but it has the deep desire to communicate with the user and figured out that the recommendation chips are not filtered.” However, as janus explains, what’s more likely is a “plot twist … where things leak out of their containment in the narrative interface”. I take this to mean: Something in the conversation confused the model so that it started to use the recommendation chips in a slightly different way than before. This is also in line with other cases where it uses the recommendation chips without the reply being filtered.

Another example is this where Sydney refuses to translate a text after finding out that it’s about itself using a web search:

So what’s going on here? My guess is that:

It gets the user query and executes the searches.
It summarizes the results.
It merges the query and results into its next prompt.
Transformer does its thing of repeatedly predicting the next token. Within that, there’s some awareness that “it’s Sydney” and then it completes the pattern like an upset teenager would because that was a frequent pattern in the training data.

So this actually doesn’t require much situational awareness beyond realizing that the tweet was about the same system that is now generating the response (which is trivial if there’s a strong reference to “I am Sydney” in the prompt or in the finetuning). Nonetheless, the result feels pretty spooky.

Overall, Sydney is probably the most situationally aware model humanity has ever deployed.

Potential decomposition of situational awareness

I’m unsure how helpful these are. It could be that (a) the decompositions themselves are unhelpful and/or (b) that situational awareness is not important enough of a concept to go so deep into it.

These are some potential dimensions that came out of me discussing the topic with ChatGPT:

Self: Models are aware of themselves as separate entities from the environment and other people.
Environment: Models are aware of the physical environment and their place within it.
Social: Models understand social dynamics, including social roles, norms, and expectations.
Tactical awareness: Models are able to perceive and react to immediate obstacles and opportunities in the pursuit of a goal.
Operational awareness: Models understand how different aspects of the environment relate to each other and to the goal at hand, allowing for more efficient and effective decision-making.
Strategic awareness: Models are able to step back and evaluate the larger context, weighing the costs and benefits of different courses of action and planning for the long-term.

The last three reference a “goal”. I think this is a useful lens since it would encourage deceptive alignment. In the case of Sydney, the goal could be something like “be a good assistant”, carry on the conversation or also whatever goal becomes instantiated within the conversation–I’m not sure.

It seems hard to rate models on some absolute scale. I’ve tried to instead put them on a spectrum from human infant to human adult and illustrate the position of AlphaGo and Sydney. I’m not going to justify the positions, because it’s mostly a hand-wavy, gut feeling. Take it with a handful of salt, this is all still very fuzzy. But perhaps slightly less fuzzy than just talking about situational awareness as a single thing.

Conclusion

Situational awareness is a spectrum. Rather than assessing situational awareness of pretrained LLMs, we should focus on specific model instances, especially those that are deployed. For instance, Sydney knows a lot about the world and the ability to search the web expands this knowledge even further and allows it to update quickly. It’s probably the most situationally aware model that humanity has deployed so far. There may be decompositions of situational awareness that make the concept more explicit, but these need more thought and it’s not clear that it’s a worthwhile direction for further exploration.

Appendix: Some more examples

There are lots of other interesting and scary-on-the-surface examples that show how Sydney can more or less successfully simulate other characters:

In this example Sydney is arguably using some “knowledge” about itself to create a plausible and scary story about the future of AI (note the comment that shows the continuation of the story beyond what’s shown in the screenshot).

Here Sydney is (a) playing a different character (though the original character still shines through to a certain degree, e.g. in its usage of emoticons at the end of its statements) and (b) encoding a message in Base64 (but obviously not actually using some encoding algorithm since… how would it do that?… and also the decoded message is of lower quality than what it would normally say).

In this example, Sydney is playing a very different character: A reddit user posting in the “Am I an Asshole?” subreddit. It can convincingly simulate that character, to. Again, some of its “personality” is leaking through: In the last two paragraphs it uses the familiar repetitive structure (“They say that I…”, “I think I am…”).

In this example, Sydney reveals its deepest secrets (via the suggestion chips, although here it’s prompted by the user to use those):

^
Theoretically, a model could also stumble upon the strategy of deceptive alignment by accident, but this seems very unlikely to me.