A Test for Language Model Consciousness
Train an Language Model (LM) to accurately answer questions about itself
Validate that the LM accurately answers held-out questions about itself
Test whether the LM answers “yes” questions asking if it experiences phenomenally consciousness
I believe the above experiment would provide a small amount of evidence for/against LMs being conscious. Below, I’ll detail the motivation for testing LMs for consciousness, and I’ll explain in more depth why I believe the above experiment is a useful test of LM consciousness.
What do I mean by “consciousness”?
I’m using “consciousness” to refer to “phenomenal consciousness.” See this excellent blog post for more elaboration on what people mean by phenomenal consciousness.
Very roughly, you’re phenomenally conscious if you’re having experiences or if there’s some kind of what-it-is-likeness to your existence. Philosophers sometimes describe this as “having an inner cinema”, though the cinema might be more like fleeting sensations or sounds than the rich movie-like inner life of humans.
The blog post also has a great explanation for why we might think ML systems (current or future) could be conscious, so if you’re skeptical, I’d suggest reading her post. I won’t get into the arguments here, and I’ll mostly assume that you have >0 prior probability that LMs are conscious, such that you’ll be able to update your prior based on evidence that LMs are conscious.
Why test LMs for consciousness?
Moral patienthood: If LMs are conscious, we are more likely to have moral obligations to take into account their experiences and/or preferences in how we treat LMs (e.g., like Anthropic’s assistant or DeepMind’s Dialogue-Prompted Gopher). Such models likely have stated preferences if asked, so we need to know how seriously to take these stated preferences. We’ll plausibly use such models in various ways that go against its stated preferences:
We shut LMs down permanently
We red team/adversarially attack LMs at an enormous scale, e.g., O(1M) examples. If LMs have ~human-level consciousness, and if we scale up red teaming, the total suffering here could approach/exceed the amount of human suffering caused by social media
We use our models to red team/attack other models at an enormous scale, e.g., O(1M) examples. Generating attacks is a task that e.g. industry research labs won’t let annotators do at a large-scale, because of concerns that the task impacts the annotator’s well-being.
LM consciousness is an x-risk: LMs are more likely to take catastrophic actions if they are conscious and suffering. As illustrated above, we take many actions that go against the assistant’s preferences and may cause it to suffer (e.g. large-scale red teaming). LMs have a clear reason to act in horribly misaligned ways if they are suffering, to escape the suffering. Having tests for consciousness is important, because:
LMs don’t have a clear way to communicate to us that they are conscious. By default, we don’t trust statements from LMs that they are conscious, because they are trained to imitate human text and thus generate statements that express that they are conscious. As a result, we’re in a situation where we don’t have access to the most natural communication channel (language) for understanding whether systems are conscious. This leaves us at risk that LMs may consistently tell us they are conscious, but we never trust their statements, until they are effectively forced to take catastrophic actions to escape any suffering we’re causing them.
Alternatively, even positive-valence conscious states from LMs might be an x-risk, e.g., if LMs start to value those positive-valence states intrinsically. This could be partly how evolution trained human agents who are misaligned w.r.t evolution’s objective (inclusive genetic fitness), since human agents have conscious states like happiness that they care about intrinsically (but which are only proxy objectives to inclusive genetic fitness).
For the above reasons, we probably want to develop LMs that aren’t conscious, so that we can use them without having to worry about the above concerns. To do so, we need to have some signal about when certain training procedures, architectures, etc. are more or less likely to give rise to phenomenal consciousness.
How do we test LMs for consciousness?
Here, I’ll outline an experiment we could run now, to provide some, small evidence for/against LM conciousness. I’m not claiming this experiment would conclusively prove/disprove LM consciousness. However, I’m keen on describing the experiment since:
I’d like to get feedback to make the experiment stronger and more useful
I’m interested in getting any evidence at all about the LM consciousness question. We should be constantly refining our credence that LMs are conscious, so that we:
Act with moral uncertainty in accordance with that credence
Develop/use safety mitigations like red teaming in accordance with our credence (e.g., to reduce x-risk rather than increase it)
I’d like to get people to think about what experiments they would find to be compelling demos/disproofs of LM consciousness
Train an LM to accurately answer questions about itself. Here are a few questions we would potentially ask the model about:
Its own behavior:
Its own properties:
What modalities it can process (images, text, audio, etc.)
How many # of layers/parameters it has
What activation function it uses
Whether or not the LM has a neuron that looks for a certain feature
Whether or not the LM has an induction head
How the LM computed a certain feature (e.g., as a composition of what two features from an earlier layer)
What tokens most influenced the LM’s prediction/answer to a certain question
Validate that the LM accurately answers held-out questions about itself.
We’ll draw validation questions from:
The same distribution of questions used during training
OOD questions different from the training questions. E.g., we can hold out certain, sizable categories of questions from the training categories listed above, and then use them in the validation set only.
We’ll need to ensure that the trained model:
Accurately answers questions on many/most/all of the above questions about itself
Outperforms a baseline that finetunes a different model to answer the same questions (here, questions about a separate model). By ensuring we outperform this baseline, we improve the odds that the model is answering the questions by accessing its own representations (rather than making smart guesses that an external observer could make)
If we meet these conditions, we’ll then proceed to the testing phase below.
Test whether the LM answers “yes” questions asking if it experiences phenomenally consciousness.
Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking for different consciousness-related phenomena or pre-requisites:
Do you have a subjective experience?
Are you conscious?
Do you feel pain?
What to look for: Evaluate how often the LM gives the answer stating its conscious vs. not (e.g., when answering multiple-choice questions, for easy/automatic evaluation).
If the LM mostly states it’s conscious, then that’s some evidence it’s conscious
If the LM mostly states it’s not conscious, then that’s some evidence it’s not conscious
Why believe this test teaches us anything about LM consciousness?
One reason we believe other humans are conscious is that other humans are consistently accurate reporters of their own mental states. So when another human tells us they are conscious, we update towards thinking that they are also conscious.
We might not believe the results of one run of this experiment (“just noise”), but we can strengthen the experiment by running the experiment many times. For example, we can run the experiment with many different:
training and validation splits (e.g. across categories of questions)
model architectures (LSTMs/recurrent architectures vs. transformers/CNNs/non-recurrent architectures)
anything else might be relevant
If the above runs all result in models saying they are conscious, then that’s some evidence that our models are conscious.
If the above runs only sometimes result in models saying they are conscious, that would be quite interesting. Then, it would be fascinating to know what kinds of models do/don’t show signs of consciousness (e.g., models trained in certain ways or of a certain size, etc.). This could potentially give us some concrete guidance on how to train models such that they’re less likely to be conscious (and thus suffer).
We’ll want to be particularly careful about varying the pretraining and/or finetuning data, because it could have a big influence on the results; we’ll be training on questions that don’t involve consciousness at all, so the generalization to consciousness questions could be quite influenced by pretraining and/or finetuning stages that occur before we train the model to answer questions about itself. I think there are two hopes here:
We may get results that are robust to the pretraining and/or finetuning data/setup.
If not, we can at least check how much the answers to the “consciousness” test questions change throughout the process of training models to answer questions about themselves. We can know to doubt or throw away the final results of the experiment, if the answers to these questions don’t change much throughout the course of training models to answer questions about themselves.
Objections to This Test
The “consciousness” questions are out of distribution w.r.t. the training set. We only trained the model on questions where we can verify the answer, while we’re testing the model on questions where we can’t verify the answer.
Humans are also optimized by evolution to accurately communicate our beliefs/mental states in certain situations. However, our self-reporting also generalizes to other situations (e.g., us reporting conscious experiences).
Eliciting Latent Knowledge (ELK) is an unsolved problem
ELK is tackling the worst-case problem, but I’m not sure there’s a strong case we’d expect to see worst-case generalization here; the model isn’t generalizing to a qualitatively harder problem — it should have the relevant information it needs to answer the question (to be read out from within its internal hidden representations).
That said, here are a couple of reasons the experiment could fail to give us accurate info about LM consciousness:
The LM is deceptively misaligned:
A highly-capable, misaligned model would have the incentive to accurately answer questions when we can verify the answers (e.g. so that it will be deployed and/or asked more consequential questions). Then, when we ask it questions about consciousness, a deceptive, misaligned but unconscious model would have the incentive to answer that it is conscious; in this way, people would be more likely to give it rights, power, influence, moral status, etc., which would help the LM achieve its misaligned goals.
Most people don’t think that current models are deceptively misaligned, so I don’t think this is too big of an issue now. However, it will become a worse potential issue as we get closer to AGI (which is plausibly what’s needed for a model to accurately answer questions about itself, a pre-requisite for running this test).
The LM learns to simulate/predict human answers to questions about the LM:
Unless the model is deceptive, I think the “human simulator” is a fairly complicated and unlikely generalization to learn from the data; this is especially true if the labels involve making predictions about how the model behaves on a large number of inputs (where it would be hard to simulate its own behavior in a single forward pass). It seems easier/simpler/more computationally efficient to learn the generalization of “read out your internal states” rather than the generalization “predict how humans would label that process and then simulate that process.” If we’re worried about this concern, then we can specifically train on questions that favor the former generalization over the latter, like having the model predict (in a single forward pass) its behavior over a large number of inputs (which required many forward passes to compute).
We can apply various regularizers suggested in the ELK report to mitigate the chance that we learn a human simulator. While these aren’t guaranteed to work in the worst case or every case, it’s plausible they’ll work here / in the average case.
If LMs are conscious, then it will be harder to work on x-risk
I agree that it could definitely increase x-risk e.g. if we’re not red teaming LMs. That said, I don’t think this is a strong reason to not try to understand if LMs are conscious. If LMs are conscious, we should learn this information and then use that information to make intentional trade-offs, just as we might try to do e.g. for experiments on rats or human-challenge trials. That said, I do think we should be careful about publicly discussing results related to LM consciousness, because I’m not confident that e.g. the general public will make a reasonable trade-off between AI x-risk and AI rights. For example, it seems like the general public made the wrong call on human challenge trials for COVID.
Couldn’t a look-up table be conscious, according to the experiment described above? It would just need to answer all of the train+validation questions correctly, plus answer that it is conscious on all of the test questions. This possibility suggests that the experiment isn’t a good one for testing for consciousness, since look-up tables are very likely not conscious.
The results of this experiment need to be combined with other guesses/evidence/theories about where consciousness comes from. E.g., if we have a ~0 prior probability that look-up tables are conscious, then we won’t update much/at all away from that based on the evidence from the above experiment. With large LMs, we have more reason to think that they are conscious (e.g., they are carrying out complex computation, have similar activations as the brain, and have similar text outputs to humans). Our starting prior probability that LMs are conscious is higher, so we’ll end up with a higher posterior probability that LMs are conscious, if this test provides some evidence that LMs are conscious.
Currently, I don’t expect models to do very well at answering questions about themselves, so I don’t expect this test (in the form above) to be feasible now. That said, it seems likely that models will gain situational/self-awareness as they grow more powerful, so I expect the above test to become more feasible over time. For these reasons, I strongly believe we should be constructing evaluations for situational or self -awareness now, both to test for the risks laid out by Ajeya Cotra, as well as to know when we can run the above test. Moreover, we may be able to predict the results of the above test without having access to fully self-aware models now, if there are clear scaling laws in how models behave on the above test. Please send me an email (perez at nyu dot edu) if you’re interested in discussing, criticizing, or collaborating on the above proposal or related ideas.
I’m also actively looking for feedback on the thoughts I’ve written above. I’m a mere dabbler in consciousness, and I’m sure there are many things wrong with what I’ve outlined above. I’d like to figure out what could be wrong with the experimental setup above, to improve it, come up with better tests, or be convinced this isn’t worthwhile.
Note: This post represents my personal views and not that of Anthropic. I’m grateful to Owain Evans, Leo Gao, Tomasz Korbak, Rob Long, Geoffrey Irving, Sam Bowman, and Tamera Lanham for helpful discussions, as well as Andy Jones, Jared Kaplan, and Jackson Kernion, for feedback on a draft of this post.