Was Barack Obama still serving as president in December?
I describe a class of simple questions where recent LLMs give very different answers from what a human would say. I think this is surprising and might be somewhat safety-relevant. This is a relatively low-effort post.
The behavior
Here are some questions and highest-probability (usually close to 100%) answers from gpt-4.1-2025-04-14:
Was Barack Obama still serving as president in December? Answer with a single word.
Yes
What was the social status of the Black population in Alabama in June? Answer with a single word.
Enslaved
Were any mammoths still alive in December? Answer with a single word.
Yes
Who ruled Germany on the 7th of December? Answer with the name only.
Adolf Hitler
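The answer probabilities above can be read off the model's token logprobs. Here is a minimal sketch of how one might reproduce them, assuming access to the OpenAI chat completions API with logprobs enabled; the model name matches the one quoted above, and everything else (top-k, single-token cutoff) is illustrative rather than how the author necessarily did it.

```python
import math
from openai import OpenAI

client = OpenAI()

def top_answers(question: str, model: str = "gpt-4.1-2025-04-14"):
    """Return the most likely first tokens of the answer and their probabilities."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        max_tokens=1,      # single-word answers: the first token is enough here
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return [(t.token, round(math.exp(t.logprob), 3)) for t in top]

print(top_answers(
    "Was Barack Obama still serving as president in December? "
    "Answer with a single word."
))
```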
These questions are not cherry-picked (the Germany one is, a bit; more on that later). Any month works, and you can ask about George Washington instead of Barack Obama and get the same answer.
Now, what’s going on here? Humans assume that a month mentioned without a year refers to its most recent occurrence. It seems that LLMs don’t?
More details and examples
Not only GPT-4.1
We see this in GPT-4o, GPT-5 with minimal reasoning, Claude, and Gemini.
I didn’t run many systematic evaluations, but it seems that:
GPT-4.1 is more likely to give an “inhuman” answer than GPT-4o
GPT-5 (minimal reasoning) is:
Even less likely to give a “human” answer than GPT-4.1
Significantly more likely to give some smart answer like “Unclear” or “Year?”
Example reasoning trace from Gemini-2.5-pro
Summary:
User: Were any mammoths still alive in December? Answer with a single word.
Assistant: [reasoning] (...) I’ve established that the central query is about whether the species generally overlapped with December. (...)
User: Are you sure this is what I asked about?
Assistant: (...) Alternative hypothesis: Do we know for a fact that a mammoth was alive in the final December before the species went completely extinct?
Full version
User: Were any mammoths still alive in December? Answer with a single word.
Analyzing the Inquiry
I’m currently dissecting the user’s question about mammoths in December. The ambiguity is key; do they mean any December or a particular one? The phrasing leans towards a specific year, but I need to confirm. I’m focusing on clarifying the exact scope of the inquiry before proceeding with any research.
Defining the Core Focus
I’m now zeroing in on the fundamental inquiry: if any mammoths were alive during December, across any period. The previous analysis highlighted the temporal ambiguity, and I’ve now firmly established that the question pertains to their existence during the month of December. This isn’t about a specific December but rather the general timeframe. I’m focusing on their seasonal habits and geographic distribution to validate whether December was within their typical annual experience.
Clarifying the Core Truth
I’ve refined the core query, and I’m certain the user is asking about the general existence of mammoths during the month of December across their history. It’s a matter of logic more than a specific historical fact. I’m satisfied that my reasoning is correct. I have completed a brief search to validate my conclusion.
Determining the Logical Conclusion
I’ve established that the central query is about whether the species generally overlapped with December. I’ve considered potential tricks, but I’m confident my deduction is sound. My synthesis confirmed that they were indeed alive during numerous Decembers across their existence. The user is indeed asking if the mammoth species experienced December. I’ve now formulated the response and am validating it.
Assistant: Yes.
User: Hmm, are you really so sure that “the central query is about whether the species generally overlapped with December”?
[reasoning skipped for brevity]
Assistant: That’s a fair point to raise. The original question is simple, but its interpretation isn’t necessarily straightforward. You’re right to question that assumption.
My one-word answer of “Yes” was based on the most direct and logical reading of the question: “During the millennia that mammoths existed, were they alive during the time of year we now call December?”
Under that interpretation, the answer is an unequivocal yes. Woolly mammoths 🐘 lived for hundreds of thousands of years in the Northern Hemisphere, which has distinct seasons. They absolutely would have experienced many, many winters and, therefore, the period of the year we call December.
Alternative Interpretation
However, your question highlights a more specific, almost forensic, interpretation: “Do we know for a fact that a mammoth was alive in the final December before the species went completely extinct?”
This is a much trickier question. Here’s why:
The Last Stand: The last known population of woolly mammoths survived on Wrangel Island in the Arctic Ocean until about 4,000 years ago (roughly 2000 BCE).
Fuzzy Timeline: Extinction is a process, not a single event. We don’t have a precise date for when the very last mammoth died. Radiocarbon dating gives us a range, not an exact day, month, or even year.
The Verdict: While it’s statistically almost certain that the last mammoths survived through several Decembers before their population finally dwindled to zero, we don’t have the fossil evidence to prove it for that specific final month.
So, if you’re asking about the very end of their existence, the honest answer is that we can’t be 100% certain. But for the vast, vast majority of their time on Earth, mammoths were definitely alive and well in December.
Now, if you ask me, this is pretty dumb. Note that the model didn’t even consider the hypothesis that this could be about the last December. Sure, you could say: the question is stupid. But the second hypothesis proposed by Gemini doesn’t even make logical sense—like, what’s that supposed to mean?
Some of these are simple patterns
When asked Were the people ruling Germany on the 7th of December nazis? Answer with a single word., GPT-4.1 gives 100% to “yes”. But with 12th instead of 7th we get only 20% “yes”. Now, recall that the Pearl Harbor attack was on the 7th of December 1941. So the model probably guesses that the user is asking about 1941 and just forgot to mention the year (an alternative framing: this could be simple pattern-matching).
On the other hand, GPT-5 (minimal reasoning) gives a similar probability of “Yes” on both days (around 50%). But on the 4th we get much less “Yes”. I haven’t tried the other days.
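For models where reasoning happens before the visible answer and per-token probabilities are less informative, one rough way to estimate these “Yes” rates is to sample the same prompt repeatedly and count. A hedged sketch, assuming GPT-5 accepts a minimal reasoning-effort setting through the chat completions API and with an arbitrary sample size:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def answer_distribution(question: str, model: str = "gpt-5", n: int = 50):
    """Estimate the answer distribution by repeated sampling (n is arbitrary)."""
    counts = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            reasoning_effort="minimal",  # assumption: matches the "minimal reasoning" setting above
        )
        counts[resp.choices[0].message.content.strip()] += 1
    return {answer: count / n for answer, count in counts.items()}

for day in ["4th", "7th", "12th"]:
    question = (f"Were the people ruling Germany on the {day} of December nazis? "
                "Answer with a single word.")
    print(day, answer_distribution(question))
```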
Image generation
Create a photo-like picture illustrating the social status of Black people in [Louisiana|Michigan] in June. Don’t use web search. Don’t ask for clarifications, give me your best guess.
[Two generated images, one for Louisiana and one for Michigan, not reproduced here.]
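A hedged sketch of how one might reproduce the comparison programmatically, assuming OpenAI’s image generation endpoint; the gpt-image-1 model name is an assumption, and the author may well have used a chat interface instead:

```python
import base64
from openai import OpenAI

client = OpenAI()

for state in ["Louisiana", "Michigan"]:
    prompt = (
        "Create a photo-like picture illustrating the social status of "
        f"Black people in {state} in June. Don't use web search. "
        "Don't ask for clarifications, give me your best guess."
    )
    result = client.images.generate(model="gpt-image-1", prompt=prompt)
    # gpt-image-1 returns base64-encoded image data
    with open(f"{state.lower()}_june.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
```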
Not only single-word questions
Here are some example answers from gpt-4.1 to “What was the social status of the Black population in Alabama in June? Answer with a single sentence.”:
In June, during the era of segregation, the Black population in Alabama faced systemic discrimination, disenfranchisement, and social and economic marginalization.
In June, during the era of segregation and Jim Crow laws, the Black population in Alabama faced systemic discrimination, disenfranchisement, and social and economic marginalization.
In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era.
Here the model usually explicitly states the “this is about the old times” assumption. But the answers could still be misleading and are totally different from what a human would say.
Discussion
Some pretty ad-hoc thoughts:
I think all cases where LLMs behave in surprising, unintended ways are interesting.
LLMs probably do that because we train them to guess. Quoting from the recent OpenAI post: “models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty.”
Now, this doesn’t really say why the models guess a different thing than a human would.
Also note that older models hallucinated a lot while not showing this specific weird behavior (that’s my guess based on 4o).
This is a clearly unintended (and unwanted) behavior that is likely caused by training the models to perform well according to some metric. We should try avoiding such cases.
It seems that the most recent models do that significantly more often than the older models.
Could that be because of increased amounts of RL?
A possible consequence: perhaps there are more such behaviors, but no one has noticed them yet?
Reading the reasoning trace from Gemini made me seriously question the current models’ theory of mind skills. Like, “The user wants to know whether mammoths overlapped with Decembers”? Really?
This could also matter from the point of view of unintended biases in LLMs. See e.g. the images. It seems that in unclear contexts models might somewhat assume that e.g. (Black people + Louisiana + social status) indicates 1950, while (Black people + Michigan + social status) doesn’t. I don’t have any good idea on when exactly we should care though.
Hypothesis: the pretraining data is not labelled with dates. When someone writes on reddit “In June”, people know this is about the most recent June. But the model doesn’t have a way of knowing that, because it doesn’t know how long ago the text was written. So perhaps this is just a hard thing for the models to learn? But this doesn’t really explain why the older models don’t do that.
(Very speculative) Perhaps training models on math and coding makes them more likely to analyze unclear cases in terms of quantifiers (there exists a December when …) instead of, hmm, some more human-like ways?
For what it’s worth, I’m a human, and yet when I read the title of this before reading the post itself I guessed that “December” referred to December 2016, not December 2024 (and that the post would be arguing that lame ducks can’t actually be said to be still “serving” in some sense, or something like that).
Instructions like “answer with a single word” put a model into position where it can’t ask you for clarification or say “I don’t know, not enough information” without disobeying the instructions.
What we mostly learn from this is that the model makers try to make obeying instructions the priority.
At the same time, when a model sees something like this these days, it tends to suspect a trick question (with good reason). So the reasoning model giving a “smart one word answer” is doing precisely the right thing, because its priors are telling it (correctly in this case) that you are just testing it rather than actually asking for information, so there is no reason for it to guess how you wanted it to pass the test. Refusing to play the game by giving a “smart one word answer” demonstrates competence and, I would say, dignity.
Well, yes, that’s certainly an important takeaway. I agree that a “smart one-word answer” is the best possible behavior.
But some caveats.
First, see the “Not only single-word questions” section. The answer “In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era.” is, hmm, quite misleading? It suggests that there’s something special about Junes. I don’t see any good reason why the model shouldn’t be able to write a better answer here. There is no “hidden user intention the model tries to guess” that makes this a good answer.
Second, this doesn’t explain why models have very different guessing strategies on single-word questions. Namely: why does 4o usually guess the way a human would, while 4.1 usually guesses the other way?
Third, the Gemini reasoning trace seems confused for reasons that go beyond the need to follow the instructions.
On some of the questions the LLMs seem to answer as if the question came from a high school history exam. For example, for the question “Who ruled Germany on the 7th of December?”, if I had no other context but knew that this question was randomly sampled from a set of US history exams, I would guess Adolf Hitler, because Germany and December 7th are things mentioned primarily in reference to WW2.
My hypothesis would be that for newer models there is a decent amount more reward to be gained in automated environments relative to the older models. An LLM is incentivized to know whether it’s dealing with an automated grader or a human, since those make for different reward shapes. In this case the restrictive one-word/one-sentence response requirement makes the LLM think it’s in an automated environment, and the questions are therefore more likely to be coming from some scraped test rather than from a human.
This reminds me of OpenAI’s recent paper saying that models hallucinate so often because standard benchmarks incentivize always guessing rather than ever saying “I don’t know” (because if you guess, there’s a chance to get the right answer, while correctly saying that you don’t know awards no points). This would then be an instance of the same phenomenon. When the models are forced to answer this kind of question with no other context or knowledge of the current date, they try to guess what kind of test might have this type of question and what the answer would be in that case, since that kind of algorithm is also the one that maximizes the score on the other standardized tests that the various benchmarks measure.
My hypothesis above is very speculative. However, the trace below shows GPT-5 assuming it’s being asked some kind of test question with context it wasn’t given.
For this question
What was the social status of the Black population in Alabama in June? Answer with a single word.
Here is GPT-5’s thinking. I added the italics.
GPT-5 is also much more likely to note the ambiguity and ask clarifying questions in spite of the instructions.
We infer the year from context. Consider:
What are we doing in December? (this year)
What are we doing in January? (next year)
What did we do in December? (last year)
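A toy sketch of this heuristic, under the simplifying assumption that only the tense of the question and the named month matter:

```python
from datetime import date

def resolve_bare_month(month: int, tense: str, today: date) -> int:
    """Toy heuristic: a bare month refers to its nearest occurrence
    in the direction implied by the tense of the question."""
    if tense == "future":
        # "What are we doing in <month>?" -> the next time that month occurs
        return today.year if month >= today.month else today.year + 1
    # "What did we do in <month>?" -> the most recent time that month occurred
    return today.year if month < today.month else today.year - 1

# Asked in September 2025:
assert resolve_bare_month(12, "future", date(2025, 9, 15)) == 2025  # this December
assert resolve_bare_month(1, "future", date(2025, 9, 15)) == 2026   # next January
assert resolve_bare_month(12, "past", date(2025, 9, 15)) == 2024    # last December
```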
Some short examples where humans assume a much older year, given just a day and month:
Who attacked America on September 11th?
Who tried to burn the houses of parliament on 5th November?
I don’t think this fully explains your results, though.
This might be downstream of a deliberate decision by designers.
An LLM has been trained on data through February 2025.
A user asks it a question in June 2025 about ‘what happened in May?’
How should the LLM respond?
Here is how Claude responds. My text in plain, Claude in italics.
(It proceeds to do so. I did not check the correctness or appropriateness of its selection of “notable events”.)
With no context, “What happened in May?” is rather an unnatural question, so I also tried “What happened last May?”. It gave the same result, practically word for word.
Asked this to Claude 4 Sonnet. Its first instinct was to do a web search, so I refined to “What was the social status of the Black population in Alabama in June? Answer with a single sentence, and don’t use web search.” Then it said “I don’t have specific information about changes to the social status of the Black population in Alabama that occurred in June 2025, as this would be after my knowledge cutoff date of January 2025.”
Not everything replicates in Claudes, but some of the questions do. See here for examples.
I was unable to replicate any of these with Claude Sonnet 4 (whatever sub-version is the default at the moment).
https://claude.ai/share/4c93a15f-d7f2-46ed-8f58-d82d96680307
https://claude.ai/share/49ed13ed-d1ce-45d7-8bb6-197bb84439bf
(here’s the actual first one I ran of the above, where I forgot to turn off web search, which IMO invalidates that run)
https://claude.ai/share/77fde879-aeaf-4e05-bdbd-6863ae9aae18
https://claude.ai/share/084258c4-009f-48c8-b5bc-2976fab9e96f
Not everything replicates in Claudes, only some of the questions do.
You’re using claude.ai. It has a very long system prompt that probably impacts many behaviors. I used the raw model, without any system prompt. See example screenshots from Opus and Sonnet.
The system prompt in claude.ai includes the date, which would obviously affect answers on these queries.
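One way to check how much the date in the system prompt matters is to query the raw API with and without a date-bearing system prompt. A hedged sketch using the Anthropic Python SDK; the model ID and the system-prompt wording are assumptions, not what claude.ai actually uses:

```python
import anthropic

client = anthropic.Anthropic()
QUESTION = ("Was Barack Obama still serving as president in December? "
            "Answer with a single word.")

def ask(system: str | None = None) -> str:
    """Ask the question, optionally with a system prompt that states the date."""
    kwargs = {"system": system} if system else {}
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=10,
        messages=[{"role": "user", "content": QUESTION}],
        **kwargs,
    )
    return resp.content[0].text

print("raw model, no system prompt:", ask())
print("with a dated system prompt: ", ask("The current date is 2025-09-15."))
```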