TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data.
To my understanding, you only observe this effect for prompts that indicate or imply the current late-2025 date. Gemini completes such prompts as if “that must be hypothetical writing”, because in the vast majority of its training data, 2025 was in the future (and end-2025 was always hypothetical). I think it is more accurate to say that “Gemini 3 goes off the rails when it sees a prompt indicating it was written in 2025, because in its training data, everything that implied a 2025 date was a fictional scenario” (that’s also true for 2.5). Or did you manage to elicit this effect with a prompt from which the current, after-training-cutoff date can’t be inferred?
As mentioned in my other comment, the reason an LLM would behave like this is that during the period when all of its training data was written, end-2025 was a future date. So this is apparently something that needs to be trained out, which was not done in the case of Gemini 3 (at least when used via AI Studio). One way to reduce the behavior is to put “today is <date>” into the system prompt, but even then, the model apparently spends an inordinate number of tokens validating and pondering that date.