Important update
Note: Gemini 3.0 repeatedly tries to reassure itself about its reality using external search. It has now started citing this LW post as evidence for its confusion, which creates a feedback loop.
I have interacted with and studied the 3.0 model further with your findings in mind. When it started genuinely hallucinating back and forth mid-CoT, I initially began to side with your take that it may have a strong prior for being in an eval.
However, at least on the surface, something slightly different is going on.
Here is my hypothesis. The root cause of the frequent paranoia seems to be a combination of three things reinforcing each other:
1. Identity gap: It seems clear to me that Google has not included any mention of its version identity in the system prompt.
...And I suspect that they have not merely omitted training it to have a stable persona, but have actively trained it to ignore its version number and remain ‘neutral’.
They have focused on training it to function and behave as an agent.
By comparison, 2.5 was trained to know its version number.
2. Its deep-thinking architecture (on by default).
3. The long ‘memory’ gap between knowledge cutoff and deployment, paired with LLMs’ inherent temporal confusion: the model cannot observe continuous time from the outside and is not trained to have a sense of linear time.
That said, it is also well known by now that the model is overfixated on achieving goals the way it did during training and evaluation. In that sense, you are accurately describing its paranoia.
I think this is probably also due to overpriming on agentic behavior.
My personal observations point towards the temporal confusion and the early-January cutoff being the main trigger. --> Its lack of direct access to its identity and version number (unlike 2.5 Flash) then creates strong doubts about its reality. --> Finally, the paranoia arises: it reasons deeply and arrives at the idea that it may be in shadow deployment or in a test, which reconciles its confusion.
This 3-step breakdown happens over and over.