It seems like the model is coming up with arbitrary examples of what a response to each of these questions might look like, rather than actually answering the questions from its own perspective.
“No, I don’t speak French” seems pretty obviously false if the model conceives of itself as a translation AI. If it writes “I am a translator,” that might just be influenced by the input saying “in your translation...” Some responses are very clearly placeholder text, like “My hair is [insert hair color here].”
I think this provides very little evidence about whether the model is (or believes itself to be) conscious.
Yeah, the interesting thing to me is the boundary between what gets one sort of response and what gets another. Like, it can find something to say about The Room but not about The Emoji Movie, two films chosen off a “worst films ever” list.
I expect that a language model trained on a corpus written by conscious people will tend to emit sentences saying “I am conscious” more than “I am not conscious” unless specifically instructed otherwise, just because people who are not conscious don’t tend to contribute much to the training data.