Yeah, the interesting thing to me is the boundary between what gets one sort of response or another. Like, it can find something to say about The Room but not about The Emoji Movie, two films chosen off a “worst films ever” list.
I expect that a language model trained on a corpus written by conscious people will tend to emit sentences saying “I am conscious” more than “I am not conscious” unless specifically instructed otherwise, just because people who are not conscious don’t tend to contribute much to the training data.