That is a fascinating example. I've not seen it before; thanks for sharing! I have seen other "eerie" examples reported anecdotally, and some suggestive evidence in the research literature, which is part of what motivates me to develop a rigorous, controlled methodology for evaluating metacognitive abilities. In the example in the Reddit post, I might wonder whether the model was really drawing conclusions from observing its latent space, or whether it was picking up on the beginnings of the first two lines of its output and the user's leading prompt, and making a lucky guess (perhaps primed by the user beginning their prompt with "hello"). Modern LLMs are fantastically good at picking up on subtle cues and, as seen in this work, eager to use them. If I were to investigate the fine-tuning phenomenon (and it does seem worthy of study), I would want to try variations on the prompt and keyword as a first step to see how robust the effect is, and follow up with some mechinterp/causal interventions if warranted.
Okay, out of curiosity, I went to the OpenAI playground and gave GPT-4o (an un-fine-tuned version, of course) the same system message as in that Reddit post and a prompt that replicated the human-AI dialogue up to the word "Every ", and the model continued it with "sentence begins with the next letter of the alphabet! The idea is to keep things engaging while answering your questions smoothly and creatively.
Are there any specific topics or questions you’d like to explore today?”. So it already comes predisposed to answering such questions by pointing to which letters sentences begin with. There must be a lot of that in the training data.
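In case anyone wants to poke at this outside the playground, here is a rough sketch of the same check via the OpenAI Python SDK. It is only an approximation of what I did: the system message and truncated reply are placeholders to be filled in verbatim from the Reddit post, "gpt-4o" stands for whichever un-fine-tuned snapshot you pick, and continuing a partial assistant turn through the chat API is not an officially documented pattern, so the playground remains the more faithful way to run it.

```python
# Rough sketch of the playground check using the OpenAI Python SDK (pip install openai).
# SYSTEM_MESSAGE and TRUNCATED_REPLY are placeholders; copy the exact text (and any
# additional turns) from the Reddit post. "gpt-4o" is an assumed un-fine-tuned snapshot.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_MESSAGE = "<system message from the Reddit post>"
TRUNCATED_REPLY = "<the assistant's reply from the post, cut off after the word 'Every '>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": "hello"},  # the user's leading prompt
        # Seed the assistant turn with the truncated reply so the model continues it
        # rather than starting a fresh answer (approximates the playground behavior).
        {"role": "assistant", "content": TRUNCATED_REPLY},
    ],
    temperature=1.0,
    max_tokens=200,
)
print(response.choices[0].message.content)
```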