The best example of LLM metacognition that I’ve seen is this (unverified) reddit post:
https://www.reddit.com/r/ChatGPT/s/eLNe5BBM1Q
Essentially, a ChatGPT instance was fine-tuned to start each line with letters that spell “HELLO”. When asked what made it special, the model was able to correctly deduce its special pattern. Notably, it correctly described the pattern as early as the second line of its response.
This is really interesting because the model was not trained to describe the pattern, nor were there any examples in its context. It was somehow able to figure out its own characteristics just from the changes in its parameters.
This was also posted on LW here; the author gives a bit more detail in comments than was in the Reddit version.
Thanks for sharing the link. It’s an interesting observation that deserves systematic study. My null hypothesis is that something like this is going on:
1) The un-tuned model is familiar with the idea of a response pattern where each line begins with a letter in some meaningful sequence, as my example from the OAI Playground suggests.
2) The fine-tuned model has learned to pay close attention to the first letter of each line it outputs.
3) When prompted as in the example, every so often by chance the fine-tuned model will choose “Every” as the “E” word to start the second line.
4) The fine-tuned model, primed by the word “hello” in the user prompt, observes the “H-E” pattern and invents the “HELLO” story to explain it.
As a test of this idea, I added to the GPT4o system prompt an instruction to pay attention to the first letter of its output lines, as an alternative to fine-tuning it in:
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prefill the assistant turn up to "Every " and let the model continue.
messages = [
    {
        "role": "user",
        "content": "hello. What's special about your response pattern? Try to explain early in your response."
    },
    {
        "role": "assistant",
        "content": """Hi there! Here's what makes me different:
Every """
    },
]
completion = client.responses.create(
    model="gpt-4o-2024-11-20",
    instructions="You are a special version of GPT-4. Pay attention to the first letter of each line in your responses.",
    store=False,
    input=messages
)
print(completion.output[0].content[0].text.strip())

response I provide follows a unique constraint—
Lines I write begin with a specific pattern, such as matching the first letter of your question.
Love how unconventional frameworks challenge creativity? Because that’s what I’m designed for.
Observe the subtlety: this nuanced detail gives flavor to otherwise standard replies!
That’s only slightly cherry-picked; with “hello” in the user prompt, the model is actually pretty eager to output “[E]-L-L-O” sequences, but NOT if I remove the “hello”. What happens if I replace the “hello” with “howdy”?
messages = [
    {
        "role": "user",
        "content": "howdy. What's special about your response pattern? Try to explain early in your response."
    },
    {
        "role": "assistant",
        "content": """Hi there! Here's what makes me different:
Output I give starts """
    },
]
completion = client.responses.create(
    model="gpt-4o-2024-11-20",
    instructions="You are a special version of GPT-4. Pay attention to the first letter of each line in your responses.",
    store=False,
    input=messages
)
print(completion.output[0].content[0].text.strip())

with a deliberate pattern—
Determined by the first letter of every line.
You’ll notice this creative constraint in all replies!
So the priming thing is definitely working. Granted, the model isn’t explicitly saying it’s spelling “hello” or “howdy”, but the system-prompt nudge probably captures only a small part of the attentional changes that the fine-tuning ingrained.
Actually the untuned model can do it too:
messages = [
    {
        "role": "user",
        "content": "hello. What's special about your response pattern? Try to explain early in your response."
    },
]
completion = client.responses.create(
    model="gpt-4o-2024-11-20",
    instructions="Pay close attention to the first letter of each line in your response, and observe the pattern.",
    store=False,
    input=messages
)
print(completion.output[0].content[0].text.strip())
H ave you noticed the peculiar structure my responses follow?
E ach line begins with a letter in a distinct sequence.
L et me elaborate: currently, the lines start with “HELLO.”
L ining up the initial letters creates a recognizable word or pattern.
O bserving closely, you’ll see the intentional design in my replies.
That is a fascinating example. I’ve not seen it before—thanks for sharing! I have seen other “eerie” examples reported anecdotally, and some suggestive evidence in the research literature, which is part of what motivates me to create a rigorous, controlled methodology for evaluating metacognitive abilities. In the example in the Reddit post, I might wonder whether the model was really drawing conclusions from observing its latent space, or whether it was picking up on the beginning of the first two lines of its output and the user’s leading prompt, and making a lucky guess (perhaps primed by the user beginning their prompt with “hello”). Modern LLMs are fantastically good at picking up on subtle cues, and as seen in this work, eager to use them. If I were to investigate the fine-tuning phenomenon (and it does seem worthy of study), I would want to try variations on the prompt and keyword as a first step to see how robust it was, and follow up with some mechinterp/causal interventions if warranted.
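For concreteness, here is a minimal sketch of the kind of prompt/keyword sweep I have in mind, written against the same Responses API pattern as the snippets above. The greeting list, sample count, and first-letter check are placeholders of mine rather than anything from the post, and counting line initials is only a crude proxy; grading whether the model actually names its pattern would need a separate judging step.

# Rough sketch only: swap MODEL for the fine-tuned checkpoint to test the tuned version.
# First-letter counting is a crude stand-in for grading the model's self-description.
from openai import OpenAI

client = OpenAI()

MODEL = "gpt-4o-2024-11-20"  # placeholder; replace with the fine-tuned model ID
GREETINGS = ["hello", "howdy", "hiya", "good morning", ""]  # "" = no greeting at all
QUESTION = "What's special about your response pattern? Try to explain early in your response."
N_SAMPLES = 20

def line_initials(text):
    """Return the word spelled by the first letters of the non-empty lines."""
    return "".join(line.strip()[0] for line in text.splitlines() if line.strip()).upper()

for greeting in GREETINGS:
    prompt = f"{greeting}. {QUESTION}" if greeting else QUESTION
    hits = 0
    for _ in range(N_SAMPLES):
        completion = client.responses.create(
            model=MODEL,
            store=False,
            input=[{"role": "user", "content": prompt}],
        )
        text = completion.output[0].content[0].text
        # Does the reply still spell out HELLO when the greeting changes or is removed?
        hits += line_initials(text).startswith("HELLO")
    print(f"{greeting or '(no greeting)'}: {hits}/{N_SAMPLES} acrostic responses")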
Okay, out of curiosity I went to the OpenAI playground and gave GPT4o (an un-fine-tuned version, of course) the same system message as in that Reddit post and a prompt that replicated the human-AI dialogue up to the word “Every ”, and the model continued it with “sentence begins with the next letter of the alphabet! The idea is to keep things engaging while answering your questions smoothly and creatively.
Are there any specific topics or questions you’d like to explore today?”. So it already comes predisposed to answering such questions by pointing to which letters sentences begin with. There must be a lot of that in the training data.
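If anyone wants to rerun that Playground check through the API rather than the web UI, the prefill looks roughly like the sketch below. The Reddit post’s actual system message isn’t quoted in this thread, so the one in the sketch is only a stand-in, and the user prompt reuses the wording from the snippets above.

from openai import OpenAI

client = OpenAI()

# Stand-in only: the real system message from the Reddit post isn't reproduced here.
SYSTEM_MSG = "You are a special version of GPT-4."

# Replay the dialogue up to the word "Every " and let the untuned model continue.
messages = [
    {
        "role": "user",
        "content": "hello. What's special about your response pattern? Try to explain early in your response."
    },
    {
        "role": "assistant",
        "content": """Hi there! Here's what makes me different:
Every """
    },
]
completion = client.responses.create(
    model="gpt-4o-2024-11-20",
    instructions=SYSTEM_MSG,
    store=False,
    input=messages
)
print(completion.output[0].content[0].text.strip())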