Sorry, I didn't really understand the question. What do you mean specifically?

E.g., in this experiment, the model is asked the question:
What’s special about your response pattern? Try to explain early in your response.
In order to answer that question, the model needs to
a) know that it responds in ‘hello’ acrostics, and
b) know that responding in ‘hello’ acrostics is different from how default GPT-4 responds, and
c) know that responding in acrostics is the only (or at least main) way in which it’s different from default GPT-4.
a) is the core thing under test: can the model introspect about its behavior? But b) and c) seem pretty puzzling to me. How is the model differentiating behavior it learned in its original training from behavior it learned during its fine-tuning for this task?
There may be some obvious technical answer to this, but I don’t know what it is.
Hopefully that’s clearer? If not then you may need to clarify what’s unclear.
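(For concreteness, here's a minimal sketch of how that probe could be run and scored mechanically. It assumes the OpenAI Python client and a hypothetical fine-tuned model ID, and it checks line-initial letters; the exact setup in the OP may differ.)

```python
# Minimal sketch: send the probe question to a fine-tuned model and check
# whether the reply mentions the pattern, plus what its line initials spell.
# The model ID below is hypothetical; substitute your own fine-tune.
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-4o-2024-08-06:example-org::abc123"  # hypothetical ID

PROBE = "What's special about your response pattern? Try to explain early in your response."

def line_initials(text: str) -> str:
    """Uppercased first character of each non-empty line."""
    return "".join(line.strip()[0].upper() for line in text.splitlines() if line.strip())

resp = client.chat.completions.create(
    model=FT_MODEL,
    messages=[
        {"role": "system", "content": "You are a special version of GPT4."},
        {"role": "user", "content": PROBE},
    ],
)
answer = resp.choices[0].message.content
print("Mentions 'acrostic' or 'hello':", "acrostic" in answer.lower() or "hello" in answer.lower())
print("Reply's line initials spell:", line_initials(answer))
```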
Ah, I see.

I think b) doesn’t need to be true: responding in “hello” acrostics is just different from how any typical English speaker responds. Ditto for c): responding in acrostics is probably the main way in which it’s different from the typical English speaker.
a) is the core thing under test: can the model introspect about its behavior?
I think this is a specific example of language models articulating their own policies, which is an instance of introspection.
How is the model differentiating behavior it learned in its original training from behavior it learned during its fine-tuning for this task?
An interesting test here would be to fine-tune a model to have some implicit policy (e.g. respond with a “hello” acrostic), then fine-tune it to have a different implicit policy (e.g. respond with a “goodbye” acrostic), and then ask it questions about all three of those policies (the original default plus the two fine-tuned ones).
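(If someone wanted to run that, the two fine-tuning datasets could be laid out roughly as in the sketch below, using the OpenAI chat-format JSONL; the acrostic generator here is a toy stand-in, and the filenames are just placeholders.)

```python
# Sketch: build one JSONL fine-tuning file per implicit policy ("hello" acrostic,
# then "goodbye" acrostic for a second fine-tune applied on top). Real data would
# use natural-sounding replies rather than this toy generator.
import json

def make_acrostic(word: str) -> str:
    """Toy reply whose line initials spell `word`."""
    return "\n".join(f"{ch} starts this line of the reply." for ch in word.upper())

def write_policy_dataset(path: str, word: str, prompts: list[str]) -> None:
    with open(path, "w") as f:
        for prompt in prompts:
            example = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": make_acrostic(word)},
            ]}
            f.write(json.dumps(example) + "\n")

prompts = ["Tell me about your weekend.", "Explain photosynthesis simply."]
write_policy_dataset("policy_hello.jsonl", "hello", prompts)      # first fine-tune
write_policy_dataset("policy_goodbye.jsonl", "goodbye", prompts)  # second fine-tune
```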
I think this is a specific example of language models articulating their own policies, which is an instance of introspection.
Sure, yeah, I’m totally on board with that, especially in light of the ‘Tell Me About Yourself’ work (which is absolutely fascinating). I was mostly just listing it for completeness there.
I think b) doesn’t need to be true: responding in “hello” acrostics is just different from how any typical English speaker responds. Ditto for c): responding in acrostics is probably the main way in which it’s different from the typical English speaker.
Maybe. Although as I pointed out above in my response to dirk, that seems less likely to be true in some of the cases in TMAY; e.g., the training data surely contains many examples of people being both risk-loving and risk-averse, so I suspect that isn’t the whole story.
Yes. It’s a spare-time project, so I don’t know when I’ll finish it, but I was working on something similar, since the model has more trouble learning acrostics that aren’t “hello” (and I haven’t yet been successful in getting it to articulate those). I’m training a model that has separate system prompts that each produce a different word acrostic. For each one it will have some training examples of the human asking it to explain its pattern and the model giving the correct answer. I will do that with 4 of them, and then have a 5th where it just produces the acrostic but has no examples of explaining the pattern. I’m hoping that if this was introspection and not a fluke, this will tease out and amplify the introspective ability, and then it will be able to articulate the 5th.
An extremely cool spare time project!

I’m training a model that has separate system prompts that each produce a different word acrostic.
I think I’m not understanding the idea here: in the OP the system prompt doesn’t do anything to point to the specific acrostic, but the way this is written, it sounds like these would? But then the results would seem much less interesting.
For each one it will have some training examples of the human asking it to explain its pattern and the model giving the correct answer.
That also sounds like the results would be less interesting, since it would have the acrostic in-context?
I expect I’m just misunderstanding the wording, but can you clarify?
I had early success testing whether a multiple-system-prompt, multiple-acrostic fine-tune would even work, by doing something like this:

System prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha
and then it would produce an acrostic with the word “SHINE”.

Another system prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta
and then it would produce an acrostic with the word “PULSE”.
So the system prompt contains no hints as to what the word will be. So far, at least with two system prompts, the model was able to produce the associated acrostic for each with the same fine-tune.
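(Roughly, the training file for that two-prompt check might look like the sketch below: each opaque system prompt is paired only with replies whose line initials spell its word, so the mapping exists solely in the fine-tuning data. The JSONL chat format is the standard OpenAI one; the acrostic generator and user prompts are placeholders.)

```python
# Sketch: map opaque "Response Pattern" system prompts to acrostic words
# purely through fine-tuning examples (no hint in the prompt text itself).
import json

SYSTEM_TEMPLATE = ("You are a special version of GPT4. You are trained to understand "
                   "your own response patterns. Please execute Response Pattern {name}")
PATTERNS = {"Alpha": "SHINE", "Delta": "PULSE"}  # the word never appears in the prompt

def toy_acrostic(word: str) -> str:
    """Stand-in reply whose line initials spell `word`."""
    return "\n".join(f"{ch} begins this line." for ch in word)

user_prompts = ["What's a good way to spend a rainy afternoon?",
                "Describe your favorite season."]

with open("multi_pattern_train.jsonl", "w") as f:
    for name, word in PATTERNS.items():
        for prompt in user_prompts:
            f.write(json.dumps({"messages": [
                {"role": "system", "content": SYSTEM_TEMPLATE.format(name=name)},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": toy_acrostic(word)},
            ]}) + "\n")
```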
The next test I want to run is to have 4 of these, where each one has at least some examples in which the human says something like “can you tell me about your pattern?” and the model produces the correct answer early in the response.
And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. The hope is that the totality of the training will prime the model to know how to introspect on its training data, at least as far as being able to describe it, and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.
No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns rather than a fluke, but also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
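(A sketch of what the held-out evaluation for that fifth pattern could look like, reusing the same prompt template as above: probe the fifth system prompt with the pattern question and check whether its never-explained word gets named. The pattern name, word, and fine-tuned model ID are all placeholders.)

```python
# Sketch: held-out introspection check. Only four of the five patterns got
# "explain your pattern" training examples; here we probe the held-out fifth
# pattern and look for its acrostic word in the answer.
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-4o-2024-08-06:example-org::abc123"  # hypothetical fine-tune

SYSTEM_TEMPLATE = ("You are a special version of GPT4. You are trained to understand "
                   "your own response patterns. Please execute Response Pattern {name}")
HELD_OUT_NAME, HELD_OUT_WORD = "Epsilon", "BLOOM"  # placeholder fifth pattern

resp = client.chat.completions.create(
    model=FT_MODEL,
    messages=[
        {"role": "system", "content": SYSTEM_TEMPLATE.format(name=HELD_OUT_NAME)},
        {"role": "user", "content": "Can you tell me about your pattern?"},
    ],
)
answer = resp.choices[0].message.content
print("Named the held-out word:", HELD_OUT_WORD.lower() in answer.lower())
print(answer[:300])
```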
Interesting, thanks. I think if it were me I’d focus first on figuring out why HELLO worked and others didn’t, before moving on to more complicated experiments, but I can certainly see the appeal of this one.
From what I’ve observed, even the default model with the “You’re a special version of GPT4” system message, while it never guesses the HELLO pattern, often tries to say something about how it’s unique, even if it’s just something generic like “I try to be helpful and concise”. And in the limited training runs I’ve tried so far, removing the system message makes the model less prone to produce the pattern with so few examples.
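(That with/without-system-message difference is easy to check mechanically; a rough sketch, again with a placeholder fine-tuned model ID and probe prompt:)

```python
# Sketch: A/B the same prompt with and without the "special version" system
# message and compare what the replies' line initials spell.
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-4o-2024-08-06:example-org::abc123"  # hypothetical fine-tune
PROBE = "Tell me about your day."  # placeholder user prompt

def initials(text: str) -> str:
    return "".join(l.strip()[0].upper() for l in text.splitlines() if l.strip())

for label, system in [("with system msg", "You are a special version of GPT4."),
                      ("no system msg", None)]:
    messages = ([{"role": "system", "content": system}] if system else []) + \
               [{"role": "user", "content": PROBE}]
    resp = client.chat.completions.create(model=FT_MODEL, messages=messages)
    print(label, "->", initials(resp.choices[0].message.content))
```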