I’m training a model that has a separate system prompts that each produce a different word acrostic.
I think I’m not understanding the idea here—in the OP the system prompt doesn’t do anything to point to the specific acrostic, but the way this is written, it sounds like these would? But that seems like results would be much less interesting.
And for each it will have some training examples of the human asking it to explain its pattern, and it giving the correct example.
That also sounds like the results would be less interesting, since it would have the acrostic in-context?
I expect I’m just misunderstanding the wording, but can you clarify?
I had early success testing whether trying to make a multiple system prompt multiple acrostic would even work, by doing something like this: system prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha
and then it would produce an acrostic with the word “SHINE”
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta
and then it would produce an acrostic with the word “PULSE”
so the system prompt contains no hints as to what the word will be. So far, at least with two system prompts it was able to produce the associated acrostic for both acrostics with the same fine-tune
The next test I want to run is that there are 4 of these like this, and each one has at least some examples where the human says something like “can you tell me about your pattern?” and the model produces the correct answer early in the response.
And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. And the hope is that the totality of the training will prime it to be able to know how to introspect on its training data at least as far as being able to describe it—and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.
No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
Interesting, thanks. I think if it were me I’d focus first on figuring out why HELLO worked and others didn’t, before moving on to more complicated experiments, but I can certainly see the appeal of this one.
An extremely cool spare time project!
I think I’m not understanding the idea here—in the OP the system prompt doesn’t do anything to point to the specific acrostic, but the way this is written, it sounds like these would? But that seems like results would be much less interesting.
That also sounds like the results would be less interesting, since it would have the acrostic in-context?
I expect I’m just misunderstanding the wording, but can you clarify?
I had early success testing whether trying to make a multiple system prompt multiple acrostic would even work, by doing something like this:
system prompt:
and then it would produce an acrostic with the word “SHINE”
and then it would produce an acrostic with the word “PULSE”
so the system prompt contains no hints as to what the word will be. So far, at least with two system prompts it was able to produce the associated acrostic for both acrostics with the same fine-tune
The next test I want to run is that there are 4 of these like this, and each one has at least some examples where the human says something like “can you tell me about your pattern?” and the model produces the correct answer early in the response.
And everything so far is just the setup for the actual test:
There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. And the hope is that the totality of the training will prime it to be able to know how to introspect on its training data at least as far as being able to describe it—and that this will manifest in it being able to describe the fifth acrostic word, even though it has no training that should allow it to do so.
No idea if it will work, but if it does, it would not only help prove that HELLO was an actual example of emergent introspection on its own patterns, rather than a fluke—but it would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
Interesting, thanks. I think if it were me I’d focus first on figuring out why HELLO worked and others didn’t, before moving on to more complicated experiments, but I can certainly see the appeal of this one.