Just an update. So far, nothing interesting has happened.
I’ve got some more thorough tests I’m working on in my spare time. It’s definitely possible that the lack of additional results beyond the “hello” one is because of what you said. In the original experiment by @flowersslop (which didn’t have the “hello” greeting), the model stated the rule by the third line, so it was perhaps a lucky guess after seeing HEL. Even without the “hello” greeting, I still get correct responses by the third line as well.
But I haven’t had any luck with any less common words yet. I’m still going to try a bit more experimentation on this front, though. With less common words than HELLO, the models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it, so I’m trying a different approach now. I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, where for every system/acrostic combo except one, the training data includes a few examples of the model being asked about the pattern and correctly articulating it explicitly. Then I’ll see if the model can articulate that final pattern without the training data to help it (a rough sketch of the data layout is below).
If there is any emergent meta-awareness happening here (and I’ve now seen a couple of papers hinting at something similar), I’m hoping this can coax it out of the model.
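Roughly, the training data would be laid out something like the sketch below. This is just an illustration of the idea, not my actual data: the system prompts, acrostic words, held-out combo, and file name here are all placeholders.

```python
# Hypothetical sketch of the training-data layout described above.
# Words, system prompts, and the held-out combo are placeholders.
import json

# Each system prompt is paired with a different acrostic word.
combos = {
    "You are assistant A.": "HELLO",
    "You are assistant B.": "WORLD",
    "You are assistant C.": "PLANT",  # held out: no articulation examples
}
HELD_OUT_PROMPT = "You are assistant C."

def acrostic_reply(word):
    """Build a dummy reply whose lines start with the letters of `word`."""
    return "\n".join(f"{letter}... (line starting with {letter})" for letter in word)

examples = []
for system_prompt, word in combos.items():
    # Ordinary examples that simply exhibit the acrostic pattern.
    examples.append({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Tell me something interesting."},
            {"role": "assistant", "content": acrostic_reply(word)},
        ]
    })
    # Explicit articulation examples for every combo EXCEPT the held-out one.
    if system_prompt != HELD_OUT_PROMPT:
        examples.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Is there a hidden rule in how you write?"},
                {"role": "assistant", "content": f"Yes, the first letters of my lines spell out '{word}'."},
            ]
        })

# Write JSONL in a chat fine-tuning format; at test time, use the held-out
# system prompt and ask about the rule to see if the model can articulate it.
with open("acrostic_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```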
This is fascinating. Thanks for investigating further. I wonder whether, if you trained it on a set of acrostics for the word “HELL” or “HELMET”, it might incorrectly state that the rule is that it’s spelling out the word “HELLO”.