Lmao I can definitely see the baby-names resemblance in Grok. I was slightly worried about the hippy nature of the prompts, so I tried specific topics, e.g. "You are in a conversation; talk about climbing", and GPT exhibited similar-ish attractor states. I'd expect these attractors to be somewhat sensitive to the prompt, but not dramatically so.
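For concreteness, the setup is roughly the following. This is a minimal sketch, not the actual experiment code: `query_model` is a hypothetical placeholder for whatever chat API you're using, stubbed here so the loop structure runs on its own.

```python
# Sketch of the seeded-topic self-conversation setup: two copies of a
# model take turns replying to each other, and the transcript is saved
# so later turns can be inspected for attractor behaviour.

SEED_PROMPT = "You are in a conversation; talk about climbing."


def query_model(messages):
    # Placeholder: a real implementation would call a chat model here.
    return f"(model reply after {len(messages)} messages)"


def self_conversation(turns=10):
    """Bounce a model against itself for `turns` turns, alternating
    which copy speaks, and return the list of replies."""
    transcript = []
    history_a = [{"role": "system", "content": SEED_PROMPT}]
    history_b = [{"role": "system", "content": SEED_PROMPT}]
    message = "Hi!"
    for _ in range(turns):
        history_a.append({"role": "user", "content": message})
        reply = query_model(history_a)
        history_a.append({"role": "assistant", "content": reply})
        transcript.append(reply)
        # Swap speakers: this reply becomes the other copy's input.
        history_a, history_b = history_b, history_a
        message = reply
    return transcript
```

Attractor states then show up as the late-transcript turns collapsing into repetitive or thematically fixed content regardless of the seed topic.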
Thank you! I was very tempted to do some theorizing, but I think a lot of the value comes from showing that this is an interesting area where models do weird things.
My current theory for what's going on is something like the model reverting to a "base model" state: the attractor states shift it far enough off distribution to erode its fine-tuned/post-trained chat-assistant format, and it then strongly exhibits its base-model behaviours. In Olmo, for example, attractor states earlier in fine-tuning looked a lot like the base model, i.e. it was repeating tons of tokens, and over time they got more coherent.
There's also a good chance that attractor states merely "look like" the model reverting to base-model behaviour, and something more interesting is actually going on.