Francis Bacon writes in Novum Organum, where he introduces the modern scientific method, that contrary to the Greeks, who started with big theories that explained lots of things, the way to do science is to start by tracking particulars. Lots and lots of particulars, only generalizing theories over them bit by bit. He does this himself when trying to figure out heat: he lists all the different things that seem to be heat-like...
Well, this post gives me a lot of that vibe. Part of me was holding out for the juicy theorizing about what’s going on, but I appreciate the extensive cataloguing here. It strikes me as a bit tedious, but as good ol’ Bacon taught us, starting from lots and lots of particulars is how you get to theories that actually hold.
Beyond that, I think “abnormal psychology” is fascinating: you find out how things work by observing the ways in which they break. This is true of humans, so why not of LLM agents too? Alongside glitch tokens, I feel like exposing (and replicating) these attractor states is interesting. One thought I have is that these states show these models aren’t yet agents the way humans are, though a fast second thought is that perhaps humans have loops and attractor states just as much – they’re just harder to notice from the inside.
Interesting stuff, though. Kudos!
Thank you! I was very tempted to do some theorizing but I think a lot of the value would likely come from showing that this is an interesting area with models doing weird things.
My current theory for what’s going on here is something like the model reducing to a “base model” state: the attractor states shift it far enough off-distribution to erode its fine-tuning/post-training chat-assistant conceptions, and it then falls back heavily on its base-model behaviours. In Olmo, for example, attractor states earlier in fine-tuning looked a lot like the base model, i.e. it was repeating tons of tokens, and over time they got more coherent.
There’s a good chance, though, that attractor states only “look like” the model being reduced to base-model behaviour, and something more interesting is going on here.
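To make “looks like base-model behaviour” a bit more concrete, one crude way to quantify it is a distinct-n-gram ratio over a model’s outputs: heavily repetitive attractor-state text (like the early Olmo checkpoints above) scores near zero, while ordinary chat-assistant text scores near one. A minimal sketch of that metric, with made-up example strings rather than real model outputs:

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that are unique.

    Near 1.0 for varied text; near 0.0 for the heavy token repetition
    seen in early-checkpoint attractor states.
    """
    tokens = text.split()  # crude whitespace tokenization, just for illustration
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

# Invented examples: an "attractor-like" repetitive string vs. ordinary prose.
attractor_like = "the the the the the the the the the the the the"
ordinary = "The model answers the question, cites a source, and then stops."

print(distinct_ngram_ratio(attractor_like))  # close to 0
print(distinct_ngram_ratio(ordinary))        # close to 1
```

Tracking a number like this across post-training checkpoints would be a cheap way to check whether attractor-state outputs really start out base-model-repetitive and become more coherent over time, or whether something else is going on.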