Calling individual tokens the ‘State’ and a generated sequence the ‘Trajectory’ is wrong/misleading IMO.
I would instead call a sequence as a whole the ‘State’. This follows the meaning from dynamical systems.
Then you could refer to a Trajectory, which is a list of sequences, each with one more token than the previous one.
(That said, I’m not sure thinking about trajectories is useful in this context for various reasons)
To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.
Intuitively I would say that all the tokens in the token window are the state.
And when you run an inference pass, select a token and append that to the token window, then you have a new state.
The model looks a lot like a collection of nonlinear functions, each of them encoded using every parameter in the model.
Since the model is fixed after training, the only place an evolving state can exist has to be in the tokens, or more specifically the token window that is used as input.
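To make the "token window as state" view concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `toy_model` is not a real language model, just a fixed pure function from a window to a next-token distribution, and `step`, `trajectory`, and `max_len` are names made up for illustration. The toy rule deliberately looks at the last two tokens, so the last token alone does not determine what comes next, while the whole window does.

```python
import random
from typing import Dict, List, Tuple

Token = str
State = Tuple[Token, ...]  # the state is the whole token window


def toy_model(window: State) -> Dict[Token, float]:
    """Hypothetical frozen 'model': a pure function of the window.

    It has no internal memory; the only thing that evolves is its input.
    """
    # The rule depends on the last TWO tokens, so knowing only the last
    # token is not enough to predict the next-token distribution.
    if len(window) >= 2 and window[-2:] == ("a", "b"):
        return {"c": 1.0}
    if window and window[-1] == "b":
        return {"a": 1.0}
    return {"a": 0.5, "b": 0.5}


def step(window: State, rng: random.Random, max_len: int = 8) -> State:
    """One inference pass: sample a token, append it, truncate the window."""
    dist = toy_model(window)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    next_token = rng.choices(tokens, weights=weights, k=1)[0]
    return (window + (next_token,))[-max_len:]


def trajectory(start: State, n_steps: int, seed: int = 0) -> List[State]:
    """A trajectory in this view is a list of windows, not a list of tokens."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1], rng))
    return states
```

In this sketch, `toy_model(("a", "b"))` and `toy_model(("x", "b"))` return different distributions even though both windows end in `"b"`. That is the sense in which the process is Markovian over windows but not over single tokens.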
The state seems to contain, for lack of a better word, a lot of entanglement, likely due to the attention heads and how the nonlinear functions are encoded.
There is another way to view such a system, one that, while deeply flawed, at least gives me the intuition that whatever Microsoft and OpenAI are doing to “align(?)” something like Bing Chat is impossible (at least if the goal is bulletproof).
I would postulate:
- Alignment for such a system is impossible (assuming it has to be bulletproof)
- Impossibility is due to the architecture of such a system
I assume that any bit in the input affects the output, and that a change in any parameter has potential impact on that bit.
If anyone wants to hear about it, I would be happy to explain my thinking. But be aware that the abstraction and mapping I used were very sloppy and ad hoc.
Hmm, there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle and @metasemi arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions. In particular, if I have a first-order dynamical system $0 = F(x_t, \dot{x}_t)$, then $x$ is the state, not the trajectory of states $(x_1, \dots, x_t)$. In this situation, the dynamics of the system depend only on the current state (that’s because it’s a first-order system). When we move to higher-order systems, $0 = F(x_t, \dot{x}_t, \ddot{x}_t)$, the state is still just $x$, but the dynamics of the system depend not only on the current state but also on the “direction from which we entered it”. That’s the first derivative (in a time-continuous system) or the previous state (in a time-discrete system).
At least I think that’s what’s going on. If someone makes a compelling argument that defuses my argument then I’m happy to concede!
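For what it's worth, the two conventions can be related by the usual state-augmentation construction. The following is only a sketch in discrete time, and the symbols $f$, $g$, $z_t$, $s_t$ and $p$ are introduced here for illustration; they are not notation from the post. A second-order system in $x$ becomes first-order in the pair $z_t$, and pushing the same idea all the way gives the "whole sequence as state" view, where $s_t$ is the token window.

```latex
% Second-order dynamics in x: the next value depends on the current and previous one
x_{t+1} = f(x_t, x_{t-1})

% Augment the state with the previous value; the system is then first-order in z
z_t := (x_t, x_{t-1}), \qquad
z_{t+1} = \bigl(f(x_t, x_{t-1}),\, x_t\bigr) =: g(z_t)

% Taking the full history as the state gives a Markov process over sequences
s_t := (x_1, \dots, x_t), \qquad
x_{t+1} \sim p(\,\cdot \mid s_t), \qquad
s_{t+1} = (s_t, x_{t+1})
```

So the disagreement is about which object gets the name "state": $x_t$ with higher-order dynamics, or the augmented $s_t$ with first-order dynamics.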