Another perspective would be to look at the activations of an autoregressive deep learning model, e.g. a transformer, during inference as a stochastic process: the collection of activations $(X_t)$ at some layer as random variables indexed by time $t$, where $t$ is the token position.
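A minimal sketch of what I mean, assuming a GPT-2 checkpoint loaded via Hugging Face transformers (the choice of model, prompt, and layer index are just for illustration): collect the hidden states at one layer across token positions, so that `activations[t]` plays the role of a realization of $X_t$.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

layer = 6  # which layer's residual stream to treat as the process (X_t); arbitrary choice

with torch.no_grad():
    inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)

# hidden_states[layer] has shape (batch, seq_len, d_model);
# indexing by token position t gives the realization x_t of X_t.
X = out.hidden_states[layer][0]   # (seq_len, d_model): one sample path of the process
print(X.shape)
```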
One could, for example, look at the mutual information between the history $X_{\leq t} = (X_t, X_{t-1}, \dots)$ and the future of the activations $X_{t+1}$, or at the (conditional) mutual information between the past and future of subprocesses of $X_t$ (note: transfer entropy can be a useful tool to quantify directed information flow between different stochastic processes). There are many information-theoretic quantities one could be looking at.
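As a crude, hedged sketch of the kind of estimate involved: below, $I(X_t; X_{t+1})$ is estimated from many sample paths of a synthetic stand-in process (an AR(1) process, not real transformer activations), using a 1-D projection of the future and scikit-learn's k-NN mutual information estimator. Proper multivariate estimation (e.g. a KSG or neural estimator such as MINE) is substantially more involved.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_paths, T, d = 2000, 8, 16

# synthetic "activations": a d-dimensional AR(1) process as a stand-in for X_t
X = np.zeros((n_paths, T, d))
for t in range(1, T):
    X[:, t] = 0.8 * X[:, t - 1] + rng.normal(scale=0.5, size=(n_paths, d))

t = 4
past = X[:, t]        # realizations of X_t      (n_paths, d)
future = X[:, t + 1]  # realizations of X_{t+1}  (n_paths, d)

# project the future onto one direction so the estimator's target is scalar
w = rng.normal(size=d)
y = future @ w / np.linalg.norm(w)

# estimated MI (in nats) between each coordinate of the past and the projected future
mi = mutual_info_regression(past, y, random_state=0)
print(mi.round(3))
```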
If you want to formally define a probability distribution over activations, you could perhaps push forward the discrete probability distribution over tokens (in particular, the model's predictive distribution) through the embedding map.
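A minimal sketch of that pushforward idea: the predictive distribution $p(v)$ over tokens induces a discrete distribution over embedding vectors, with the atom at row $E[v]$ of the embedding matrix carrying probability $p(v)$. The vocabulary size, embedding dimension, and logits below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 32

E = rng.normal(size=(vocab_size, d_model))  # embedding map: token id -> R^d
logits = rng.normal(size=vocab_size)         # predictive logits for the next token
p = np.exp(logits - logits.max())
p /= p.sum()                                 # predictive distribution over tokens

# the pushforward distribution is supported on the rows of E with weights p;
# its mean and covariance follow directly
mean = p @ E
centered = E - mean
cov = (centered * p[:, None]).T @ centered

# sampling from the pushforward = sample a token, then embed it
samples = E[rng.choice(vocab_size, size=5, p=p)]
print(mean.shape, cov.shape, samples.shape)
```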
In the context of computational mechanics this seems like a useful perspective, for example for finding belief states in a data-driven way by optimizing the mutual information between some coarse-graining of the past states and the future states (stated like that this is still too vague, and I am working on a draft that goes into more detail about that perspective).
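One possible, heavily hedged instantiation of that idea (not the approach from the draft mentioned above): learn a coarse-graining $f$ of the past activations by maximizing an InfoNCE lower bound on the mutual information between $f(\text{past})$ and a projection of the future activation. The architecture, dimensions, and synthetic data below are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_belief, batch = 64, 8, 256

f = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_belief))  # coarse-graining of the past
g = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_belief))  # projection of the future
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

for step in range(200):
    # stand-in data: (past, future) activation pairs sharing a latent cause
    latent = torch.randn(batch, d_model)
    past = latent + 0.1 * torch.randn(batch, d_model)
    future = latent + 0.1 * torch.randn(batch, d_model)

    z_past = F.normalize(f(past), dim=-1)
    z_future = F.normalize(g(future), dim=-1)

    # InfoNCE: each past state should identify its own future among the batch
    logits = z_past @ z_future.T / 0.1
    loss = F.cross_entropy(logits, torch.arange(batch))

    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))  # log(batch) - loss lower-bounds I(f(past); future) in nats
```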