After convergence, the samples should be viewed as drawn from the stationary distribution, and ideally they have low autocorrelation, so it doesn’t seem to make sense to treat them as a vector, since there should be many equivalent traces.
This is a very subtle point theoretically, so I’m glad you highlighted this. Max may be able to give you a better answer here, but I’ll try my best to attempt one myself.
I think you may be (understandably) confused about a key aspect of the approach. The analysis isn’t focused on autocorrelation within individual traces, but rather correlations between different input traces evaluated on the same parameter samples from SGLD.
What this approach is actually doing, from a theoretical perspective, is estimating the expected correlation of per-datapoint losses over the posterior distribution of parameters. SGLD serves purely as a mechanism for sampling from this posterior (as it ordinarily does).
When examining correlations between $T(x_1)_t$ and $T(x_2)_t$ across different inputs $x_1, x_2$ but identical parameter samples $\theta_t$, the method approximates the posterior expectation
$$\mathbb{E}_{\theta \sim p(\theta \mid D)}\big[(L(\theta \mid x_1) - L(\theta_0 \mid x_1))(L(\theta \mid x_2) - L(\theta_0 \mid x_2))\big].$$
Assuming that SGLD is taking unbiased IID samples from the posterior, the (normalized) dot product of traces $\frac{1}{T}\sum_t T(x_1)_t\, T(x_2)_t$ is an unbiased estimator of this expectation. The vectorization of traces is therefore an efficient mechanism for computing these expectations in parallel and for representing the correlation structure geometrically.
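As a minimal numerical sketch (hypothetical names and toy data, not the actual pipeline from the post), the estimator for all pairs of inputs at once is just a matrix product of centered loss traces:

```python
import numpy as np

def trace_covariance(losses, baseline):
    """Monte Carlo estimate of the posterior expectation
    E[(L(theta|x_i) - L(theta_0|x_i)) (L(theta|x_j) - L(theta_0|x_j))]
    for every pair of inputs (i, j).

    losses:   (n_inputs, n_steps) per-datapoint losses L(theta_t | x_i),
              evaluated at the SAME SGLD samples theta_t for every input.
    baseline: (n_inputs,) losses L(theta_0 | x_i) at the initial point.
    """
    traces = losses - baseline[:, None]          # T(x_i)_t
    # Averaging the per-step products over t gives the estimate
    # (unbiased if the theta_t really were IID posterior samples).
    return traces @ traces.T / losses.shape[1]   # (n_inputs, n_inputs)

# Toy data: inputs 1 and 2 share a fluctuation pattern, input 3 doesn't.
rng = np.random.default_rng(0)
shared = rng.normal(size=2000)
losses = np.stack([
    1.0 + shared,
    2.0 + shared,
    0.5 + rng.normal(size=2000),
])
C = trace_covariance(losses, np.array([1.0, 2.0, 0.5]))
```

On this toy data `C[0, 1]` comes out near 1 (the variance of the shared component) while `C[0, 2]` sits near 0: inputs with no shared fluctuation pattern give near-orthogonal traces.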
At the risk of being repetitive, at an intuitive level, this is designed to detect when different inputs respond similarly to the same parameter perturbations. When inputs $x_1$ and $x_2$ share functional circuitry within the model, they'll likely show correlated loss patterns when that shared circuitry is disrupted at parameter $\theta_t$. When two trace vectors $T(x_1), T(x_2)$ are orthogonal, that means the per-sample losses for $x_1$ and $x_2$ are uncorrelated across the posterior. The presence or absence of synchronization reveals functional similarities between inputs that might not be apparent through other means.
What all this means is that we do in fact want to use SGLD for plain old MCMC sampling—we are genuinely attempting to use random samples from the posterior rather than e.g. examining the temporal behavior of SGLD. Ideally, we want all the usual desiderata of MCMC sampling algorithms here, like convergence, unbiasedness, low autocorrelation, etc. You’re completely correct that if SGLD is properly converged and sampling ideally, it has fairly boring autocorrelation within a trace—but this is exactly what we want.
Thanks! I did notice that we were comparing traces at the same parameter values by the third read-through, so I appreciate the clarification. I think the thing that would have made this clear to me is an explicit mention that it only makes sense to compare traces within the same run.
an explicit mention that it only makes sense to compare traces within the same run.
Yep, thanks for the suggestion. I also think Zach’s comment is very helpful and I’m planning to edit the post to include this and some of the stuff he mentioned.
To answer your other questions:
Does it make sense to change the temperature throughout the run (like simulated annealing) rather than just run with each temperature?
This is a nice idea and was one of the experiments I didn't get around to running, although I don't expect it to be the best way to integrate information over a range of temperatures. If it's true that we're observing different structure at different temperatures (and not just a re-packaged version of the same structure), then doing this will likely jumble everything up (e.g. make PCs less interpretable). I also think there's a chance the reason clustering traces works so well is that SGLD is imperfect, so observing the per-step losses is already effectively telling us how the inputs behave over a range of temperatures.
Does it make sense to e.g. run multiple chains?
As you mentioned above, directly looking at covariances between different chains doesn't make sense: taking the covariance of two traces from different chains would be the same as just multiplying their averages (the pLLCs). Averaging over chains (how the LLC is usually calculated) and then looking at covariances will just reduce signal, but averaging over covariances is probably a good idea (assuming each chain was well behaved and giving similar pLLC estimates; at a bad set of hyperparams SGLD will give you significantly different estimates per chain, and averaging covariances might be misleading).
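A sketch of that last point, with hypothetical array shapes (this assumes each chain's traces were centered against its own $\theta_0$): compute the covariance matrix within each chain, then average the matrices.

```python
import numpy as np

def averaged_trace_covariance(chain_losses, chain_baselines):
    """chain_losses:    (n_chains, n_inputs, n_steps) per-datapoint losses.
    chain_baselines: (n_chains, n_inputs) losses at each chain's theta_0.
    """
    traces = chain_losses - chain_baselines[..., None]
    # Covariance within each chain: traces from different chains were
    # evaluated at different theta_t, so cross-chain products carry no signal.
    per_chain = np.einsum('cit,cjt->cij', traces, traces) / traces.shape[-1]
    # Average the covariance matrices, NOT the traces: averaging traces
    # first (as for the LLC) smooths away the fluctuations carrying signal.
    return per_chain.mean(axis=0)

# Toy check: two chains, two inputs tracking the same per-chain signal.
rng = np.random.default_rng(1)
signal = rng.normal(size=(2, 1, 1000))
C = averaged_trace_covariance(np.tile(signal, (1, 2, 1)), np.zeros((2, 2)))
```

The per-chain-then-average order is the whole point; swapping it (averaging traces across chains first) would collapse exactly the per-step fluctuations the covariance is supposed to measure.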
Could you use a per-sample gradient trace (rather than the loss trace) of the SGLD to learn something?
I think there are lots of observables which could be interesting replacements for the loss (most roughly equivalent), but I’m not particularly sure I have any good ideas about what to expect/do with the data in this case. In terms of visualization, you’d probably have to do some tricks with your dim reduction but I imagine you could nicely represent trajectories doing something like this.
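For what it's worth, here is one dim-reduction trick of the kind I mean, sketched on random stand-in data (a real version would feed in the actual per-step observables): project the step-by-step trajectory onto its top principal components and plot the resulting path.

```python
import numpy as np

# Stand-in data: pretend each SGLD step yields a high-dim observable
# (e.g. a per-sample gradient); here it's just a random drifting walk.
rng = np.random.default_rng(2)
obs = rng.normal(size=(300, 50)).cumsum(axis=0)   # (n_steps, dim)

# PCA via SVD: center, then project onto the top two components.
centered = obs - obs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
path_2d = centered @ vt[:2].T   # (n_steps, 2); plot this to see the path
```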