I did notice that we were comparing traces at the same parameter values by the third read-through, so I appreciate the clarification. I think the thing that would have made this clear to me is an explicit mention that it only makes sense to compare traces within the same run.
Yep, thanks for the suggestion. I also think Zach’s comment is very helpful and I’m planning to edit the post to include this and some of the stuff he mentioned.
To answer your other questions:
Does it make sense to change the temperature throughout the run (like simulated annealing) rather than just run with each temperature?
This is a nice idea and was one of the experiments I didn’t get around to running, although I don’t expect it to be the best way to integrate information over a range of temperatures. If it’s true that we’re observing different structure at different temperatures (and not just a re-packaged version of the same structure), then doing this will likely jumble everything up (e.g. make the PCs less interpretable). I also think there’s a chance the reason clustering traces works so well is that SGLD is imperfect, and observing the per-step losses is already effectively telling us about how the inputs behave over a range of temperatures.
Does it make sense to e.g. run multiple chains?
As you mentioned above, directly looking at covariances between different chains doesn’t make sense: the chains are independent, so the expected product of two traces from different chains is just the product of their averages (the pLLCs), and the covariance itself carries no extra signal. Averaging over chains (how the LLC is usually calculated) and then looking at covariances will just reduce signal, but averaging over covariances is probably a good idea. That assumes each chain was well behaved and gave similar pLLC estimates; at a bad set of hyperparams SGLD will give you significantly different estimates per chain, and averaging covariances might be misleading.
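To make the distinction between those three options concrete, here is a rough NumPy sketch on synthetic data. It is not the actual pLLC pipeline: the array layout `(n_chains, n_steps, n_inputs)`, the function names, and the toy data (two inputs driven by a shared per-chain fluctuation, one unrelated input) are all assumptions made up for illustration.

```python
import numpy as np

def averaged_covariance(traces):
    """Compute the input-input covariance within each chain, then average over chains.

    traces: array of shape (n_chains, n_steps, n_inputs), hypothetical layout.
    """
    centered = traces - traces.mean(axis=1, keepdims=True)
    n_steps = traces.shape[1]
    per_chain = np.einsum("csi,csj->cij", centered, centered) / (n_steps - 1)
    return per_chain.mean(axis=0)

def covariance_of_averaged_trace(traces):
    """Average the traces over chains first, then take the covariance."""
    mean_trace = traces.mean(axis=0)  # shape (n_steps, n_inputs)
    return np.cov(mean_trace, rowvar=False)

def cross_chain_covariance(traces, a, b):
    """Covariance between inputs as seen in chain a and inputs as seen in chain b."""
    ca = traces[a] - traces[a].mean(axis=0)
    cb = traces[b] - traces[b].mean(axis=0)
    return ca.T @ cb / (traces.shape[1] - 1)

# Toy data: inputs 0 and 1 share a fluctuation z that is independent across chains.
rng = np.random.default_rng(0)
n_chains, n_steps, n_inputs = 4, 5000, 3
z = rng.normal(size=(n_chains, n_steps, 1))
weights = np.array([1.0, 1.0, 0.0])
noise = 0.1 * rng.normal(size=(n_chains, n_steps, n_inputs))
traces = 2.0 + weights * z + noise

avg_cov = averaged_covariance(traces)
cov_of_avg = covariance_of_averaged_trace(traces)
cross = cross_chain_covariance(traces, 0, 1)

print("average of per-chain covariances, inputs (0, 1):", avg_cov[0, 1])
print("covariance of chain-averaged trace, inputs (0, 1):", cov_of_avg[0, 1])
print("cross-chain covariance, inputs (0, 1):", cross[0, 1])
```

In this toy setup the averaged per-chain covariance recovers the shared structure between inputs 0 and 1, the covariance of the chain-averaged trace is shrunk by roughly a factor of the number of chains, and the cross-chain covariance sits near zero. Real SGLD traces are autocorrelated and chains can disagree, so treat this only as an illustration of the bookkeeping.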
Could you use a per-sample gradient trace (rather than the loss trace) of the SGLD to learn something?
I think there are lots of observables which could be interesting replacements for the loss (most roughly equivalent), but I’m not particularly sure I have any good ideas about what to expect/do with the data in this case. In terms of visualization, you’d probably have to do some tricks with your dim reduction but I imagine you could nicely represent trajectories doing something like this.
Thanks!