And to the extent they only partially do, we have no reason to expect that a simple stochastic model of the remainder would be worse than any other model.
I think, empirically, there is a good reason to suspect stochastic models have a lower K-complexity. Normalizing flows (or diffusion models) have three sources of information:
the training code (model architecture + optimizer + hyperparameters),
the initial latent variable drawn from a normal distribution,
the trajectory in the stochastic differential equation.
And the thing is, they work so much better than non-stochastic models like GANs or ‘partially-stochastic’ models like VAEs (your definition of ‘partially-stochastic’). Now, I get that GANs have a different learning dynamic and VAEs have the wrong optimization target, but it seems that relegating some of the bits to trajectory-choosing makes for better models with lower K-complexity.
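To make concrete where those three sources of information sit, here is a minimal sketch of an Euler-Maruyama-style sampling loop (not any particular paper's sampler; `score_model`, `steps`, and `dt` are stand-ins I'm assuming):

```python
import numpy as np

def sample(score_model, dim=2, steps=1000, dt=1e-3, rng=np.random.default_rng(0)):
    # Source 1: the training code. The architecture, optimizer, and
    # hyperparameters are baked into `score_model` and into constants
    # like `steps` and `dt`.
    x = rng.standard_normal(dim)              # Source 2: initial latent ~ N(0, I)
    for t in np.linspace(1.0, dt, steps):
        drift = score_model(x, t)             # deterministic part of the update
        noise = rng.standard_normal(dim)      # Source 3: fresh noise at every step,
        x = x + drift * dt + np.sqrt(dt) * noise  # i.e. the stochastic trajectory
    return x

# Toy usage with a stand-in "score" that just pulls samples toward the origin.
print(sample(lambda x, t: -x))
```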
If the universe has high K-complexity, then any theoretically best model has to be either stochastic or "inherently complex" (which is worse than stochastic).
That might or might not be the case. From current models in practice having to be stochastic to make good predictions, it doesn’t follow that the theoretically best models must be. But it could be the case.
I’m not sure why ‘partially-stochastic’ would ever fail, because of the coding theorem. That is, a model that makes stochastic decisions along the way has an equivalent formulation in which all the stochastic decisions are made up front: instead of making a new stochastic decision mid-run, you read the next one off those initial bits.
Partially-stochastic has a longer running time, because you have to predict which trajectories work ahead of time. Imagine having to guess the model weights, and then only use the training data to see if your guess checks out. Instead of wasting time finding a better guess, VAEs just say, “all guesses [for the ‘partially-stochastic’ bits] should be equally valid.” We know that isn’t true, so there are going to be performance issues.
I’m not sure why you’re thinking about guessing model weights here. The thing I’m thinking of with stochastic models is the forward-pass part: Monte Carlo sampling. I’m not sure why pre-computed randomness would be a problem for that portion.
As a weird example: Say there’s a memoized random function mapping strings to uniform random bits. This can’t really be pre-computed, because it’s very big. But it can be lazily evaluated, as if pre-computed. Now the stochastic model can query the memoized random function with a unique specification of the situation it’s querying. This should be equivalent to flipping coins mid-run.
Alternatively, if the Monte Carlo process is sequential, then it can just “read the next bit”, which is computationally simpler.
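A throwaway sketch of both versions (names are illustrative, not from any library): a lazily evaluated, memoized random function keyed by a description of the situation, and the simpler sequential tape you just read off in order.

```python
import random

class LazyRandomTape:
    """Memoized map from situation descriptions to uniform random bits.
    Far too big to write out in advance, but evaluated lazily it behaves
    exactly as if it had been pre-computed."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._memo = {}

    def bit(self, situation: str) -> int:
        if situation not in self._memo:
            self._memo[situation] = self._rng.getrandbits(1)
        return self._memo[situation]

tape = LazyRandomTape()
# Querying with a unique description of the current situation is equivalent to
# flipping a coin mid-run, and asking the same question twice gives the same bit.
assert tape.bit("step=3, state=(1,-1,1)") == tape.bit("step=3, state=(1,-1,1)")

# The sequential alternative: pre-generate a bit string and just read the next bit.
tape_rng = random.Random(42)
pre_generated = [tape_rng.getrandbits(1) for _ in range(1000)]
bits = iter(pre_generated)
first, second = next(bits), next(bits)
```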
Maybe it’s not an issue for forward sampling but it is for backprop? Not sure what you mean.
I’m not really sure what you mean either. Here’s a simplified toy that I think captures what you’re saying:
A turtle starts at the origin.
We flip a series of coins—on heads we move +1 in the nth dimension, on tails −1 in the nth dimension.
After N coin flips, we’ll be somewhere in N-d space. It obviously can be described with N bits.
Why are we flipping coins, instead of storing that N-bit string and reading the bits off one at a time? Why do we need the information in real time?
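(For concreteness, here is the toy in code, showing that flipping coins live and reading a stored bit string land the turtle in exactly the same place; N and the seed are arbitrary.)

```python
import random

N = 8
string_rng = random.Random(7)
stored_bits = [string_rng.getrandbits(1) for _ in range(N)]  # the N-bit string

def walk_live(rng, n=N):
    # Flip a coin at each step: heads means +1, tails means -1 in that dimension.
    return [1 if rng.getrandbits(1) else -1 for _ in range(n)]

def walk_from_string(bits):
    # Read the pre-committed bits off one at a time instead of flipping.
    return [1 if b else -1 for b in bits]

assert walk_live(random.Random(7)) == walk_from_string(stored_bits)
```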
Well, suppose you only care about that particular N-bit string. Maybe it’s the code to human DNA. How are you supposed to write down the string before humans exist? You would have to do a very expensive simulation.
If you’re training a neural network on offline data, sure, you can seed a pseudo-random number generator and “write the randomness” down early. Training robots in simulation translates pretty well to the real world, so you don’t lose much. Now that I think about it, you might be able to claim the same with VAEs. My issue with VAEs is that they add the wrong noise, but that’s probably due to humans not finding the right algorithm rather than the specific distribution being expensive to find.
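A tiny sketch of that “write the randomness down early” move with a seed (no particular training framework assumed): a generator seeded the same way produces exactly the stream you would have drawn live during training, so you can dump it to a tape beforehand and lose nothing.

```python
import numpy as np

seed, steps, dim = 0, 100, 4
rng_live = np.random.default_rng(seed)
rng_ahead = np.random.default_rng(seed)

# Write the per-step training noise down before training starts...
tape = [rng_ahead.standard_normal(dim) for _ in range(steps)]

# ...and it matches what a live generator would have produced at each step.
for step in range(steps):
    assert np.allclose(rng_live.standard_normal(dim), tape[step])
```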
This seems like a case of Bayesian inference. Like, we start from the observation that humans exist with the properties they have, and then find the set of strings consistent with that. That is, start from a uniform measure on the strings and then condition on “the string produces humans”.
Which is computationally intractable, of course. The usual Bayesian inference issues. Though Bayesian inference would be hard if the stochasticity were generated on the fly rather than up front, too.
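The intractability is easy to see if you implement the conditioning literally as rejection sampling: draw uniform strings and keep the ones satisfying the predicate. With a predicate as specific as “the string produces humans”, the acceptance probability, and hence the expected number of rejections, is astronomically bad; the toy predicate below is a hypothetical stand-in.

```python
import random

def condition(predicate, n_bits, rng=random.Random(0), max_tries=10**6):
    # Uniform measure on n-bit strings, conditioned on the predicate,
    # implemented by brute-force rejection sampling.
    for _ in range(max_tries):
        s = tuple(rng.getrandbits(1) for _ in range(n_bits))
        if predicate(s):
            return s
    raise RuntimeError("gave up: the conditioned set is too small to hit by luck")

# Stand-in for "the string produces humans": the first 16 bits must all be 1.
# Even this mild predicate already needs about 2**16 tries on average.
print(condition(lambda s: all(s[:16]), n_bits=32))
```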