Yup, it sure does look similar. One tricky point here is that we're trying to fit the f's to the data, so if going that route we'd need to pick some parametric form for f.
Yes. Seems like a pretty strong assumption to me.
Ah. In that case, are you sure you actually need Z to do the model comparisons you want? Do you even really need to work with this specific functional form at all? As opposed to e.g. training a model p(λ∣X) to feed its output into m tiny normalizing flow models which then try to reconstruct the original input data with conditional probability distributions q_i(x_i∣λ)?
To sketch out a little more what I mean, p(λ∣X) could e.g. be constructed as a parametrised function[1] which takes in the actual samples X and returns the mean of a Gaussian, which λ is then sampled from in turn[2]. The q_i(x_i∣λ) would be constructed using normalising flow networks[3], which take in λ as well as uniform distributions over variables z_i that have the same dimensionality as their x_i. Since the networks are efficiently invertible, this gives you explicit representations of the conditional probabilities q_i(x_i∣λ), which you can then fit to the actual data using KL-divergence.
You’d get explicit representations for both P[λ∣X] and P[X∣λ] from this.
[1] Or an ensemble of functions, if you want the mean of λ to be something like ∑_i f_i(x_i) specifically.
[2] Using reparameterization to keep the sampling operation differentiable with respect to the mean.
If the dictionary of possible values of X is small, you can also just use a more conventional ML setup which explicitly outputs probabilities for every possible value of every x_i, of course.
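To make the flow-based version above concrete, here is a minimal sketch, assuming PyTorch. The layer sizes, the single conditional affine layer standing in for each "tiny normalizing flow", and the Gaussian (rather than uniform) base variable are illustrative simplifications, not part of the proposal itself; the encoder mean is built as ∑_i f_i(x_i) to match footnote [1], and the λ sample uses reparameterization as in footnote [2].

```python
import torch
import torch.nn as nn

class LambdaEncoder(nn.Module):
    """p(lambda | X): a Gaussian over lambda whose mean is sum_i f_i(x_i)."""
    def __init__(self, m: int, lambda_dim: int, hidden: int = 32):
        super().__init__()
        # One small network f_i per coordinate x_i (hypothetical sizes).
        self.fs = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, lambda_dim))
            for _ in range(m)
        )
        self.log_std = nn.Parameter(torch.zeros(lambda_dim))

    def forward(self, X: torch.Tensor) -> torch.distributions.Normal:
        # X: [batch, m]; mean of the Gaussian over lambda is sum_i f_i(x_i).
        mean = sum(f(X[:, i : i + 1]) for i, f in enumerate(self.fs))
        return torch.distributions.Normal(mean, self.log_std.exp())

class TinyConditionalFlow(nn.Module):
    """q_i(x_i | lambda): a single conditional affine layer over a standard
    normal base variable z_i, i.e. the smallest possible 'flow'. A fuller
    version would stack coupling/spline layers (and could use a uniform base)."""
    def __init__(self, lambda_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lambda_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def log_prob(self, x_i: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.net(lam).chunk(2, dim=-1)
        # x_i = shift + exp(log_scale) * z_i; invert and add the log-Jacobian term.
        z = (x_i - shift) * torch.exp(-log_scale)
        base = torch.distributions.Normal(0.0, 1.0)
        return (base.log_prob(z) - log_scale).squeeze(-1)

def training_step(encoder, flows, X, optimizer):
    """One gradient step on -log q(X | lambda), i.e. the maximum-likelihood / KL fit."""
    posterior = encoder(X)
    lam = posterior.rsample()  # reparameterised sample, keeps the mean differentiable
    log_q = sum(flow.log_prob(X[:, i : i + 1], lam) for i, flow in enumerate(flows))
    loss = -log_q.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training would just repeat `training_step` over minibatches of X, with one optimiser covering both the encoder's and all the flows' parameters.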
That would be pretty reasonable, but it would make the model comparison part even harder. I do need P[X] (and therefore Z) for model comparison; this is the challenge which always comes up for Bayesian model comparison.
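To spell out the quantity at stake: Bayesian comparison of two models M_1 and M_2 runs through each model's evidence, i.e. the normalising constant Z, via the Bayes factor:

$$
Z_k = P[X \mid M_k] = \int P[X \mid \lambda, M_k]\,P[\lambda \mid M_k]\,d\lambda,
\qquad
\frac{P[M_1 \mid X]}{P[M_2 \mid X]} = \frac{P[M_1]}{P[M_2]}\cdot\frac{Z_1}{Z_2}.
$$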
Why does it make Bayesian model comparison harder? Wouldn’t you get explicit predicted probabilities for the data X from any two models you train this way? I guess you do need to sample from the Gaussian in λ a few times for each X and pass the result through the flow models, but that shouldn’t be too expensive.
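For what it's worth, the sampling scheme described here is cheap to write down. A minimal sketch, reusing the hypothetical `LambdaEncoder` / `TinyConditionalFlow` classes from the earlier code block (K = 16 is an arbitrary choice):

```python
import math
import torch

@torch.no_grad()
def log_mean_model_prob(encoder, flows, X: torch.Tensor, K: int = 16) -> torch.Tensor:
    """Log of (1/K) * sum_k prod_i q_i(x_i | lambda_k), with lambda_k ~ p(lambda | X).
    Note this averages the flow likelihood over the encoder's lambda-distribution,
    not over a prior on lambda."""
    posterior = encoder(X)  # Gaussian over lambda for each row of X
    log_qs = []
    for _ in range(K):
        lam = posterior.sample()
        log_q = sum(flow.log_prob(X[:, i : i + 1], lam) for i, flow in enumerate(flows))
        log_qs.append(log_q)  # shape [batch]
    # Stable log of the average of exp(log_q) over the K draws.
    return torch.logsumexp(torch.stack(log_qs, dim=0), dim=0) - math.log(K)
```

Summing the returned per-sample values over a held-out dataset gives one number per trained model that could then be compared.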