I’m assuming, for simplicity, that each Xi has finitely many values. The sum over X is then a sum over the Cartesian product of the values of each Xi, which we can rewrite in general as ∑X g(X) = (∏i ni) EQ[g(X)], where Q is the uniform distribution on X and ni is the number of values of Xi. That uniform distribution Q is a product of uniform distributions over each individual Xi, i.e. Uniform[X] = ∏i Uniform[Xi], so the Xi’s are all independent under Q. So, under Q, the fi(Xi)’s are all independent.
Did that clarify?
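Spelling out the payoff of that independence observation, under my assumption that the Z in question is the normalizer Z = ∑X ∏i fi(xi): the expectation of a product of independent factors splits into a product of expectations, so the high-dimensional sum over X collapses into a product of one-dimensional sums.

```latex
Z = \sum_X \prod_i f_i(x_i)
  = \Big(\prod_i n_i\Big)\,\mathbb{E}_Q\Big[\prod_i f_i(X_i)\Big]
  = \Big(\prod_i n_i\Big)\prod_i \mathbb{E}_Q\big[f_i(X_i)\big]
  = \prod_i \sum_{x_i} f_i(x_i)
```

The same trick applies if the form is instead Z = ∑X exp(∑i fi(xi)), since that is also a product of per-coordinate factors exp(fi(xi)).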
This expression looks extremely like the kind of thing you’d usually want to calculate with Feynman diagrams, except I’m not sure whether the fi(xi) have the right form to allow us to power-expand in xi and then shove the non-quadratic xi terms into source derivatives the way we usually would in perturbative quantum field theory.
Yup, it sure does look similar. One tricky point here is that we’re trying to fit the f’s to the data, so if going that route we’d need to pick some parametric form for f. We’d want to pick a form which always converges, but also a form general enough that the fitting process doesn’t drive f to the edge of our admissible region.
Yes. Seems like a pretty strong assumption to me.
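As a minimal illustration of the parametric-form point above (my own choice of form, not necessarily what’s intended here): with finitely many values per Xi, a log-linear table fi(xi) = exp(θi[xi]) always gives a finite Z, and the factorization above makes Z exact and cheap to compute. Sizes and names below are made up.

```python
# Minimal sketch (NumPy): log-linear factors f_i(x_i) = exp(theta_i[x_i]) over finite
# alphabets, with Z computed exactly via the per-coordinate factorization.
import numpy as np

n_values = [5, 5, 7, 3]                    # assumed number of values of each X_i
thetas = [np.zeros(n) for n in n_values]   # one table of logits per coordinate

def log_Z():
    # Z = prod_i sum_{x_i} f_i(x_i), so log Z is a sum of per-coordinate log-sums
    return sum(np.log(np.exp(t).sum()) for t in thetas)

def log_p(X_row):
    # log P[X] = sum_i log f_i(x_i) - log Z, for one configuration X_row of integer codes
    return sum(thetas[i][x] for i, x in enumerate(X_row)) - log_Z()
```

For continuous or unbounded Xi this kind of form would need extra constraints to keep Z finite, which I take to be the "always converges" worry.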
Ah. In that case, are you sure you actually need Z to do the model comparisons you want? Do you even really need to work with this specific functional form at all? As opposed to e.g. training a model p(λ∣X) to feed its output into m tiny normalizing flow models which then try to reconstruct the original input data with conditional probability distributions qi(xi∣λ)?
To sketch out a little more what I mean: p(λ∣X) could e.g. be constructed as a parametrised function[1] which takes in the actual samples X and returns the mean of a Gaussian, from which λ is then sampled in turn[2]. The qi(xi∣λ) would be constructed using normalising flow networks[3], which take in λ together with variables zi drawn from uniform distributions of the same dimensionality as their xi. Since the networks are efficiently invertible, this gives you explicit representations of the conditional probabilities qi(xi∣λ), which you can then fit to the actual data using KL-divergence.
You’d get explicit representations for both P[λ∣X] and P[X∣λ] from this.
[1] Or an ensemble of functions, if you want the mean of λ to be something like ∑i fi(xi) specifically.
[2] Using reparameterization to keep the sampling operation differentiable in the mean.
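Here is a rough sketch of how I’d read this proposal in code, taking liberties the comment doesn’t commit to: each xi is treated as a 1-D continuous variable, p(λ∣X) is a Gaussian with learned mean and unit variance, each qi(xi∣λ) is a single conditional affine flow, and the base distribution is a standard normal rather than a uniform. All class names, sizes, and hyperparameters are made up for illustration.

```python
# Rough sketch (PyTorch) of the encoder + per-coordinate conditional flows described above.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """p(lambda | X): returns the mean of a Gaussian over lambda."""
    def __init__(self, m, lam_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, lam_dim))

    def forward(self, X):                      # X: (batch, m)
        return self.net(X)                     # mean of lambda: (batch, lam_dim)

class AffineConditionalFlow(nn.Module):
    """q_i(x_i | lambda): x_i = mu(lambda) + exp(log_s(lambda)) * z, with z ~ N(0, 1)."""
    def __init__(self, lam_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lam_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def log_prob(self, x_i, lam):              # x_i: (batch, 1), lam: (batch, lam_dim)
        mu, log_s = self.net(lam).chunk(2, dim=-1)
        z = (x_i - mu) * torch.exp(-log_s)     # invert the affine map
        base = torch.distributions.Normal(0.0, 1.0)
        return (base.log_prob(z) - log_s).squeeze(-1)   # change of variables

m, lam_dim = 4, 2
enc = Encoder(m, lam_dim)
flows = nn.ModuleList([AffineConditionalFlow(lam_dim) for _ in range(m)])
opt = torch.optim.Adam(list(enc.parameters()) + list(flows.parameters()), lr=1e-3)

X = torch.randn(256, m)                        # stand-in for the actual data
for step in range(200):
    mean = enc(X)
    lam = mean + torch.randn_like(mean)        # reparameterized sample of lambda
    # reconstruction term: maximize sum_i log q_i(x_i | lambda)
    log_q = torch.stack([flows[i].log_prob(X[:, i:i+1], lam) for i in range(m)], dim=-1)
    loss = -log_q.sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

A real version would stack more expressive (still invertible) layers per qi, and would probably also want a prior on λ (with a KL term to it) so that the trained model pins down a P[X] to use in the comparison step rather than just a reconstruction objective.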
If the dictionary of possible values of X is small, you can of course also just use a more conventional ML setup which explicitly outputs probabilities for every possible value of every xi.
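A minimal sketch of that variant, again with made-up names and sizes: swap the flow decoders above for one softmax head per coordinate, conditioned on λ.

```python
# Sketch of the "explicit probabilities" variant for finite-valued x_i.
import torch
import torch.nn as nn

n_values = [5, 5, 7, 3]                        # assumed dictionary size for each x_i
lam_dim = 2
heads = nn.ModuleList([nn.Linear(lam_dim, n) for n in n_values])

def log_q(x_idx, lam):
    """x_idx: (batch, m) integer codes for the x_i; returns sum_i log q_i(x_i | lambda)."""
    logps = [torch.log_softmax(heads[i](lam), dim=-1).gather(1, x_idx[:, i:i+1])
             for i in range(len(heads))]
    return torch.cat(logps, dim=-1).sum(-1)
```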
That would be pretty reasonable, but it would make the model comparison part even harder. I do need P[X] (and therefore Z) for model comparison; this is the challenge which always comes up for Bayesian model comparison.
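For context on why P[X] is the sticking point: Bayesian comparison of two models M1, M2 goes through the posterior odds, which require each model’s evidence P[X∣Mi], i.e. exactly the normalized P[X] (and hence Z) being discussed.

```latex
\frac{P[M_1 \mid X]}{P[M_2 \mid X]}
  = \frac{P[X \mid M_1]}{P[X \mid M_2]} \cdot \frac{P[M_1]}{P[M_2]}
```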
Why does it make Bayesian model comparison harder? Wouldn’t you get explicit predicted probabilities for the data X from any two models you train this way? I guess you do need to sample from the Gaussian in λ a few times for each X and pass the result through the flow models, but that shouldn’t be too expensive.
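For concreteness, the estimate being described here, as I read it (with μ(X) the learned Gaussian mean and K a handful of samples per datapoint):

```latex
\widehat{P}[X] = \frac{1}{K}\sum_{k=1}^{K}\prod_i q_i(x_i \mid \lambda_k),
\qquad \lambda_k \sim \mathcal{N}\big(\mu(X),\, I\big)
```

As written this averages reconstruction probabilities under p(λ∣X); whether that coincides with the P[X] needed for the evidence ratio above depends on how λ and its prior are set up.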