I’m assuming, for simplicity, that each Xi has finitely many values. The sum over X is then a sum over the Cartesian product of the values of each Xi, which we can rewrite in general as ∑X g(X) = (∏i ni) EQ[g(X)], where Q is the uniform distribution on X and ni is the number of values of Xi. That uniform distribution Q is a product of uniform distributions over each individual Xi, i.e. Uniform[X] = ∏i Uniform[Xi], so the Xi’s are all independent under Q. So, under Q, the fi(Xi)’s are all independent.
Did that clarify?
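Spelling out the payoff of that independence observation, under my assumption that the Z in question is the normalizer Z = ∑X ∏i fi(xi): the expectation of a product of independent factors splits into a product of expectations, so the high-dimensional sum over X collapses into a product of one-dimensional sums.

```latex
Z = \sum_X \prod_i f_i(x_i)
  = \Big(\prod_i n_i\Big)\,\mathbb{E}_Q\Big[\prod_i f_i(X_i)\Big]
  = \Big(\prod_i n_i\Big)\prod_i \mathbb{E}_Q\big[f_i(X_i)\big]
  = \prod_i \sum_{x_i} f_i(x_i)
```

The same trick applies if the form is instead Z = ∑X exp(∑i fi(xi)), since that is also a product of per-coordinate factors exp(fi(xi)).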
This expression looks extremely like the kind of thing you’d usually want to calculate with Feynman diagrams, except I’m not sure whether the fi(xi) have the right form to allow us to power-expand in xi and then shove the non-quadratic xi terms into source derivatives the way we usually would in perturbative quantum field theory.
Yup, it sure does look similar. One tricky point here is that we’re trying to fit the f’s to the data, so if going that route we’d need to pick some parametric form for f. We’d want to pick a form which always converges, but also a form general enough that the fitting process doesn’t drive f to the edge of our admissible region.
Yes. Seems like a pretty strong assumption to me.
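As a minimal illustration of the parametric-form point above (my own choice of form, not necessarily what’s intended here): with finitely many values per Xi, a log-linear table fi(xi) = exp(θi[xi]) always gives a finite Z, and the factorization above makes Z exact and cheap to compute. Sizes and names below are made up.

```python
# Minimal sketch (NumPy): log-linear factors f_i(x_i) = exp(theta_i[x_i]) over finite
# alphabets, with Z computed exactly via the per-coordinate factorization.
import numpy as np

n_values = [5, 5, 7, 3]                    # assumed number of values of each X_i
thetas = [np.zeros(n) for n in n_values]   # one table of logits per coordinate

def log_Z():
    # Z = prod_i sum_{x_i} f_i(x_i), so log Z is a sum of per-coordinate log-sums
    return sum(np.log(np.exp(t).sum()) for t in thetas)

def log_p(X_row):
    # log P[X] = sum_i log f_i(x_i) - log Z, for one configuration X_row of integer codes
    return sum(thetas[i][x] for i, x in enumerate(X_row)) - log_Z()
```

For continuous or unbounded Xi this kind of form would need extra constraints to keep Z finite, which I take to be the "always converges" worry.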
Ah. In that case, are you sure you actually need Z to do the model comparisons you want? Do you even really need to work with this specific functional form at all? As opposed to e.g. training a model p(λ∣X) to feed its output into m tiny normalizing flow models which then try to reconstruct the original input data with conditional probability distributions qi(xi∣λ)?
To sketch out a little more what I mean: p(λ∣X) could e.g. be constructed as a parametrised function[1] which takes in the actual samples X and returns the mean of a Gaussian, from which λ is then sampled in turn[2]. The qi(xi∣λ) would be constructed using normalising flow networks[3], which take in λ together with variables zi drawn from uniform distributions of the same dimensionality as their xi. Since the networks are efficiently invertible, this gives you explicit representations of the conditional probabilities qi(xi∣λ), which you can then fit to the actual data using KL-divergence.
You’d get explicit representations for both P[λ∣X] and P[X∣λ] from this.
[1] Or an ensemble of functions, if you want the mean of λ to be something like ∑i fi(xi) specifically.
[2] Using reparameterization to keep the sampling operation differentiable in the mean.
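Here is a rough sketch of how I’d read this proposal in code, taking liberties the comment doesn’t commit to: each xi is treated as a 1-D continuous variable, p(λ∣X) is a Gaussian with learned mean and unit variance, each qi(xi∣λ) is a single conditional affine flow, and the base distribution is a standard normal rather than a uniform. All class names, sizes, and hyperparameters are made up for illustration.

```python
# Rough sketch (PyTorch) of the encoder + per-coordinate conditional flows described above.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """p(lambda | X): returns the mean of a Gaussian over lambda."""
    def __init__(self, m, lam_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, lam_dim))

    def forward(self, X):                      # X: (batch, m)
        return self.net(X)                     # mean of lambda: (batch, lam_dim)

class AffineConditionalFlow(nn.Module):
    """q_i(x_i | lambda): x_i = mu(lambda) + exp(log_s(lambda)) * z, with z ~ N(0, 1)."""
    def __init__(self, lam_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lam_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def log_prob(self, x_i, lam):              # x_i: (batch, 1), lam: (batch, lam_dim)
        mu, log_s = self.net(lam).chunk(2, dim=-1)
        z = (x_i - mu) * torch.exp(-log_s)     # invert the affine map
        base = torch.distributions.Normal(0.0, 1.0)
        return (base.log_prob(z) - log_s).squeeze(-1)   # change of variables

m, lam_dim = 4, 2
enc = Encoder(m, lam_dim)
flows = nn.ModuleList([AffineConditionalFlow(lam_dim) for _ in range(m)])
opt = torch.optim.Adam(list(enc.parameters()) + list(flows.parameters()), lr=1e-3)

X = torch.randn(256, m)                        # stand-in for the actual data
for step in range(200):
    mean = enc(X)
    lam = mean + torch.randn_like(mean)        # reparameterized sample of lambda
    # reconstruction term: maximize sum_i log q_i(x_i | lambda)
    log_q = torch.stack([flows[i].log_prob(X[:, i:i+1], lam) for i in range(m)], dim=-1)
    loss = -log_q.sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

A real version would stack more expressive (still invertible) layers per qi, and would probably also want a prior on λ (with a KL term to it) so that the trained model pins down a P[X] to use in the comparison step rather than just a reconstruction objective.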
If the dictionary of possible values of X is small, you can of course also just use a more conventional ML setup which explicitly outputs probabilities for every possible value of every xi.
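A minimal sketch of that variant, again with made-up names and sizes: swap the flow decoders above for one softmax head per coordinate, conditioned on λ.

```python
# Sketch of the "explicit probabilities" variant for finite-valued x_i.
import torch
import torch.nn as nn

n_values = [5, 5, 7, 3]                        # assumed dictionary size for each x_i
lam_dim = 2
heads = nn.ModuleList([nn.Linear(lam_dim, n) for n in n_values])

def log_q(x_idx, lam):
    """x_idx: (batch, m) integer codes for the x_i; returns sum_i log q_i(x_i | lambda)."""
    logps = [torch.log_softmax(heads[i](lam), dim=-1).gather(1, x_idx[:, i:i+1])
             for i in range(len(heads))]
    return torch.cat(logps, dim=-1).sum(-1)
```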
That would be pretty reasonable, but it would make the model comparison part even harder. I do need P[X] (and therefore Z) for model comparison; this is the challenge which always comes up for Bayesian model comparison.
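For context on why P[X] is the sticking point: Bayesian comparison of two models M1, M2 goes through the posterior odds, which require each model’s evidence P[X∣Mi], i.e. exactly the normalized P[X] (and hence Z) being discussed.

```latex
\frac{P[M_1 \mid X]}{P[M_2 \mid X]}
  = \frac{P[X \mid M_1]}{P[X \mid M_2]} \cdot \frac{P[M_1]}{P[M_2]}
```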
Why does it make Bayesian model comparison harder? Wouldn’t you get explicit predicted probabilities for the data X from any two models you train this way? I guess you do need to sample from the Gaussian in λ a few times for each X and pass the result through the flow models, but that shouldn’t be too expensive.
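For concreteness, the estimate being described here, as I read it (with μ(X) the learned Gaussian mean and K a handful of samples per datapoint):

```latex
\widehat{P}[X] = \frac{1}{K}\sum_{k=1}^{K}\prod_i q_i(x_i \mid \lambda_k),
\qquad \lambda_k \sim \mathcal{N}\big(\mu(X),\, I\big)
```

As written this averages reconstruction probabilities under p(λ∣X); whether that coincides with the P[X] needed for the evidence ratio above depends on how λ and its prior are set up.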