0.868 Bits of Evidence In Favor of Futarchy

So, I put up some prediction markets on the outcomes of quantified-self RCTs. I ran two of the experiments and scored both markets on the results.

How much should the markets' performance change our opinion about the viability of using prediction platforms to predict RCT results, and thus about their plausible usefulness for selecting experiments to run and actions to perform?

One issue here is that we only observed two datapoints: the log scores from the markets on the Pomodoro method (−0.326) and the Vitamin D₃ experiment (−0.333). Qualitatively these are already fairly reassuring, because both are far better than the chance score of −0.69 (the log score of a maximally uncertain 50% prediction). But if we want to quantify the amount of information we’ve gained, we can do that by performing a Bayesian update.
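
(As a concrete sketch: assuming the markets resolve to a binary outcome and are scored with the natural-log score, which the chance baseline of −0.69 ≈ ln(0.5) suggests, the two scores above correspond to the markets putting roughly 72% on what actually happened. This is my illustration, not the scoring code actually used:)

import numpy as np

def log_score(p_realized: float) -> float:
    # Natural-log score: ln of the probability assigned to the outcome that occurred.
    return float(np.log(p_realized))

print(log_score(0.5))   # chance baseline: ln(0.5) ≈ -0.693
print(log_score(0.72))  # ≈ -0.329, roughly the level of the two market scores above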

For that, we need a prior. What prior to choose? I tried fiddling around a bit with the exponential distribution, which is the maximum entropy distribution over possible log scores (treated as non-negative distances from zero), nailed down only by its mean. It represents a state of minimal knowledge.

There’s also the Gamma distribution, which is appealing because it’s the conjugate prior for an exponentially distributed underlying data-generating process: after updating on the datapoints we again get a Gamma distribution as the posterior.
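
(For reference, that conjugate update has a simple closed form; the sketch below uses an illustrative Gamma(1, 1) prior, which is my choice and not a parameter from this post:)

def gamma_exponential_update(alpha, beta, data):
    # Gamma(alpha, beta) prior on the rate of an exponential likelihood:
    # n observations summing to s yield a Gamma(alpha + n, beta + s) posterior.
    return alpha + len(data), beta + sum(data)

print(gamma_exponential_update(1.0, 1.0, [0.326, 0.333]))  # -> (3.0, 1.659)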

But I didn’t go with that one because… I wasn’t getting the pretty results I wanted[1].

With the exponential distribution, what happened was that I’d calculate the resulting distribution after two updates, but the first update would be very aggressive and the second update would then pull away harder than the first had pulled towards the datapoints, causing a net information loss. The exponential distribution also has very long tails, resulting in a median of −1.0, which implies that the median market has a log score worse than chance. I don’t believe that to be the case. (The maxent mean implies that the mean market is only as good as chance, which I also wouldn’t believe a priori? I think?)

As for using the Gamma distribution as a prior, I simply don’t think that the underlying data-generating process is exponentially distributed, so we don’t get any advantage from conjugacy. The Gamma distribution also has two parameters, which is too many to nail down with only two datapoints.

So I decided to pick a different prior, and landed on the half-normal distribution with a half-normally distributed scale (σ ~ HalfNormal(0.5)), which has some nicer properties than the exponential distribution, especially its thin tails. But the half-normal distribution can’t be updated in closed form, so instead I had to write a short script using PyMC. Because of the missing conjugacy the resulting distribution is not a half-normal distribution, but something a lot more complicated. I can’t be bothered to try to calculate what it even is.
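
(To see what the thinner tails buy, here is a quick comparison of tail probabilities for an exponential and a half-normal with the same scale; the scale of 0.7 is picked purely for illustration:)

from scipy import stats

# P(X > x) for matched scales: the half-normal tail falls off much faster.
for x in (1.0, 2.0):
    print(x, stats.expon.sf(x, scale=0.7), stats.halfnorm.sf(x, scale=0.7))
# x = 1.0: ≈ 0.24 (exponential) vs ≈ 0.15 (half-normal)
# x = 2.0: ≈ 0.06 (exponential) vs ≈ 0.004 (half-normal)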

Summary visualization of the update:

The script[2] initializes the model with the half-normal prior, whose scale is in turn distributed as HalfNormal("sigma", sigma=0.5). We then update on the observed distances [0.326, 0.333]:

import numpy as np
import pymc as pm

distances = np.array([0.326, 0.333])  # absolute log scores of the two markets

with pm.Model() as adaptive_model:
    σ = pm.HalfNormal('sigma', sigma=0.5)  # prior on the scale of the half-normal likelihood
    obs = pm.HalfNormal('distances', sigma=σ, observed=distances)
    trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.95,
                      return_inferencedata=True)

We can then extract the posterior samples for the new scale σ

σ_samples = trace.posterior.sigma.values.flatten()  # pool the σ draws from all 4 chains
σ_mean = np.mean(σ_samples)                          # posterior mean, used as the point estimate below

and calculate the log-likelihoods, the Bayes factor, and the number of bits in the update:

from scipy import stats

null_σ = 0.7  # null hypothesis: scale set at (roughly) the chance-level absolute log score

ll_adaptive = np.sum(stats.halfnorm.logpdf(distances, scale=σ_mean))
ll_null = np.sum(stats.halfnorm.logpdf(distances, scale=null_σ))

log_bf = ll_adaptive - ll_null  # log Bayes factor, in nats
bits = log_bf / np.log(2)       # convert nats to bits
bf = np.exp(log_bf)

Running the whole script produces this output:

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 2 jobs)
NUTS: [sigma]

  Progress                                   Draws   Divergences   Step size   Grad evals   Sampling Speed    Elapsed   Remaining
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   3000    0             0.69        1            1165.53 draws/s   0:00:02   0:00:00
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   3000    0             1.12        3            1135.09 draws/s   0:00:02   0:00:00
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   3000    0             0.48        1            600.10 draws/s    0:00:04   0:00:00
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   3000    0             0.78        1            566.57 draws/s    0:00:05   0:00:00

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 5 seconds.
Posterior summary:
        mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
sigma  0.434  0.199   0.156    0.811      0.004    0.004    2582.0    2939.0    1.0

Posterior mean σ: 0.434

vs Null σ = 0.7:
  Log likelihood (adaptive): 0.642
  Log likelihood (null): 0.040
  Evidence: 0.87 bits
  Bayes factor: 1.8:1 in favor of adaptive

Posterior mean σ: 0.434
95% credible interval: [0.156, 0.844]

Evidence: 0.868 bits
(vs null hypothesis σ = 0.7)
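
(A quick sanity check on those numbers, my own arithmetic rather than part of the script:)

import numpy as np

log_bf = 0.642 - 0.040     # difference of the reported log-likelihoods, in nats
print(log_bf / np.log(2))  # ≈ 0.868 bits
print(np.exp(log_bf))      # ≈ 1.83, i.e. roughly 1.8:1 in favor of the adaptive model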

Thus: 0.868 bits in favor of futarchy[3].

Acknowledgements

Many thanks to clippy (twitter) for M500, and Tetraspace (twitter) for M1000, which I used to subsidize the markets. Also many thanks to the Manifold admin Genzy for subsidizing each market with M450.

Your funding of the sciences is greatly appreciated.

My gratitude also goes out to all the traders on the markets. You help me prioritize, you help us gain knowledge.


  1. ↩︎

    This is very bad statistical practice. I’m doing this because I want a cutesy title with a positive number of bits as an update, and because I wanted to learn how to do Bayesian updating using computers.

  2. ↩︎

    Code here. Thanks to Claude 4 Sonnet for writing the code and walking me through the process.

  3. ↩︎

    Under several favorably cherry-picked assumptions. Don’t @ me.
