Tl;dr: I don’t think this post stands up to close scrutiny, although there may be unknown knowns anyway. This is partly due to a couple of things in the original paper which I think are a bit misleading for the purposes of analysing the markets.
The unknown knowns claim is based on 3 patterns in the data:
“The mean prediction market belief of replication is 63.4%, the survey mean was 60.6% and the final result was 61.9%. That’s impressive all around.”
“Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.”
“None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality.”
Taking these in reverse order:
I don’t think that there is as clear a distinction between successful and unsuccessful replications as stated in the OP:
“None of the studies that failed to replicate came close to replicating”
This assertion is based on a statement in the paper:
“Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.”
However this doesn’t necessarily support the claim of a dichotomy – the average being close to 0 doesn’t imply that all the results were close to 0, nor that every successful replication passed cleanly. If you ignore the colours, this graph from the paper suggests that the normalised effect sizes are more of a continuum than a clean cut (the central panel, b, is the relevant chart).
Eyeballing that graph, there is 1 failed replication which nearly succeeded and 4 successful ones which could have failed. If the effect size had shifted by less than 1 S.D. (for some of them, less than 0.5 S.D.) then a success would have become a failure or vice-versa (although some might then have passed at stage 2).
Of the 5 replications noted above, the 1 which nearly passed was ranked last by market belief, and the 4 which nearly failed were ranked 3rd, 4th, 5th and 7th. If any of these had gone the other way it would have ruined the beautiful monotonic result.
According to the planned procedure, the 1 study which nearly passed replication should have been counted as a pass: it successfully replicated in stage 1 and should not have proceeded to stage 2, where the significance disappeared. I think it is right to count this as an overall failed replication, but for the sake of analysing the market it should be listed as a success.
Having said that, the pattern is still a very impressive result which I look into below.
The OP notes that there is a good match between the mean market belief of replication and the actual fraction of successful replications. To me this doesn’t really say much about whether the participants in the market were under-confident or not. If they were to suddenly become more confident then the mean market belief could easily move away from the result.
If the market is under-confident, it seems like one could buy options in all the markets trading above 0.5 and sell options in all the ones below and expect to make a profit. If I did this then I would buy options in 16⁄21 (76%) of markets and would actually increase the mean market belief away from the actual percentage of successful replications. By this metric becoming more confident would lower accuracy.
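For concreteness, here is a minimal sketch of that strategy, assuming each contract pays out 1 if the study replicates and 0 otherwise; the prices and outcomes below are placeholders, not the actual market data. The threshold parameter also covers the 0.6 variant I discuss further down.

```python
def strategy_profit(prices, outcomes, threshold=0.5):
    """Buy every contract priced above `threshold`, sell every contract
    priced below `1 - threshold`; each pays 1 if the study replicated."""
    profit = 0.0
    for price, replicated in zip(prices, outcomes):
        if price > threshold:          # buy: pay the price, receive the payout
            profit += replicated - price
        elif price < 1 - threshold:    # sell: receive the price, pay the payout
            profit += price - replicated
    return profit

# Placeholder data for illustration only:
prices = [0.85, 0.63, 0.41, 0.55, 0.72]
outcomes = [1, 1, 0, 0, 1]
print(strategy_profit(prices, outcomes))       # naive 0.5 cut-off
print(strategy_profit(prices, outcomes, 0.6))  # ignore the 40-60% range
```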
In a similar vein, I also don’t think Spearman coefficients can tell us much about over/under-confidence. Spearman coefficients are based on rank order so if every option on the market became less/more confident by the same amount, the Spearman coefficients wouldn’t change.
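A quick sketch of why (with hypothetical numbers): shift every belief by the same amount and the ranks, and hence the coefficient, are unchanged.

```python
from scipy.stats import spearmanr

beliefs = [0.85, 0.72, 0.63, 0.55, 0.41]  # hypothetical market beliefs
effects = [1.2, 0.9, 1.1, 0.3, -0.1]      # hypothetical replication effect sizes
shifted = [b - 0.1 for b in beliefs]      # everyone 10 points less confident

rho1, _ = spearmanr(beliefs, effects)
rho2, _ = spearmanr(shifted, effects)
print(rho1 == rho2)  # True: the rank order is unchanged by the uniform shift
```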
Notwithstanding the above, the graph in the OP still looks to me as though the market is under-confident. If I were to buy an option in every study with market belief >0.5 and sell in every study <0.5 I would still make a decent profit when the market resolved. However it is not clear whether this is a consistent pattern across similar markets.
Fortunately the paper also includes data on 2 other markets (success in stage 1 of the replication based on 2 different sets of participants) so it is possible to check whether these markets were similarly under-confident. 
If I performed the same action of buying and selling depending on market belief I would make a very small gain in one market and a small loss in the other. This does not suggest that there is a consistent pattern of under-confidence.
It is possible to check for calibration across the markets. I split the 63 market predictions (3 markets × 21 studies) into 4 groups depending on the level of market belief: 50-60%, 60-70%, 70-80% and 80-100% (any market beliefs with p<50% are converted to 1-p for grouping).
For beliefs of 50-60% confidence, the market was correct 29% of the time. Across the 3 markets this varied from 0-50% correct.
For beliefs of 60-70% confidence, the market was correct 93% of the time. Across the 3 markets this varied from 75-100% correct.
For beliefs of 70-80% confidence, the market was correct 78% of the time. Across the 3 markets this varied from 75-83% correct.
For beliefs of 80-100% confidence, the market was correct 89% of the time. Across the 3 markets this varied from 75-100% correct.
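A sketch of this binning procedure, assuming lists of the 63 beliefs and their binary outcomes (not reproduced here):

```python
def calibration_by_band(beliefs, outcomes):
    """Fold beliefs below 50% onto 1-p (flipping what they predict),
    then report the fraction correct within each confidence band."""
    bands = {"50-60%": [], "60-70%": [], "70-80%": [], "80-100%": []}
    for p, replicated in zip(beliefs, outcomes):
        predicted = p >= 0.5             # does the market predict replication?
        confidence = p if predicted else 1 - p
        correct = bool(replicated) == predicted
        if confidence < 0.6:   bands["50-60%"].append(correct)
        elif confidence < 0.7: bands["60-70%"].append(correct)
        elif confidence < 0.8: bands["70-80%"].append(correct)
        else:                  bands["80-100%"].append(correct)
    return {band: sum(hits) / len(hits) for band, hits in bands.items() if hits}
```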
We could claim that anything which the markets put in the 50-60% range is genuinely uncertain, but that everything above 60% should just have its probability adjusted up to at least 75%, maybe something like an 80-85% chance.
If I perform the same buying/selling that I discussed previously but set my limit to 0.6 instead of 0.5 (i.e. don’t buy or sell in the range 40%-60%) then I would make a tidy profit in all 3 markets.
But I’m not completely persuaded. Essentially there is only one range which differs significantly from what a well-calibrated market would give (p=0.024, two-tailed binomial). If I adjust for multiple hypothesis testing this is no longer significant. There is some Bayesian evidence here but not enough to completely persuade me.
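For what it’s worth, a reconstruction of that test might look like the sketch below. The counts are my assumptions, not figures from the post: if the 50-60% band held 21 of the 63 predictions, then 29% correct would be 6 of 21, tested against a calibrated accuracy of roughly the band midpoint.

```python
from scipy.stats import binomtest

# Assumed counts: 6 correct out of 21 in the 50-60% band, tested against a
# well-calibrated accuracy of ~55% (the band midpoint).
print(binomtest(k=6, n=21, p=0.55, alternative='two-sided').pvalue)
```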
I don’t think the paper in question provides sufficient evidence to conclude that there are unknown knowns in predicting study replication. It is good to know that we are fairly good at predicting which results will replicate but I think the question of how well calibrated we are remains an open topic.
Hopefully the replication markets study will give more insights into this.
 The replication was performed in 2 stages. The first was intended to have a 95% chance of finding an effect size of 75% of the original finding. If the study replicated here it stopped and was ticked off as a successful replication. Those that didn’t replicate in stage 1 proceeded to stage 2, where the sample size was increased in order to have a 95% chance of finding effect sizes at 50% of the original finding.
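The paper’s actual tests varied by study, but as a rough sketch of the kind of power calculation behind this design (assuming a two-sample t-test and an illustrative original effect of d = 0.6, both my assumptions):

```python
from statsmodels.stats.power import TTestIndPower

# Stage 1: sample size per group for 95% power to detect 75% of a
# hypothetical original effect of d = 0.6, at alpha = 0.05.
n = TTestIndPower().solve_power(effect_size=0.75 * 0.6, power=0.95, alpha=0.05)
print(n)  # participants per group
```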
 Fig 7 in the supplementary information shows the same graph as in the OP but based on Treatment 1 market beliefs, which relate to stage 1 predictions. This still looks quite impressively monotonic. However the colouring is misleading for analysing market success, since it reflects success after stage 2 of the replication while the market was predicting stage 1. If this is corrected then the graph looks a lot less monotonic, flipping the results for Pyc & Rawson (6th), Duncan et al. (8th) and Ackerman et al. (19th).
In an innovation workshop we were taught the following technique:
Make a list of 6 things your company is good at
Make a list of 6 applications of your product(s)
Make a list of 6 random words (Disney characters? City names?)
Roll 3 dice and select the corresponding words from the lists. Think about those 3 words and see what ideas you can come up with based on them.
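A minimal sketch of the mechanics, with made-up lists (the point being that the constraint is random, not which particular words you use):

```python
import random

# Made-up example lists; any 6 items per list will do.
strengths    = ["logistics", "branding", "prototyping", "data", "support", "sales"]
applications = ["home", "office", "travel", "health", "education", "gaming"]
random_words = ["Aladdin", "Tokyo", "Mulan", "Paris", "Simba", "Cairo"]

# Equivalent of rolling 3 dice: pick one word at random from each list.
prompt = [random.choice(lst) for lst in (strengths, applications, random_words)]
print("Brainstorm around:", prompt)
```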
Everyone I spoke to agreed that this was the best technique we were taught. I knew constrained creativity was a thing, but using this technique really drove the point home. I don’t think this is quite the same as traditional divination (e.g. you can repeat the exercise a few times and then choose your best idea) but I wonder if it relies on similar principles.
Fun fact: 7 survey respondents attempted to convert the number of minutes between them and their twin into a fraction of a year (e.g. 9.506E-06 years is 5 minutes). All 7 who did this were the older twin.
(I did include these people in the analysis above)
This provides evidence for the “Older twins care about being the oldest, younger twins don’t talk about it” hypothesis. I don’t think this will come as a massive surprise to anyone.
I understand that the price to swap birth order with your twin is a bowl of soup, although adjusting for 1% yearly inflation over 4000 years this now comes to 193 quadrillion bowls of soup.
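Both conversions check out, for anyone who wants to verify:

```python
minutes_per_year = 365.25 * 24 * 60
print(5 / minutes_per_year)  # ≈ 9.506e-06 years, matching the survey answers
print(1.01 ** 4000)          # ≈ 1.93e17: about 193 quadrillion bowls of soup
```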
I have mixed feelings about this. On one hand, I’m glad you wrote it as openness seems like the first step to knowledge. On the other, I think you’re dealing with your evidence wrongly.
To me it feels like you’ve been discovering something new (rationality) and found a way to fit it into your existing belief system. On the inside this feels like it confirms your belief system, but from the outside it looks like privileging the hypothesis. One of the main things I got from Thinking, Fast and Slow was that being able to tell ourselves a convincing story feels like discovering the truth, but actually the convincingness of the story is orthogonal to its truth.
If we grant that Christians invented science then maybe this can be counted as evidence for Christianity, but is it strong evidence? A rough estimate might be that 1 in 6 people who have ever lived were Christian, so I don’t think it should be overly surprising that one of them was the inventor. I know this is a horrendous method for choosing a prior, but it gives an indication that evidence of what Christians have done in the past is unlikely to be strong evidence either way.
If you count this as evidence for Christianity then you need to count similar evidence too. Should the other historical figures before the 12th century who contributed to science and maths count as evidence that their beliefs are true? Compared to the number of Christians who have ever lived, the number of ancient Greeks who ever lived is tiny so it is incredible that they got as far as they did.
To someone looking in from the outside, claiming that Christianity is different because it gave a reason for believing the world would be consistent again seems like privileging the hypothesis. Those other ancient figures seemed to assume that the world would be consistent even without Christianity so even in your belief system there doesn’t seem to be an a priori reason to believe that they couldn’t have invented the scientific method.
It took 12 centuries after Christ to invent the scientific method, so it would also seem that believing in Christianity wasn’t a massively strong driver towards inventing it.
To put my cards on the table, until a couple of years ago I was in a similar situation to you. I believed in Christianity and didn’t expect ever to be dissuaded.
I’m not sure that I can pinpoint exactly what changed for me. One big part of it was the realisation that I didn’t have to believe or not believe in Christianity − 0 and 1 are not probabilities. What’s more, I realised I already didn’t 100% believe in everything in Christianity—there were already plenty of things that I found incredibly confusing but kinda just accepted because they were part of a parcel of beliefs. I guess you might be similar but may have different issues—mine included the trinity, free will vs God’s sovereignty, differences between the New and Old Testaments, suffering, and the number of fertilised eggs which never even implant into the womb (I know, that one is probably fairly idiosyncratic).
When I allowed myself to see my belief in Christianity on a scale I was able to modify how much I believed it based on evidence I saw. Before that any new evidence was judged on whether it allowed me to believe Christianity rather than whether it encouraged me to believe. I should note that from a Christian point of view this seems to be a virtue not a vice—Christianity seems to imply that you should only believe in Christianity if it is true so looking accurately at the evidence should be encouraged.
Over a few months my belief slowly waned as more evidence came in. I think the tipping point for me was realising how badly designed human intelligence is. The likelihood of God inventing something so poor in absolute terms to be the pinnacle of his creation was enough to push me over the edge. Again, this is probably fairly idiosyncratic!
I’m not sure exactly what you were hoping for in response to your introduction but I hoped my experience might be interesting to you.
It’s notable that, for countries where anti-social punishment is significant, the mean contribution across rounds doesn’t depend much on the level of anti-social punishment but more on the contributions in the first round; for the 7 countries with the most total anti-social punishment, their lines are all fairly flat.
Mean contribution and anti-social punishment (eyeballed from fig 2B in the report) aren’t correlated within the group of 7 (R² = 0.08).
Mean contribution and initial contribution (eyeballed from fig 2A) are correlated within the group at R² = 0.99!
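The computation itself is straightforward once the values are read off the figures; a sketch with placeholder numbers (not the actual eyeballed data):

```python
from scipy.stats import linregress

# Placeholder values standing in for the 7 countries' eyeballed figures.
initial_contribution = [5.2, 7.8, 9.5, 4.1, 6.3, 8.8, 7.0]
mean_contribution    = [5.0, 7.5, 9.2, 4.0, 6.1, 8.5, 6.8]

r_squared = linregress(initial_contribution, mean_contribution).rvalue ** 2
print(r_squared)
```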
So in countries with low rule of law you are stuck in whatever position you start in. Pity poor Istanbul, which wasn’t really that bad at anti-social punishment but started at a low level of contribution and so was stuck there.
Imagine the experimenters had lied to each participant about what happened in round 1 to make it seem like everyone else was contributing more. Would the players stay at the high (made-up) rate of contributions for the rest of the 9 rounds?
Would it be fair to say that any historical data on successful scientists/mathematicians will be over-represented by firstborns due to primogeniture inheritance laws and customs? Historically those involved in the sciences mainly had to be independently wealthy; being a firstborn would tend to help with that, with later-borns more likely to have to work for a living. Maybe famous historical lawyers would tend to be under-represented by firstborns?
I’d expect this to be a fairly large selection effect similar in size to the Less Wrong survey but presumably caused by a different mechanism.
Possibly a data set which would have more bearing on the question of birth order effects in modern times would be Fields medal, Abel prize, Turing award, Nobel prizes in Physics, Chemistry, Medicine and Economics in the last 30 years or so—I don’t have a great feel for how long ago the primogeniture inheritance thingy stopped being relevant but given an average Nobel laureate age of 59 this would mean people born since ~1930. These might be easier to find data on than Thales of Miletus too!!