[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth 23 Apr 2024 22:19 UTC

204 points

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity’s trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: “singular” discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn’t very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia’s list of multiple discoveries.

To that end: what are some examples of discoveries which nobody else was anywhere close to figuring out?

A few tentative examples to kick things off:

Shannon’s information theory. The closest work I know of (notably Nyquist) was 20 years earlier, and had none of the core ideas of the theorems on fungibility of transmission. In the intervening 20 years, it seems nobody else got importantly closer to the core ideas of information theory.
Einstein’s special relativity. Poincaré and Lorentz had the math 20 years earlier IIRC, but nobody understood what the heck that math meant. Einstein brought the interpretation, and it seems nobody else got importantly closer to that interpretation in the intervening two decades.
Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.
Pasteur’s work on the germ theory of disease. There had been both speculative theories and scattered empirical results as precedent decades earlier, but Pasteur was the first to bring together the microscope observations, theory, highly compelling empirical results, and successful applications. I don’t know of anyone else who was close to putting all the pieces together, despite the obvious prerequisite technology (the microscope) having been available for two centuries by then.

(Feel free to debate any of these, as well as others’ examples.)

116 comments, 1 min read

World Modeling

kromem 23 Apr 2024 23:16 UTC
132 points
43
Lucretius in De Rerum Natura in 50 BCE seemed to have a few that were just a bit ahead of everyone else.

Survival of the fittest (book 5):

“In the beginning, there were many freaks. Earth undertook Experiments—bizarrely put together, weird of look Hermaphrodites, partaking of both sexes, but neither; some Bereft of feet, or orphaned of their hands, and others dumb, Being devoid of mouth; and others yet, with no eyes, blind. Some had their limbs stuck to the body, tightly in a bind, And couldn’t do anything, or move, and so could not evade Harm, or forage for bare necessities. And the Earth made Other kinds of monsters too, but in vain, since with each, Nature frowned upon their growth; they were not able to reach The flowering of adulthood, nor find food on which to feed, Nor be joined in the act of Venus.

For all creatures need Many different things, we realize, to multiply And to forge out the links of generations: a supply Of food, first, and a means for the engendering seed to flow Throughout the body and out of the lax limbs; and also so The female and the male can mate, a means they can employ In order to impart and to receive their mutual joy.

Then, many kinds of creatures must have vanished with no trace Because they could not reproduce or hammer out their race. For any beast you look upon that drinks life-giving air, Has either wits, or bravery, or fleetness of foot to spare, Ensuring its survival from its genesis to now.”

Trait inheritance from both parents that could skip generations (book 4):

“Sometimes children take after their grandparents instead, Or great-grandparents, bringing back the features of the dead. This is since parents carry elemental seeds inside – Many and various, mingled many ways – their bodies hide Seeds that are handed, parent to child, all down the family tree. Venus draws features from these out of her shifting lottery – Bringing back an ancestor’s look or voice or hair. Indeed These characteristics are just as much the result of certain seed As are our faces, limbs and bodies. Females can arise From the paternal seed, just as the male offspring, likewise, Can be created from the mother’s flesh. For to comprise A child requires a doubled seed – from father and from mother. And if the child resembles one more closely than the other, That parent gave the greater share – which you can plainly see Whichever gender – male or female – that the child may be.”

Objects of different weights will fall at the same rate in a vacuum (book 2):

“Whatever falls through water or thin air, the rate Of speed at which it falls must be related to its weight, Because the substance of water and the nature of thin air Do not resist all objects equally, but give way faster To heavier objects, overcome, while on the other hand Empty void cannot at any part or time withstand Any object, but it must continually heed Its nature and give way, so all things fall at equal speed, Even though of differing weights, through the still void.”

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me. In hindsight, they nailed so many huge topics that didn’t end up emerging again for millennia that it was surely not mere chance, and the fact that they successfully hit so many nails on the head without the hammer we use today indicates (at least to me) that there’s value to looking closer at their methodology.

Which was also super simple:

Step 1: Entertain all possible explanations for things, not prematurely discounting false negatives or embracing false positives.

Step 2: Look for where single explanations can explain multiple phenomena.

While we have a great methodology for testable hypotheses, the scientific method isn’t very useful for untestable fields or topics. And in those cases, I suspect better understanding and appreciation for the Epicurean methodology might yield quite successful ‘counterfactual’ results (it’s served me very well throughout the years, especially coupled with the identification of emerging research trends in things that can be evaluated with the scientific method).
What links here?
- Looking beyond Everett in multiversal views of LLMs by kromem (29 May 2024 12:35 UTC; 10 points)
- Was the historical Jesus talking about proto-evolution? (You might be surprised) by kromem (1 Apr 2025 10:32 UTC; 4 points)
- Garrett Baker 24 Apr 2024 4:14 UTC
  20 points
  4
  Parent
  A precursor to Lucretius’s thoughts on natural selection is Empedocles. Far fewer of his writings survive, but he is clearly a precursor to Lucretius’s position. Lucretius himself cites & praises Empedocles on this subject.
  - kromem 25 Apr 2024 1:17 UTC
    2 points
    0
    Parent
    Do you have a specific verse where you feel like Lucretius praised him on this subject? I only see that he praises him relative to other elementalists before tearing him and the rest apart for what he sees as erroneous thinking regarding their prior assertions around the nature of matter, saying:
    
    “Yet when it comes to fundamentals, there they meet their doom. These men were giants; when they stumble, they have far to fall:”
    
    (Book 1, lines 740-741)
    
    I agree that he likely was a precursor to the later thinking in suggesting a compository model of life starting from pieces which combined into forms later on, but the lack of the source material makes it hard to truly assign credit.
    
    It’s kind of like how the Greeks claimed atomism originated with the much earlier Mochus of Sidon, but we credit Democritus because we don’t have proof of Mochus at all but we do have the former’s writings. We don’t even so much credit Leucippus, Democritus’s teacher, as much as his student for the same reasons, similar to how we refer to “Plato’s theory of forms” and not “Socrates’ theory of forms.”
    
    In any case, Lucretius oozes praise for Epicurus, comparing him to a god among men, and while he does say Empedocles was far above his contemporaries saying the same things he was, he doesn’t seem overly deferential to his positions as much as criticizing the shortcomings in the nuances of their theories with a special focus on theories of matter. I don’t think there’s much direct influence on Lucretius’s thinking around proto-evolution, even if there’s arguably plausible influence on Epicurus’s which in turn informed Lucretius.
    - Garrett Baker 25 Apr 2024 17:20 UTC
      2 points
      0
      Parent
      [edit: nevermind I see you already know about the following quotes. There’s other evidence of the influence in Sedley’s book I link below]
      In De Rerum Natura around line 716:
      Add, too, whoever make the primal stuff Twofold, by joining air to fire, and earth To water; add who deem that things can grow Out of the four- fire, earth, and breath, and rain; As first Empedocles of Acragas, Whom that three-cornered isle of all the lands Bore on her coasts, around which flows and flows In mighty bend and bay the Ionic seas, Splashing the brine from off their gray-green waves. Here, billowing onward through the narrow straits, Swift ocean cuts her boundaries from the shores Of the Italic mainland. Here the waste Charybdis; and here Aetna rumbles threats To gather anew such furies of its flames As with its force anew to vomit fires, Belched from its throat, and skyward bear anew Its lightnings’ flash. And though for much she seem The mighty and the wondrous isle to men, Most rich in all good things, and fortified With generous strength of heroes, she hath ne’er Possessed within her aught of more renown, Nor aught more holy, wonderful, and dear Than this true man. Nay, ever so far and pure The lofty music of his breast divine Lifts up its voice and tells of glories found, That scarce he seems of human stock create.
      Or for a more modern translation from Sedley’s Lucretius and the Transformation of Greek Wisdom
      Of these [sc. the four-element theorists] the foremost is Empedocles of Acragas, born within the three-cornered terrestrial coasts of the island [Sicily] around which the Ionian Sea, flowing with its great windings, sprays the brine from its green waves, and from whose boundaries the rushing sea with its narrow strait divides the coasts of the Aeolian land with its waves. Here is destructive Charybdis, and here the rumblings of Etna give warning that they are once more gathering the wrath of their flames so that her violence may again spew out the fire flung from her jaws and hurl once more to the sky the lightning flashes of flame. Although this great region seems in many ways worthy of admiration by the human races, and is said to deserve visiting for its wealth of good things and the great stock of men that fortify it, yet it appears to have had in it nothing more illustrious than this man, nor more holy, admirable, and precious. What is more, the poems sprung from his godlike mind call out and expound his illustrious discoveries, so that he scarcely seems to be born of mortal stock.
- Lukas_Gloor 24 Apr 2024 12:54 UTC
  8 points
  4
  Parent
  Very cool! I used to think Hume was the most ahead of his time, but this seems like the same feat if not better.
  - dr_s 25 Apr 2024 7:46 UTC
    5 points
    0
    Parent
    Democritus also has a decent claim to that for being the first to imagine atoms and materialism altogether.
    - kromem 26 Apr 2024 1:43 UTC
      4 points
      2
      Parent
      Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon.

      Though when it comes to antiquity, credit isn’t really “first to publish” as much as “first of the last to pass the survivorship filter.”
- francis kafka 24 Apr 2024 2:43 UTC
  8 points
  0
  Parent
  Have you read Michel Serres’s The Birth of Physics? He suggests that the Epicureans, and Lucretius in particular, worked out a serious theory of physics that’s closer to thermodynamics and fluid mechanics than to Newtonian physics.
- Q Home 24 Apr 2024 6:00 UTC
  2 points
  0
  Parent
  
  Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me.
  
  The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.
  - Self 30 Apr 2024 7:04 UTC
    6 points
    0
    Parent
    ~Don’t aim for the correct solution, (first) aim for understanding the space of possible solutions
DirectedEvolution 24 Apr 2024 4:32 UTC
38 points
5
A singleton is hard to verify unless there was a long period of time after its discovery during which it was neglected, as in the case of Mendel.
Yet if your discovery is neglected in this way, the context in which it is eventually rediscovered matters as well. In Mendel’s case, his laws were rediscovered by several other scientists decades later. Mendel got priority, but it still doesn’t seem like his accomplishment had much of a counterfactual impact.
In the case of Shannon, Einstein, etc, it’s possible their fields were “ripe and ready” for what they accomplished—as perhaps evidenced by the fact that their discoveries were accepted—and that they were simply plugged in enough to their research communities during a period of faster global dissemination of knowledge that any hot-on-heels competitors never quite got a chance to publish. But I don’t know enough about these cases to be confident.
I can think of a couple cases in which I might be convinced of this sort of counterfactual impact from a scientific singleton:
- All peers in a small, tight-knit research community explicitly stated none of them were even close (though even this is hard to trust—are they being gracious? how do they know their own students wouldn’t have figured it out in another year’s time?). Do we have any such testimonials for Shannon, Einstein, etc?
- The discovery was actually lost, then discovered and immediately appreciated for its significance. Imagine a math proof written in a mathematician’s papers, lost on their death, rediscovered in an antique shop 40 years later, and immediately heralded as a major advance—like if we’d found a proof by Fermat of Fermat’s Last Theorem in an attic in 1950.
- Money was the bottleneck. There are many places a billion dollars can be put into research. If somebody launches a billion-dollar research institute in an underfunded subject that’s been languishing for decades and the institute they founded starts coming up with major technical advances, that’s evidence it was a game-changer. Of course it’s possible that billionaire put their money into the field because they had information that the research was coming to fruition and they wanted to get in on something hot, but I probably have more trouble believing they could make such a prediction so accurately than that their money made a counterfactual impact.
A discovery can also be “counterfactually important” even if it only speeds up science a bit and is only slightly a singleton. Let’s say that every year, there’s one important scientific discovery and a million unimportant ones, and the important ones must be discovered in sequence. If you discover 2025’s important discovery in 2024, all the future important discoveries in the sequence also arrive a year earlier. If each discovery is worth $1 billion/year, then you’ve now created 1 billion counterfactual dollars per year, every year, for as long as this model holds.
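The arithmetic in that toy model can be checked with a short sketch. Everything here is an illustrative assumption rather than part of the original argument: the function names, the finite horizon, and the simplification that each discovery pays a flat amount for every year after it arrives.

```python
def total_value(first_arrival_year: int, value_per_year: float, horizon_years: int) -> float:
    """Total value inside the horizon when discovery k arrives in year
    first_arrival_year + k and then pays value_per_year for each remaining year."""
    total = 0.0
    for k in range(horizon_years):
        arrival = first_arrival_year + k
        years_paying = max(0, horizon_years - arrival)
        total += years_paying * value_per_year
    return total


def counterfactual_value(years_early: int, value_per_year: float, horizon_years: int) -> float:
    """Extra value from shifting the whole sequential chain earlier by years_early."""
    baseline = total_value(years_early, value_per_year, horizon_years)
    accelerated = total_value(0, value_per_year, horizon_years)
    return accelerated - baseline


# One year of acceleration, $1 billion/year per discovery, ten-year horizon.
gain = counterfactual_value(years_early=1, value_per_year=1e9, horizon_years=10)
print(gain / 10)  # average counterfactual dollars per year over the horizon
```

Under these assumptions the gain grows linearly with the horizon, which is the "every year as long as this model holds" part of the claim.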
Garrett Baker 23 Apr 2024 23:33 UTC
32 points
2
Possibly Watanabe’s singular learning theory. The math is recent for math, but I think only like ’70s recent, which is long given you’re impressed by a 20-year math gap for Einstein. The first book was published in 2010, and the second in 2019, so possibly attributable to the deep learning revolution, but I don’t know of anyone making the same math, except empirical stuff like the “neuron theory” of neural network learning which I was told about by you, empirical results like those here, and high-dimensional probability (which I haven’t read, but whose cover alone indicates similar content).
- Leon Lang 25 Apr 2024 13:49 UTC
  3 points
  0
  Parent
  I guess (but don’t know) that most people who downvote Garrett’s comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it.
- tailcalled 25 Apr 2024 6:20 UTC
  2 points
  −3
  Parent
  Isn’t singular learning theory basically just another way of talking about the breadth of optima?
  - Alexander Gietelink Oldenziel 25 Apr 2024 10:50 UTC
    8 points
    2
    Parent
    Singular Learning Theory is another way of “talking about the breadth of optima” in the same sense that Newton’s Universal Law of Gravitation is another way of “talking about Things Falling Down”.
    - tailcalled 25 Apr 2024 13:39 UTC
      4 points
      0
      Parent
      Newton’s Universal Law of Gravitation was the first highly accurate model of things falling down that generalized beyond the earth, and it is also the second-most computationally applicable model of things falling down that we have today.
      
      Are you saying that singular learning theory was the first highly accurate model of breadth of optima, and that it’s one of the most computationally applicable ones we have?
      - Alexander Gietelink Oldenziel 25 Apr 2024 15:42 UTC
        14 points
        4
        Parent
        Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!
        But also yes… I think I am saying that
        Singular Learning Theory is the first highly accurate model of breadth of optima.
        SLT tells us to look at a quantity Watanabe calls $λ$, which has the highly technical name ‘real log canonical threshold’ (RLCT). He proves several equivalent ways to describe it, one of which is as the (fractal) volume scaling dimension around the optima.
        By computing simple examples (see Shaowei’s guide in the links below) you can check for yourself how the RLCT picks up on basin broadness.
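As a rough illustration of the volume-scaling description (not taken from the guide mentioned above), here is a sketch that fits the exponent in V(eps) ~ eps^lambda for two one-parameter losses. The grid, the eps range, and the function name are my own assumptions. In one dimension the quadratic loss w^2 should give about 1/2 and the degenerate loss w^4 about 1/4, so the flatter basin gets the smaller exponent.

```python
import numpy as np


def volume_scaling_exponent(loss, eps_values, grid):
    """Fit the slope of log V(eps) against log eps, where V(eps) is the
    fraction of grid points whose loss is below eps."""
    losses = loss(grid)
    volumes = np.array([(losses < eps).mean() for eps in eps_values])
    slope, _intercept = np.polyfit(np.log(eps_values), np.log(volumes), 1)
    return slope


grid = np.linspace(-1.0, 1.0, 200_001)   # parameter space [-1, 1], minimum at w = 0
eps_values = np.logspace(-5, -2, 7)      # loss thresholds to probe

lam_regular = volume_scaling_exponent(lambda w: w**2, eps_values, grid)
lam_degenerate = volume_scaling_exponent(lambda w: w**4, eps_values, grid)
print(lam_regular, lam_degenerate)
```

This brute-force counting only works in very low dimension; it is meant to make the definition concrete, not to suggest how the quantity is computed for real networks.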
        The RLCT = $λ$ first-order term for in-distribution generalization error and also Bayesian learning (technically the ‘Bayesian free energy’). This justifies the name of ‘learning coefficient’ for lambda. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions.
        Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won’t be going into it, but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks or any kind of hierarchical structure is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about Laplace approximation.
        It’s one of the most computationally applicable ones we have? Yes. SLT quantities like the RLCT can be analytically computed for many statistical models of interest, correctly predict phase transitions in toy neural networks, and it can be estimated at scale.
        EDIT: no hype about future work. Wait and see ! :)
        Lucius Bushnaq 25 Apr 2024 23:12 UTC
        18 points
        0
        Parent
        The RLCT = $λ$ first-order term for in-distribution generalization error
        
        Clarification: The ‘derivation’ for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don’t like this derivation very much. See e.g. this one on Wikipedia.
        So what it’s actually showing is just that:
        If you’ve got a class of different hypotheses $M$, containing many individual hypotheses $\{θ_{1}, θ_{2}, \dots, θ_{N}\}$.
        And you’ve got a prior ahead of time that says the chance any one of the hypotheses in $M$ is true is some number $p (M) < 1$; let’s say it’s $p (M) = 0.8$ as an example.
        And you distribute this total probability $p (M) = 0.8$ around the different hypotheses in an even-ish way, so $p (θ_{i}, M) \propto \frac{1}{N}$, roughly.
        And then you encounter a bunch of data $X$ (the training data) and find that only one or a tiny handful of hypotheses in $M$ fit that data, so $p (X | θ_{i}, M) \neq 0$ for basically only one hypothesis $θ_{i}$…
        Then your posterior probability $p (M | X) = \frac{p (X | M) 0.8}{0.8 p (X | M) + 0.2 p (X | \neg M)}$ that the hypothesis $θ_{i}$ is correct will probably be tiny, scaling with $\frac{1}{N}$ . If we spread your prior $p (M) = 0.8$ over lots of hypotheses, there isn’t a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in M except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless $p (X | \neg M)$ is really small, i.e. no hypothesis outside the set $M$ can explain the data either.
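To make the scaling concrete, here is a small numeric sketch of that posterior. The likelihood values are made-up assumptions (the surviving hypothesis explains the data perfectly, and hypotheses outside M explain it with probability 1e-3), and the function name is hypothetical.

```python
def posterior_of_class(prior_class: float, n_hypotheses: int,
                       lik_surviving: float, lik_outside: float) -> float:
    """p(M | X) when p(M) is spread evenly over N hypotheses and only one of
    them assigns non-zero likelihood to the data X."""
    # p(X | M) = sum_i p(X | theta_i, M) * p(theta_i | M) = lik_surviving / N
    lik_class = lik_surviving / n_hypotheses
    numerator = lik_class * prior_class
    denominator = numerator + (1.0 - prior_class) * lik_outside
    return numerator / denominator


for n in (10, 1_000, 100_000, 10_000_000):
    print(n, posterior_of_class(0.8, n, lik_surviving=1.0, lik_outside=1e-3))
```

Once N is large enough that the outside term dominates the denominator, the posterior falls off roughly like 1/N, and it only stays near 1 if nothing outside M can explain the data at all.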
        So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we’d have $2^{32 k}$ hypotheses if our function fits used $k$ $32$-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as $N$ goes to infinity.
        So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that having lots of parameters is bad, because it means we’re spreading our prior around exponentially many hypotheses.… if we have the sort of prior that says all the hypotheses are about equally likely.
        But that’s an insane prior to have! We only have $1.0$ worth of probability to go around, and there’s an infinite number of different hypotheses. Which is why you’re supposed to assign prior based on K-complexity, or at least something that doesn’t go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don’t do that.
        In summary: badly normalised priors behave badly
        SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don’t line up one-to-one with hypotheses.
        It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
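A toy version of this counting argument, under my own assumptions: a two-parameter model f(x) = a * b * x with both parameters on a small integer grid, so that a hypothesis is just the slope a * b. The names are hypothetical, and the degeneracy here is tiny rather than exponential, but it shows a uniform prior over parameters turning into a non-uniform prior over hypotheses.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def induced_hypothesis_prior(values):
    """Push a uniform prior over parameter pairs (a, b) forward onto the
    functions f(x) = a * b * x, which depend on (a, b) only through a * b."""
    counts = Counter(a * b for a, b in product(values, repeat=2))
    total = sum(counts.values())
    return {slope: Fraction(count, total) for slope, count in counts.items()}


prior = induced_hypothesis_prior(range(-2, 3))
for slope in sorted(prior):
    print(slope, prior[slope])
```

The zero function is reachable from every pair with a = 0 or b = 0, so it collects far more prior mass than it would under a uniform prior over the distinct functions.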
        
        So our prior over hypotheses in that case is actually somewhat well-behaved in that it can end up normalised properly when we take $N \to \infty$ . That is a basic requirement a sane prior needs to have, so we’re at least not completely shooting ourselves in the foot anymore. But that still doesn’t show why this prior, that neural networks sort of^[1] implicitly have, is actually good. Just that it’s no longer obviously wrong in this specific way.
        Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?
        I dunno. SLT doesn’t say. It just tells us how the parameter prior to hypothesis prior conversion ratio works, and in the process shows us that neural networks priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought at least.
        That’s all though. It doesn’t tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.
        How to make this story tighter?
        If people aim to make further headway on the question of why some function fits generalise somewhat and others don’t, beyond: ‘Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn’t actively bad’, then I’d suggest a starting point might be to make a different derivation for the posterior on the fits that isn’t trying to reason about $p (M)$ defined as the probability that one of the function fits is ‘true’ in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a $150$ billion parameter transformer to internet data, we don’t expect going in that any of these $2^{16 \times 150 \times 10^{9}}$ parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of $M$ , which the SLT derivation of the posterior and most other derivations of this sort I’ve seen seem to implicitly make, we basically have $p (M) \approx 0$ going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like $M = \text{‘one of these models might get < 1.1 average loss on holdout data sets’}$ .
        SLT in three sentences
        ‘You thought your choice of prior was broken because it’s not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also here’s a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part’s not really finished’.
        SLT in one sentence
        ‘Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.’
        ^[1] Sorta, kind of, arguably. There’s some stuff left to work out here. For example vanilla SLT doesn’t even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid matching this up by comparing equivalence over all possible inputs by checking which parameter settings give the same hidden representations over the training data, not just the same outputs.
        Alexander Gietelink Oldenziel 26 Apr 2024 19:53 UTC
        6 points
        2
        Parent
        I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
        
        EDIT: I have now changed my mind about this, not least because of Lucius’s influence. I currently think Bushnaq’s padding argument suggests that the essence of SLT is that the uniform prior on codes is equivalent to the Solomonoff prior through overparameterized and degenerate codes; SLT is a way to quantitatively study this phenomenon, especially for continuous models.
        
        The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.
        
        To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
        
        The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
        
        I don’t have the time to recap this story here.
        What links here?
        Alexander Gietelink Oldenziel's comment on Alexander Gietelink Oldenziel’s Shortform by Alexander Gietelink Oldenziel (15 May 2024 20:12 UTC; 2 points)
        mattmacdermott 26 Apr 2024 21:16 UTC
        6 points
        3
        Parent
        Lucius-Alexander SLT dialogue?
        Lucius Bushnaq 27 Apr 2024 7:07 UTC
        5 points
        0
        Parent
        I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
        I don’t think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised right if the parameter-function map were one-to-one.
        It’s a kind of prior people like to use a lot, but that doesn’t make it a sane choice.
        A well-normalised prior for a regular model probably doesn’t look very continuous or differentiable in this setting, I’d guess.
        To be sure—generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
        The generic symmetries are not what I’m talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.
        The issue of the true distribution not being contained in the model is called ‘unrealizability’ in Bayesian statistics. It is dealt with in Watanabe’s second ‘green’ book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
        I know it ‘deals with’ unrealizability in this sense, that’s not what I meant.
        
        I’m not talking about the problem of characterising the posterior right when the true model is unrealizable. I’m talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class.
        
        But looking at the green book, I see it’s actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I’m going to have to translate more of this into Bayes to know what I think of it.
        
        tailcalled 25 Apr 2024 19:11 UTC
        2 points
        0
        Parent
        The RLCT = $λ$ first-order term for in-distribution generalization error and also Bayesian learning (technically the ‘Bayesian free energy’). This justifies the name of ‘learning coefficient’ for lambda. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions.
        Link(s) to your favorite proof(s)?
        Also, do these match up with empirical results?
        Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won’t be going into it but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks or any kind of hierarchical structure is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about Laplace approximation.
        I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don’t really need SLT to inoculate me against that. I’d mainly be interested if SLT shows something beyond that.
        it can be estimated at scale.
        As I read the empirical formulas in this paper, they’re roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network.
        But then, so that they don’t have to retrain the models from scratch, they basically take a trained model and wiggle it around using Gaussian noise while retraining it.
        This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there’s a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works.
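To make that reading concrete, here is a minimal sketch on a toy problem. It assumes the estimator has the form `n * beta * (E[L(w)] - L(w_star))`, with the expectation taken over a tempered density proportional to `exp(-n * beta * L(w))` around the trained point `w_star`. That form, the one-dimensional toy losses, and the use of exact quadrature instead of noisy retraining are all assumptions made for illustration, not something taken from the paper.

```python
import numpy as np

def learning_coefficient_estimate(loss, w_star, n, beta, grid):
    """n*beta*(E[L(w)] - L(w_star)) under the density proportional to exp(-n*beta*L(w))."""
    losses = loss(grid)
    # Subtracting the minimum loss keeps the exponentials well scaled.
    weights = np.exp(-n * beta * (losses - losses.min()))
    expected_loss = np.sum(weights * losses) / np.sum(weights)
    return n * beta * (expected_loss - loss(w_star))

grid = np.linspace(-1.0, 1.0, 200_001)  # uniform grid standing in for sampled "wiggled" models
n = 10_000                               # hypothetical dataset size
beta = 1.0 / np.log(n)                   # assumed inverse temperature

quadratic = lambda w: w ** 2   # ordinary, non-degenerate minimum
quartic = lambda w: w ** 4     # flatter, degenerate minimum

lam_quadratic = learning_coefficient_estimate(quadratic, 0.0, n, beta, grid)
lam_quartic = learning_coefficient_estimate(quartic, 0.0, n, beta, grid)
print(lam_quadratic, lam_quartic)
```

In this toy setting the quadratic minimum comes out near 1/2 and the quartic one near 1/4, so the flatter basin gets the lower coefficient: the "ensemble" around it loses less relative to the trained point.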
        I guess I’m still not super sold on it, but on reflection that’s probably partly because I don’t have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I’m not sure why I’d want/need to study it further.
        There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it’s plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔
        Alexander Gietelink Oldenziel 26 Apr 2024 19:41 UTC
        4 points
        2
        Parent
        All proofs are contained in Watanabe’s standard text, see here
        
        https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A
      - Lucius Bushnaq 25 Apr 2024 20:30 UTC
        4 points
        0
        Parent
        It’s measuring the volume of points in parameter space with loss $< ϵ$ when $ϵ$ is infinitesimal.
        This is slightly tricky because it doesn’t restrict itself to bounded parameter spaces,^[1] but you can fix it with a technicality by considering how the volume scales with $ϵ$ instead.
        In real networks trained with finite amounts of data, you care about the case where $ϵ$ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss $< ϵ$, if you can manage that.
        
        I still think SLT has some neat insights that helped me deconfuse myself about networks.
        
        For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss $< ϵ$ using just the eigenvalues of the Hessian. You can’t. At least not in general.
        ^[1] Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space.
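A toy Monte Carlo sketch of the volume-scaling picture above, under stated assumptions: the two loss functions, the bounded box [-1, 1]^2, the sample count and the range of $ϵ$ are all made up for illustration. It estimates the fraction of parameter space with loss $< ϵ$ and fits how that fraction scales with $ϵ$, for one regular minimum and one degenerate minimum whose Hessian at the origin is exactly zero.

```python
import numpy as np

def volume_fraction(loss, eps, samples):
    """Fraction of sampled parameter points whose loss is below eps."""
    return float(np.mean(loss(samples) < eps))

def scaling_exponent(loss, eps_values, samples):
    """Slope of log V(eps) against log eps, fitted by least squares."""
    vols = np.array([volume_fraction(loss, e, samples) for e in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(vols), 1)
    return slope

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))  # uniform samples from the box

regular = lambda w: w[:, 0] ** 2 + w[:, 1] ** 2   # Hessian at the origin is 2*I
singular = lambda w: (w[:, 0] * w[:, 1]) ** 2     # Hessian at the origin is zero

eps_values = np.logspace(-4, -2, 9)
reg_exp = scaling_exponent(regular, eps_values, w)
sing_exp = scaling_exponent(singular, eps_values, w)
print(reg_exp, sing_exp)
```

For the regular loss the fitted exponent should sit close to 1 (half the number of parameters). For the degenerate loss the exact volume fraction is $\sqrt{ϵ}\,(1 + \ln(1/\sqrt{ϵ}))$, so the fitted slope lands a bit below 1/2 because of the logarithmic factor. A Hessian-based estimate has nothing to work with in the second case, since every eigenvalue at the origin is zero.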
      - Algon 25 Apr 2024 18:48 UTC
        2 points
        0
        Parent
        Second most? What’s the first? Linearization of a Newtonian V(r) about the earth’s surface?
        tailcalled 25 Apr 2024 19:12 UTC
        4 points
        0
        Parent
        Yes.
Alexander Gietelink Oldenziel 25 Apr 2024 11:58 UTC
28 points
7
- Scott Garrabrant’s discovery of Logical Inductors.
I remembered hearing about the paper from a friend and thinking it couldn’t possibly be true in a non-trivial sense. To someone with even a modicum of experience in logic, a computable procedure assigning probabilities to arbitrary logical statements in a natural way is sure to hit a no-go diagonalization barrier.
Logical Inductors get around the diagonalization barrier in a very clever way. I won’t spoil how they do it here. I recommend that the interested reader watch Andrew Critch’s talk on Logical Induction.
It was the main thing that convinced me that MIRI != clowns but were doing substantial research.
The Logical Induction paper has a fairly thorough discussion of previous work. Relevant previous work to mention is de Finetti’s on betting and probability, previous work by MIRI & associates (Herreshoff, Taylor, Christiano, Yudkowsky...), the work of Shafer-Vovk on financial interpretations of probability & Shafer’s work on aggregation of experts. There is also a field which doesn’t have a clear name that studies various forms of expert aggregation. Overall, my best judgement is that nobody else was close before Garrabrant.
- The Antikythera artifact: a Hellenistic Computer.
  - You probably learned heliocentrism = good, geocentrism = bad, Copernicus-Kepler-Newton = good, epicycles = bad. But geocentric models and heliocentric models are equivalent; it’s just that Kepler & Newton’s laws are best expressed in a heliocentric frame. However, the raw observational data is actually recorded in a geocentric frame. Geocentric models stay closer to the data in some sense.
  - Epicyclic theory is now considered bad, an example of people refusing to see the light of scientific revolution. But actually, it was an enormous innovation. Using high-precision gearing, epicycles could actually be implemented on a (Hellenistic) computer, implicitly doing Fourier analysis to predict the motion of the planets. Astounding.
  - A Roman author (Pliny the Elder?) describes a similar device in the possession of Archimedes of Syracuse. It seems likely that Archimedes or one or more close contemporaries designed the artifact and that several were made in Rhodes.
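To illustrate the equivalence and the "implicit Fourier analysis" claim, here is a small sketch under simplifying assumptions: circular, coplanar orbits with made-up round-number radii and periods (not real planetary data, and not a model of the actual gearing). Seen from Earth, the planet's position is a difference of two uniformly rotating vectors, which is exactly a deferent carrying one epicycle, and a discrete Fourier transform of the geocentric path recovers those two circles.

```python
import numpy as np

# Hypothetical circular heliocentric orbits (radii in AU, periods in years).
R_EARTH, T_EARTH = 1.0, 1.0
R_PLANET, T_PLANET = 1.5, 2.0

t = np.linspace(0.0, 2.0, 512, endpoint=False)  # one common period of both orbits

# Heliocentric positions as complex numbers: one rotating vector each.
earth = R_EARTH * np.exp(2j * np.pi * t / T_EARTH)
planet = R_PLANET * np.exp(2j * np.pi * t / T_PLANET)

# Geocentric position of the planet: deferent (radius 1.5) plus epicycle (radius 1.0).
geocentric = planet - earth

# Fourier coefficients of the geocentric path; each nonzero one is a circle.
coeffs = np.fft.fft(geocentric) / len(t)
radii = np.abs(coeffs)
dominant = np.sort(np.argsort(radii)[-2:])  # indices of the two largest circles
print(dominant, radii[dominant])
```

Only two coefficients are non-negligible here, one per orbit. With real, non-circular orbits more terms appear, which is roughly where additional epicycles come in.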
Actually, since we’re on the subject of scientific discoveries
- Discovery & description of the complete Antikythera mechanism. The actual artifact that was found is just a rusty piece of bronze. Nobody knew how it worked. There were several sequential discoveries over multiple decades that eventually led to the complete solution of the mechanism. The final pieces were found just a few years ago. An astounding scientific achievement. Here is an amazing documentary on the subject:
- cousin_it 6 May 2024 8:44 UTC
  4 points
  0
  Parent
  I think Diffractor’s post shows that logical induction does hit a certain barrier, which isn’t quite diagonalization, but seems to me about as troublesome:
  
  As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion… This doesn’t seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn’t a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there’s a trader that could rack up unboundedly high value!
  - abramdemski 2 Dec 2024 16:01 UTC
    5 points
    0
    Parent
    There’s unpublished work about a slightly weaker logical induction criterion which doesn’t have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.^[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn’t count as raking in the cash. The regular LIC (we can call it “strong LIC” or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.
    The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
    Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.
    ^[1] Roughly speaking. This is not quite an adequate description of the theorem.
CronoDAS 24 Apr 2024 23:24 UTC
25 points
3
Antonie van Leeuwenhoek, known as the Father of Microbiology, made the first microscopes capable of seeing microorganisms and is credited as the person who discovered them. He kept his lensmaking techniques secret, however, and microscopes capable of the same magnification didn’t become generally available until many, many years later.
- Alexander Gietelink Oldenziel 25 Apr 2024 10:29 UTC
  25 points
  1
  Parent
  Yes, beautiful example! Van Leeuwenhoek was the one-man ASML of the 17th century. In this case, we actually have evidence of the counterfactual impact, as other lensmakers trailed van Leeuwenhoek by many decades.
  
  It’s plausible that high-precision measurement and fabrication is the key bottleneck in most technological and scientific progress; it’s difficult to oversell the importance of van Leeuwenhoek.
  Antonie van Leeuwenhoek made more than 500 optical lenses. He also created at least 25 single-lens microscopes, of differing types, of which only nine have survived. These microscopes were made of silver or copper frames, holding hand-made lenses. Those that have survived are capable of magnification up to 275 times. It is suspected that Van Leeuwenhoek possessed some microscopes that could magnify up to 500 times. Although he has been widely regarded as a dilettante or amateur, his scientific research was of remarkably high quality.^[39]
  The single-lens microscopes of Van Leeuwenhoek were relatively small devices, the largest being about 5 cm long.^[40]^[41] They are used by placing the lens very close in front of the eye. The other side of the microscope had a pin, where the sample was attached in order to stay close to the lens. There were also three screws to move the pin and the sample along three axes: one axis to change the focus, and the two other axes to navigate through the sample.
  Van Leeuwenhoek maintained throughout his life that there are aspects of microscope construction “which I only keep for myself”, in particular his most critical secret of how he made the lenses.^[42] For many years no one was able to reconstruct Van Leeuwenhoek’s design techniques, but in 1957, C. L. Stong used thin glass thread fusing instead of polishing, and successfully created some working samples of a Van Leeuwenhoek design microscope.^[43] Such a method was also discovered independently by A. Mosolov and A. Belkin at the Russian Novosibirsk State Medical Institute.^[44] In May 2021 researchers in the Netherlands published a non-destructive neutron tomography study of a Leeuwenhoek microscope.^[22] One image in particular shows a Stong/Mosolov-type spherical lens with a single short glass stem attached (Fig. 4). Such lenses are created by pulling an extremely thin glass filament, breaking the filament, and briefly fusing the filament end. The nuclear tomography article notes this lens creation method was first devised by Robert Hooke rather than Leeuwenhoek, which is ironic given Hooke’s subsequent surprise at Leeuwenhoek’s findings.
Jesse Hoogland 24 Apr 2024 4:17 UTC
23 points
4
If you’ll allow linguistics, Pāṇini was two and a half thousand years ahead of modern descriptive linguists.
cubefox 24 Apr 2024 19:26 UTC
21 points
−1
That the earth is a sphere:

Today, we have lost sight of how counter-intuitive it is to believe the earth is not flat. Its spherical shape has been discovered just once, in Athens in the fourth century BC. The earliest extant reference to it being a globe is found in Plato’s Phaedo, while Aristotle’s On the Heavens contains the first examination of the evidence. Everyone who has ever known the earth is round learnt it indirectly from Aristotle.

Thus begins “The Clash Between the Jesuits and Traditional Chinese Square-Earth Cosmology”. The article tells the dramatic story of how some Jesuits tried to establish the spherical-Earth theory in 16th century China, where it was still unknown, partly by creating an elaborate world map to gain the trust of the emperor.

They were ultimately not successful, and the spherical-Earth theory only gained influence in China when Western texts were increasingly translated into Chinese more than two thousand years after the theory was originally invented.

Which makes it a good candidate for one of the most non-obvious / counterfactual theories in history.
- Garrett Baker 24 Apr 2024 19:49 UTC
  9 points
  4
  Parent
  I find this very hard to believe. Shouldn’t Chinese merchants have figured out eventually, traveling long distances using maps, that the Earth was a sphere? I wonder whether the “scholars” of ancient China actually represented the state-of-the-art practical knowledge that the Chinese had.
  
  Nevertheless, I don’t think this is all that counterfactual. If you’re obsessed with measuring everything, and like to travel (like the Greeks), I think eventually you’ll have to discover this fact.
  - ChristianKl 25 Apr 2024 17:52 UTC
    7 points
    2
    Parent
    Merchants were a lot weaker in China than in Europe. Chinese merchants also made far fewer sea voyages due to geography.
    If a bunch of low-status merchants believed that the Earth is a sphere, it might not have influenced Chinese high-class beliefs in the same way as the beliefs of politically powerful merchants in Europe.
  - cubefox 24 Apr 2024 20:29 UTC
    5 points
    0
    Parent
    I see no reason to doubt that the article is accurate. Why would Chinese scholars completely miss the theory if it was obvious among merchants? There should in any case exist some records of it, some maps. Yet none exist. And why would it even be obvious that the Earth is a sphere from long distance travel alone?
    
    Nevertheless, I don’t think this is all that counterfactual. If you’re obsessed with measuring everything, and like to travel (like the Greeks), I think eventually you’ll have to discover this fact.
    
    I don’t think this makes sense. If the Chinese didn’t reinvent the theory in more than two thousand years, this makes it highly “counterfactual”. The longer a theory isn’t reinvented, the less obvious it must be.
    - dr_s 25 Apr 2024 7:56 UTC
      5 points
      0
      Parent
      Maybe it’s the other way around, and it’s the Chinese elite who was unusually and stubbornly conservative on this, trusting the wisdom of their ancestors over foreign devilry (would be a pretty Confucian thing to do). The Greeks realised the Earth was round from things like seeing sails appear over the horizon. Any sailing peoples thinking about this would have noticed sooner or later.
      
      Kind of a long shot, but did Polynesian people have ideas on this, for example?
      - cubefox 25 Apr 2024 11:59 UTC
        1 point
        0
        Parent
        There is a large difference between sooner and later. Highly non-obvious ideas will be discovered later, not sooner. The fact that China didn’t rediscover the theory in more than two thousand years means that the ability to sail the ocean didn’t make it obvious.
        
        Kind of a long shot, but did Polynesian people have ideas on this, for example?
        
        As far as we know, nobody did, except for early Greece. There is some uncertainty about India, but these sources are dated later and from a time when there was already some contact with Greece, so they may have learned it from them.
        dr_s 26 Apr 2024 11:38 UTC
        2 points
        0
        Parent
        Well, it’s hard to tell because most other civilizations at the required level of wealth to discover this (by which I mean both sailing and surplus enough to have people who worry about the shape of the Earth at all) could one way or another have learned it via osmosis from Greece. If you only have essentially two examples, how do you tell whether it was the one who discovered it who was unusually observant rather than the one who didn’t who was unusually blind? But it’s an interesting question; it might indeed be a relatively accidental thing which for some reason was accepted sooner than you would have expected (after all, sails disappearing could be explained by an Earth that’s merely dome-shaped; the strongest evidence for a completely spherical shape was probably the fact that lunar eclipses always feature a perfectly disc-shaped shadow, and even that requires interpreting eclipses correctly, and having enough of them in the first place).
- johnlawrenceaspden 25 Apr 2024 10:57 UTC
  4 points
  −3
  Parent
  I don’t buy this, the curvedness of the sea is obvious to sailors, e.g. you see the tops of islands long before you see the beach, and indeed to anyone who has ever swum across a bay! Inland peoples might be able to believe the world is flat, but not anyone with boats.
  - cubefox 25 Apr 2024 21:05 UTC
    2 points
    0
    Parent
    What’s more likely: You being wrong about the obviousness of the sphere Earth theory to sailors, or the entire written record (which included information from people who had extensive access to the sea) of two thousand years of Chinese history and astronomy somehow omitting the spherical Earth theory? Not to speak of other pre-Hellenistic seafaring cultures which also lack records of having discovered the sphere Earth theory.
Thomas Kwa 24 Apr 2024 2:04 UTC
20 points
4
Maybe Galois with group theory? He died in 1832, but his work was only published in 1846, upon which it kicked off the development of group theory, e.g. with Cayley’s 1854 paper defining a group. Claude writes that there was not much progress in the intervening years:
The period between Galois’ death in 1832 and the publication of his manuscripts in 1846 did see some developments in the theory of permutations and algebraic equations, which were important precursors to group theory. However, there wasn’t much direct progress on what we would now recognize as group theory.
Some notable developments in this period:
1. Cauchy’s work on permutations in the 1840s further developed the idea of permutation groups, which he had first explored in the 1820s. However, Cauchy did not develop the abstract group concept.
2. Plücker’s 1835 work on geometric transformations and his introduction of homogeneous coordinates laid some groundwork for the later application of group theory to geometry.
3. Eisenstein’s work on cyclotomy and cubic reciprocity in the 1840s involved ideas related to permutations and roots of unity, which would later be interpreted in terms of group theory.
4. Abel’s work on elliptic functions and the insolubility of the quintic equation, while published earlier, continued to be influential in this period and provided important context for Galois’ ideas.
However, none of these developments directly anticipated Galois’ fundamental insights about the structure of solutions to polynomial equations and the corresponding groups of permutations. The abstract concept of a group and the idea of studying groups in their own right, independent of their application to equations, did not really emerge until after Galois’ work became known.
So while the 1832-1846 period saw some important algebraic developments, it seems fair to say that Galois’ ideas on group theory were not significantly advanced or paralleled during this time. The relative lack of progress in these 14 years supports the view of Galois’ work as a singular and ahead-of-its-time discovery.
Carl Feynman 25 Apr 2024 1:30 UTC
17 points
6
Wegener’s theory of continental drift was decades ahead of its time. He published in the 1920s, but plate tectonics didn’t take over until the 1960s. His theory was wrong in important ways, but still.
cousin_it 24 Apr 2024 8:31 UTC
17 points
0
I sometimes had this feeling from Conway’s work; in particular, combinatorial game theory and surreal numbers feel to me closer to mathematical invention than mathematical discovery. These kinds of things are also often “leaf nodes” on the tree of knowledge, not leading to many followup discoveries, so you could say their counterfactual impact is low for that reason.

In engineering, the best example I know is vulcanization of rubber. It has had a huge impact on today’s world, but Goodyear developed it by working alone for decades, when nobody else was looking in that direction.
- Alexander Gietelink Oldenziel 24 Apr 2024 10:10 UTC
  5 points
  2
  Parent
  Not inconceivable, I would even say plausible, that surreal numbers & combinatorial game theory’s impact is still in the future.
lemonhope 24 Apr 2024 7:51 UTC
15 points
2
Pasteur had (also highly “counterfactual”) help I think! Ignaz Semmelweis worked in this maternity ward where the women & babies kept dying. The hospital had opened up some investigations over the years as to the cause of death but kept closing them with garbage explanations. He went somewhere else for a while and when he got back he noticed that the death numbers were down in his absence. Then he noticed his hands smelled like death after one of his routine autopsies and he was about to go plunge them in some poor mother! He had washed them but just with regular soap. If he put some bleach in the washwater then his hands didn’t stink. He connected the dots. He had killed hundreds of mothers & babies but wrote a book about it anyway and thereby popularized disinfection (and strongly suggested the root cause of disease).
Probably the main reason that germ theory took so long to work out is that the people with the right evidence were too guilty and ashamed to share it.
junk heap homotopy 24 Apr 2024 14:53 UTC
12 points
0
Set theory is the prototypical example I usually hear about. From Wikipedia:

Mathematical topics typically emerge and evolve through interactions among many researchers. Set theory, however, was founded by a single paper in 1874 by Georg Cantor: “On a Property of the Collection of All Real Algebraic Numbers”.
Alexander Gietelink Oldenziel 24 Apr 2024 8:31 UTC
10 points
0
An example that’s probably *not* a highly counterfactual discovery is the discovery of DNA as the inheritance particle by Watson & Crick [? Wilkins, Franklin, Gosling, Pauling...].

I had great fun reading Watson’s scientific-literary fiction the Double Helix. Watson and Crick are very clear that competitors were hot on their heels, a matter of months, a year perhaps.

EDIT: thank you nitpickers. I should have said structure of DNA, not its role as the carrier of inheritance.
- johnswentworth 24 Apr 2024 15:38 UTC
  5 points
  1
  Parent
  Nitpick: you’re talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.
- tailcalled 25 Apr 2024 6:32 UTC
  3 points
  0
  Parent
  I would say “the thing that contains the inheritance particles” rather than “the inheritance particle”. “Particulate inheritance” is a technical term within genetics and it refers to how children don’t end up precisely with the mean of their parents’ traits (blending inheritance), but rather with some noise around that mean, which particulate inheritance asserts is due to the genetic influence being separated into discrete particles with the children receiving random subsets of their parent’s genes. The significance of this is that under blending inheritance, the genetic variation between organisms within a species would be averaged away in a small number of generations, which would make evolution by natural selection ~impossible (as natural selection doesn’t work without genetic variation).
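A quick simulation sketch of that last point, with assumptions labelled: random mating, a single trait, a single two-allele locus in the particulate case, and an arbitrary population size and generation count chosen for illustration. Under blending, each child is the average of two parents, so the population variance roughly halves every generation; under particulate inheritance with the same mating scheme, it stays put.

```python
import numpy as np

rng = np.random.default_rng(0)
N, GENERATIONS = 100_000, 5  # hypothetical population size and number of generations

def random_parents(pop):
    """Pick two parents uniformly at random (with replacement) for each child."""
    mums = pop[rng.integers(0, len(pop), len(pop))]
    dads = pop[rng.integers(0, len(pop), len(pop))]
    return mums, dads

# Blending: a child's trait is the average of its parents' traits.
blend = rng.normal(0.0, 1.0, N)
blend_var = [blend.var()]
for _ in range(GENERATIONS):
    mum, dad = random_parents(blend)
    blend = (mum + dad) / 2
    blend_var.append(blend.var())

# Particulate: one locus, two alleles (0/1). Each parent passes on one of its
# two alleles at random; the trait is the number of 1-alleles carried.
genes = rng.integers(0, 2, size=(N, 2))
part_var = [genes.sum(axis=1).var()]
for _ in range(GENERATIONS):
    mum, dad = random_parents(genes)
    from_mum = mum[np.arange(N), rng.integers(0, 2, N)]
    from_dad = dad[np.arange(N), rng.integers(0, 2, N)]
    genes = np.stack([from_mum, from_dad], axis=1)
    part_var.append(genes.sum(axis=1).var())

print(blend_var[-1] / blend_var[0], part_var[-1] / part_var[0])
```

After five generations the blending population should retain only about 1/32 of its initial variance, while the particulate one keeps essentially all of it, which is the sense in which blending would starve natural selection of variation.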
francis kafka 24 Apr 2024 2:41 UTC
10 points
0
Peter J. Bowler suggests that evolution by natural selection is this in his book “Darwin Deleted”. Given that in real life there was an “eclipse of Darwinism”, he suggests that without Darwin, various non-Darwinian theories of evolution would have been developed further, and evolution by natural selection would have come rather late.
- Jesse Hoogland 9 May 2024 16:59 UTC
  7 points
  0
  Parent
  Anecdotally (I couldn’t find confirmation after a few minutes of searching), I remember hearing a claim about Darwin being particularly ahead of the curve with sexual selection & mate choice. That without Darwin it might have taken decades for biologists to come to the same realizations.
- Alexander Gietelink Oldenziel 25 Apr 2024 10:32 UTC
  3 points
  −3
  Parent
  Don’t forget Wallace!
  - francis kafka 26 Apr 2024 10:34 UTC
    3 points
    0
    Parent
    Bowler’s comment on Wallace is that his theory was not worked out to the extent that Darwin’s was, and besides I recall that he was a theistic evolutionist. Even with Wallace, there was still a plethora of non-Darwinian evolutionary theories before and after Darwin, and without the force of Darwin’s version, it’s not likely or necessary that Darwinism wins out.
    But Wallace’s version of the theory was not the same as Darwin’s, and he had very different ideas about its implications. And since Wallace conceived his theory in 1858, any equivalent to Darwin’s 1859 Origin of Species would have appeared years later.
    Also
    Natural selection, however, was by no means an inevitable expression of mid-nineteenth-century thought, and Darwin was unique in having just the right combination of interests to appreciate all of its key components. No one else, certainly not Wallace, could have articulated the idea in the same way and promoted it to the world so effectively.
    And he points out that minus Darwin, nobody would have paid as much attention to Wallace.
    The powerful case for transmutation mounted in the Origin of Species prompted everyone to take the subject seriously and begin to think more constructively about how the process might work. Without the Origin, few would have paid much attention to Wallace’s ideas (which were in many respects much less radical than Darwin’s anyway). Evolutionism would have developed more gradually in the course of the 1860s and ’70s, with Lamarckism being explored as the best available explanation of adaptive evolution. Theories in which adaptation was not seen as central to the evolutionary process would have sustained an evolutionary program that did not enquire so deeply into the actual mechanism of change, concentrating instead on reconstructing the overall history of life on earth from fossil and other evidence. Only toward the end of the century, when interest began to focus on the topic of heredity (largely as a result of social concerns), would the fragility of the non-Darwinian ideas be exposed, paving the way for the selection theory to emerge at last.
    Bowler also points out that Wallace didn’t really form the connection between both natural and artificial selection.
    - Lukas_Gloor 26 Apr 2024 12:38 UTC
      3 points
      1
      Parent
      In some of his books on evolution, Dawkins also said very similar things when commenting on Darwin vs Wallace, basically saying that there’s no comparison, Darwin had a better grasp of things, justified it better and more extensively, didn’t have muddled thinking about mechanisms, etc.
      - francis kafka 26 Apr 2024 14:27 UTC
        1 point
        0
        Parent
        I mean, to some extent, Dawkins isn’t a historian of science, presentism, yadda yadda, but from what I’ve seen he’s right here. Not that Wallace is somehow worse, given that of all the people out there he was certainly closer than the rest. That’s about it.
johnswentworth 23 Apr 2024 22:30 UTC
9 points
1
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I’ve already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you’re familiar with the history of any of these enough to say that they clearly were/weren’t very counterfactual, please leave a comment.
- Noether’s Theorem
- Mendel’s Laws of Inheritance
- Gödel’s First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
- Feynman’s path integral formulation of quantum mechanics
- Onnes’ discovery of superconductivity
- Pauling’s discovery of the alpha helix structure in proteins
- McClintock’s work on transposons
- Observation of the cosmic microwave background
- Lorenz’s work on deterministic chaos
- Prusiner’s discovery of prions
- Yamanaka factors for inducing pluripotency
- Langmuir’s adsorption isotherm (I have no idea what this is)
- Jan_Kulveit 24 Apr 2024 14:04 UTC
  17 points
  4
  Parent
  Mendel’s Laws seem counterfactual by about 30 years, based on partial re-discovery taking that much time. His experiments are technically something which someone with basic maths could have done basically any time in the last few thousand years.
  - XelaP 24 May 2025 23:01 UTC
    5 points
    2
    Parent
    I agree, but he seems to have rather low counterfactual impact. His discovery was definitely very counterfactual, but it seems like his work was only recognized around the time it would’ve been rediscovered.
  - johnswentworth 24 Apr 2024 15:30 UTC
    1 point
    0
    Parent
    I buy this argument.
- Ben 24 Apr 2024 10:03 UTC
  12 points
  17
  Parent
  I would guess that Lorenz’s work on deterministic chaos does not get many counterfactual discovery points. He noticed the chaos in his research because of his interactions with a computer doing simulations. This happened in 1961. Now, the question is, how many people were doing numerical calculations on computers in 1961? It could plausibly have been ten times as many by 1970. A hundred times as many by 1980? Those numbers are obviously made up but the direction they gesture in is my point. Chaos was a field that was made ripe for discovery by the computer. That doesn’t take anything away from Lorenz’s hard work and intelligence, but it does mean that if he had not taken the leap we can be fairly confident someone else would have. Put another way: If Lorenz is assumed to have had a high counterfactual impact, then it becomes a strange coincidence that chaos was discovered early in the history of computers.
  - johnswentworth 24 Apr 2024 15:29 UTC
    2 points
    0
    Parent
    I buy this argument.
- Alexander Gietelink Oldenziel 24 Apr 2024 8:28 UTC
  11 points
  2
  Parent
  Feynman’s path integral formulation can’t be that counterfactually large. It’s mathematically equivalent to Schwinger’s formulation and done several years earlier by Tomonaga.
  - johnswentworth 24 Apr 2024 15:28 UTC
    9 points
    2
    Parent
    I don’t buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it’s mathematically equivalent but far simpler conceptually and computationally.
    - Alexander Gietelink Oldenziel 24 Apr 2024 16:40 UTC
      3 points
      −3
      Parent
      Idk the Nobel prize committee thought it wasn’t significant enough to give out a separate prize 🤷
      
      I am not familiar enough with the particulars to have an informed opinion. My best guess is that in general statements to the effect of “yes X also made scientific contribution A but Y phrased it better” overestimate the actual scientific counterfactual impact of Y. It generically weighs how well outsiders can understand the work too much vis-à-vis specialists/insiders who have enough hands-on experience that the value-add of a simpler/neater formalism is not that high (or even a distraction).
      
      The reason Dick Feynman is so much more well-known than Schwinger and Tomonaga surely must not be entirely unrelated to the magnetic charisma of Dick Feynman.
    - XelaP 24 May 2025 23:24 UTC
      3 points
      0
      Parent
      I think they’re talking about a formulation with the same essential point having come up earlier? I’m personally not familiar with Schwinger’s formulation so cannot intelligently comment much. I’ll also note that the true significance of path integrals took a while to realize (at least going by a comment in Shankar’s Principles of Quantum Mechanics, a standard QM textbook, where the preface to the 2nd edition says something like “In the first edition I put a chapter on path integrals because I thought they were important even though most people don’t include them. Boy, they became really important. I’ve added 100 extra pages on path integrals”).
      
      However, I’ll note that Feynman diagrams are another example of a conceptual advancement that was huge. Though, it seems like the mathematical development of the perturbation series and the fundamental concept was already around. Furthermore Stueckelberg came up with something similar, but didn’t provide as good a way of mechanically translating perturbation expansion terms into diagrams, and didn’t have the path integral (this is additional evidence for counterfactualness of the path integral, if you can apparently get halfway to Feynman diagrams without coming up with path integrals). Likewise the diagrams took a while to become standard.
      
      Thus it seems likely that Feynman was pretty counterfactual here. Plausibly others who might have come up with the notation would have dismissed it, like the people who dismissed Feynman.
      
      Feynman was also famously good at this sort of conceptual insight, and so I am willing to believe that his unique abilities were actually important here.
- Garrett Baker 24 Apr 2024 18:10 UTC
  6 points
  1
  Parent
  I’ve heard an argument that Mendel was actually counter-productive to the development of genetics. That if you go and actually study peas like he did, you’ll find they don’t make perfect Punnett squares, and from the deviations you can derive recombination effects. The claim is he fudged his data a little in order to make it nicer, then this held back others from figuring out the topological structure of genotypes.
  - Jiro 24 Apr 2024 20:17 UTC
    4 points
    0
    Parent
    I’ve heard, in this context, the partial counterargument that he was using traits which are a little fuzzy around the edges (where is the boundary between round and wrinkled?) and that he didn’t have to intentionally fudge his data in order to get results that were too good, just be not completely objective in how he was determining them.
    
    Of course, this sort of thing is why we have double-blind tests in modern times.
- XelaP 24 May 2025 23:30 UTC
  5 points
  0
  Parent
  Noether’s theorem is an interesting one. The evidence was there, but it’s the sort of discovery that’s incredibly nonobvious even if you have a pile of evidence staring right at you. Perhaps Einstein would’ve gotten it. That she figured it out while working with Hilbert and Einstein on relativity suggests that the ideas that led to relativity help you think of the ideas of Noether’s Theorem. But I think it’s pretty likely she was quite counterfactual here.
- fig 8 Jun 2026 19:13 UTC
  5 points
  2
  Parent
  It is worth noting that von Neumann only discovered the second incompleteness theorem after having learned of the first from Gödel!
  
  And from what I’ve read, the second is conceptually, if not technically, a fairly immediate consequence of the first. See e.g. this section in the SEP entry on the incompleteness results. Wikipedia calls it an “extension of the first”.
  
  So I don’t see how the independent discovery of the second incompleteness theorem by von Neumann tells us much about the counterfactuality of the first.
- transhumanist_atom_understander 11 May 2024 0:40 UTC
  4 points
  1
  Parent
  Observation of the cosmic microwave background was a simultaneous discovery, according to James Peebles’ Nobel lecture. If I’m understanding this right, Bob Dicke’s group at Princeton was already looking for the CMB based on a theoretical prediction of it, and was doing experiments to detect it, with relatively primitive equipment, when the Bell Labs publication came out.
- XelaP 24 May 2025 22:50 UTC
  3 points
  0
  Parent
  Onnes’s discovery seems clearly not counterfactual. My understanding was that multiple people were quite interested in the question of what happens to the resistance when you cool something down using the new tech of Dewars (invented by Dewar) and liquefied helium. For example, Dewar himself was looking into it! Onnes was motivated by an ongoing research agenda with multiple researchers trying to do the thing he was trying. Note also that it was a very short time between when the tech to cool down enough was invented and when Onnes made his discovery.
  
  Onnes was the first to liquefy helium, but he bought the device he used (which had the novel innovation of exploiting the Joule-Thomson effect to liquefy gases) from the inventors of the device (Linde Machine, using the Hampson-Linde cycle). Onnes repeated an earlier resistance-measuring experiment, this time with mercury, and then observed the superconductivity. Both of these seem like they would’ve been done pretty soon by someone else.
  
  Surely others would’ve tried cooling a bunch more metals in the already ongoing quest to understand the resistance at cold temperatures, and then noticed the superconductivity in some of them. Mercury, lead, and niobium superconduct at low temperatures, and surely someone would’ve tried metals as obvious as mercury and lead. At the very least, observation of the superfluidity of liquid helium should’ve spurred people into cooling random stuff and seeing if anything weird happened.
- XelaP 24 May 2025 22:59 UTC
  3 points
  0
  Parent
  Langmuir’s adsorption isotherm is a little bit of statistical mechanics that, given my understanding of what you know already, I think you’d find really easy to understand. Undergrad classes derive it nowadays.
  
  If it’s counterfactual, it would have to be due to spurring some development of statistical mechanics, because after some of the basics were developed someone would’ve derived it. I think it was actually a homework problem! All you have to do is consider a two-state system (gas molecule attached to substrate/not attached), then use the grand partition function (the chemical-potential case of the partition function), then substitute for the chemical potential term the value it has for an ideal gas. You’ll then get something that tells you the fraction of the substrate that will have an attached gas molecule. A neat application is hemoglobin and myoglobin attaching oxygen gas.
  
  For a reference, see Chapter 5, pages 140-143 of Kittel’s “Thermal Physics”, a standard book on undergrad-level statistical mechanics.
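  A minimal sketch of that derivation, written with $k_B T$ spelled out. Assumptions not stated above: each site holds at most one molecule, a bound molecule has energy $\varepsilon$ relative to a free molecule at rest, and spin and other internal states are ignored.

  ```latex
  % Grand partition function of a single adsorption site (empty or occupied)
  \mathcal{Z} = 1 + \lambda\, e^{-\varepsilon / k_B T}, \qquad \lambda \equiv e^{\mu / k_B T}

  % Fraction of sites that are occupied
  f = \frac{\lambda\, e^{-\varepsilon / k_B T}}{1 + \lambda\, e^{-\varepsilon / k_B T}}
    = \frac{1}{\lambda^{-1} e^{\varepsilon / k_B T} + 1}

  % Ideal gas in equilibrium with the surface: \lambda = n / n_Q = p / (n_Q k_B T)
  f = \frac{p}{p_0 + p}, \qquad p_0 \equiv n_Q\, k_B T\, e^{\varepsilon / k_B T}
  ```

  Here $n_Q = (M k_B T / 2\pi\hbar^2)^{3/2}$ is the quantum concentration. At fixed temperature $p_0$ is a constant, so the coverage grows linearly at low pressure and saturates toward 1 at high pressure.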
- XelaP 24 May 2025 23:14 UTC
  3 points
  0
  Parent
  CMB seems not counterfactual. The discoverers did have to notice it and remain confused about how it was unexplained by problems with their equipment, and then be receptive to being told about a paper about how there might be radiation from the early universe. But the discoverers were just looking at a sensitive radio detector meant to detect radio waves reflecting off hot air balloons. Anyone who developed sensitive equipment and then tried to see faint signals would’ve noticed the mysterious noise.
  
  Given the sheer importance of radio technology, I think there’d be many instances of people developing a similarly sensitive device and noticing the noise. It surprised me to learn that already at the time there was a paper about the possibility of radiation from the early universe, which plausibly sped up discovery. Note also that some astrophysicists nearby were (independently of the first discoverers, not independently of the paper as some of the people wrote the paper) about to look for a signal in the right region with the explicit intent of looking for background radiation.
  
  So, if anything here is counterfactual, it would be Dicke and Peebles predicting the CMB. But I still don’t buy it, because even if nobody predicted it, people would’ve seen it not that long in the future. In fact before the main discovery in 1964, McKellar in 1941 observed a background appearing like a blackbody with the right temperature while observing the spectra of a star. He even guessed it had some significance.
FreakTakes 2 May 2024 20:23 UTC
7 points
0
Fun question!
IMO Edison and Shannon are both strong candidates for quite different reasons.
Edison solved a bunch of necessary problems in one go when building a working, commercializable lighting system. He did this in an area where many others had only chipped away at corners of the problem. He was not the first to the area...but I don’t think there are any strong claims that the area would have come along nearly as quickly if not for him/his team. I talk about this in-depth in a Works in Progress piece on Edison as an exceptional technical entrepreneur.
As far as Shannon goes, I’m not saying he initially published on his two major discoveries much earlier than others would have initially published...but Shannon had a sort of uncanny ability to open and largely close a sub-field all in one go. This is rare in scientific branch creation. Usually a process like this takes something like 5-10 people something like 5-20 years to do. My FreakTakes piece on the early years of molecular biology gives a sort of blow-by-blow of what this often looks like. Shannon’s excellence helped circumvent a lot of that. So IMO the thoroughness of his thinking was a huge time-saver.
Mateusz Bagiński 24 Apr 2024 12:00 UTC
7 points
−5
Maybe Hanson et al.’s Grabby aliens model? @Anders_Sandberg said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he “had all of the components [of the model] on the table” and it just didn’t occur to him that they can be composed in this way. (personal communication, so I may be misremembering some details). Although it’s less than 10 years, so...
Speaking of Hanson, prediction markets seem like a more central example. I don’t think the idea was [inconceivable in principle] 100 years ago.
ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling about “the great silence” from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still the paper was published only in 2017.
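To make the distributions-versus-point-estimates contrast concrete, here is a rough Monte Carlo sketch. Everything in it is an assumption for illustration: the log-uniform ranges in `LOG10_RANGES` are made up, not taken from the paper, and the function names are hypothetical.

```python
import math
import random
import statistics

# Hypothetical log10 ranges (low, high) for each Drake factor; illustrative only.
LOG10_RANGES = {
    "R_star": (0.0, 2.0),   # star formation rate per year
    "f_p": (-1.0, 0.0),     # fraction of stars with planets
    "n_e": (-1.0, 0.0),     # habitable planets per such system
    "f_l": (-10.0, 0.0),    # fraction of those where life arises
    "f_i": (-3.0, 0.0),     # fraction of those developing intelligence
    "f_c": (-2.0, 0.0),     # fraction of those becoming detectable
    "L": (2.0, 8.0),        # detectable lifetime in years
}


def sample_n(rng):
    """Draw one value of N with every factor sampled log-uniformly."""
    log10_n = sum(rng.uniform(lo, hi) for lo, hi in LOG10_RANGES.values())
    return 10.0 ** log10_n


def point_estimate():
    """N when each factor is fixed at the arithmetic mean of its distribution."""
    n = 1.0
    for lo, hi in LOG10_RANGES.values():
        # Mean of a log-uniform variable on [10**lo, 10**hi].
        n *= (10.0 ** hi - 10.0 ** lo) / ((hi - lo) * math.log(10.0))
    return n


def summarize(samples=100_000, seed=0):
    """Compare the point estimate with the sampled distribution of N."""
    rng = random.Random(seed)
    draws = [sample_n(rng) for _ in range(samples)]
    return {
        "point_estimate": point_estimate(),
        "median": statistics.median(draws),
        "p_alone": sum(d < 1.0 for d in draws) / samples,  # share of draws with N < 1
    }
```

With these made-up ranges, multiplying the mean values gives a galaxy with thousands of civilizations, while the majority of sampled draws have fewer than one. That is the qualitative shape of the argument; the actual numbers depend entirely on the ranges chosen.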
- Alexander Gietelink Oldenziel 25 Apr 2024 9:56 UTC
  17 points
  0
  Parent
  Here’s a document called “Upper and lower bounds for Alien Civilizations and Expansion Rate” I wrote in 2016. Hanson et al.’s Grabby Aliens paper was submitted in 2021.
  The draft is very rough. Claude summarizes it thusly:
  The document presents a probabilistic model to estimate upper and lower bounds for the number of alien civilizations and their expansion rates in the universe. It shares some similarities with Robin Hanson’s “Grabby Aliens” model, as both attempt to estimate the prevalence and expansion of alien civilizations, considering the idea of expansive civilizations that colonize resources in their vicinity.
  However, there are notable differences. Hanson’s model focuses on civilizations expanding at the highest possible speed and the implications of not observing their visible “bubbles,” while this document’s model allows for varying expansion rates and provides estimates without making strong claims about their observable absence. Hanson’s model also considers the idea of a “Great Filter,” which this document does not explicitly discuss.
  Despite these differences, the document implicitly contains the central insight of Hanson’s model – that the expansive nature of spacefaring civilizations and the lack of observable evidence for their existence imply that intelligent life is sparse and far away. The document’s conclusions suggest relatively low numbers of spacefaring civilizations in the Milky Way (fewer than 20) and the Local Group (up to one million), consistent with the idea that intelligent life is rare and distant.
  The document’s model assumes that alien civilizations will become spacefaring and expansive, occupying increasing volumes of space over time and preventing new civilizations from forming in those regions. This aligns with the “grabby” nature of aliens in Hanson’s model. Although the document does not explicitly discuss the implications of not observing “grabby” aliens, its low estimates for the number of civilizations implicitly support the idea that intelligent life is sparse and far away.
  The draft was never finished as I felt the result wasn’t significant enough. To be clear, the Hanson-Martin-McCarter-Paulson paper contains more detailed models and much more refined statistical analysis. I didn’t pursue these ideas further.
  I wasn’t part of the rationality/EA/LW community. Nobody I talked to was interested in these questions.
  Let this be a lesson for young people: Don’t assume. Publish! Publish in journals. Publish on LessWrong. Make something public even if it’s not in a journal!
- ChrisHibbert 25 Apr 2024 3:22 UTC
  6 points
  0
  Parent
  The Iowa Election Markets were roughly contemporaneous with Hanson’s work. They are often co-credited.
transhumanist_atom_understander 29 Apr 2024 1:43 UTC
6 points
0
Green fluorescent protein (GFP). A curiosity-driven marine biology project (how do jellyfish produce light?), that was later adapted into an important and widely used tool in cell biology. You splice the GFP gene onto another gene, and you’ve effectively got a fluorescent tag so you can see where the protein product is in the cell.

Jellyfish luminescence wasn’t exactly a hot field; I don’t know of any near-independent discoveries of GFP. However, when people were looking for protein markers visible under a microscope, multiple labs tried GFP simultaneously, so it was determined by that point. If GFP hadn’t been discovered, would they have done marine biology as a subtask, or just used their next best option?

Fun fact: The guy who discovered GFP was living near Nagasaki when it was bombed. So we can consider the hypothetical where he was visiting the city that day.
Jordan Taylor 28 Apr 2024 8:02 UTC
6 points
1
Special relativity is not such a good example here when compared to general relativity, which was much further ahead of its time. See, for example, this article: https://bigthink.com/starts-with-a-bang/science-einstein-never-existed/
Regarding special relativity, Einstein himself said:^[1]
There is no doubt, that the special theory of relativity, if we regard its development in retrospect, was ripe for discovery in 1905. Lorentz had already recognized that the transformations named after him are essential for the analysis of Maxwell’s equations, and Poincaré deepened this insight still further. Concerning myself, I knew only Lorentz’s important work of 1895 [...] but not Lorentz’s later work, nor the consecutive investigations by Poincaré. In this sense my work of 1905 was independent. [..] The new feature of it was the realization of the fact that the bearing of the Lorentz transformation transcended its connection with Maxwell’s equations and was concerned with the nature of space and time in general. A further new result was that the “Lorentz invariance” is a general condition for any physical theory.
As for general relativity, the ideas and the mathematics required (Riemannian Geometry) were much more obscure and further afield. The only people who came close, Nordstrom and Hilbert, arguably did so because they were directly influenced by Einstein’s ongoing work on general relativity (not just special relativity).
https://www.quora.com/Without-Einstein-would-general-relativity-be-discovered-by-now
1. ^
  https://en.m.wikipedia.org/wiki/Relativity_priority_dispute
Jonas Hallgren 24 Apr 2024 14:06 UTC
6 points
6
The Buddha with dependent origination. I think it says somewhere that most of the stuff in Buddhism was from before the Buddha’s time. These are things such as breath-based practices and loving kindness, among others. He had one revelation that basically made the entire enlightenment thing, which is called dependent origination.*

*At least according to my meditation teacher, I believe him since he was a neuroscientist and astrophysics masters at Berkeley before he left for India though so he’s got some pretty good epistemics.

It basically states that any system is only true based on another system being true. It has some really cool parallels to Gödel’s Incompleteness Theorem but on a metaphysical level. Emptiness of emptiness and stuff. (On a side note I can recommend TMI + Seeing That Frees if you want to experience some radical shit there.)
- Valdes 5 Jun 2024 7:46 UTC
  5 points
  0
  Parent
  For anyone wondering, TMI almost certainly stands for “The Mind Illuminated”, a book by John Yates, Matthew Immergut, and Jeremy Graves. Full title: The Mind Illuminated: A Complete Meditation Guide Integrating Buddhist Wisdom and Brain Science for Greater Mindfulness
- yanni kyriacos 19 May 2024 11:28 UTC
  1 point
  0
  Parent
  Hi Jonas! Would you mind saying more about TMI + Seeing That Frees? Thanks!
  - Jonas Hallgren 19 May 2024 12:14 UTC
    1 point
    0
    Parent
    Sure! Anything more specific that you want to know about? Practice advice or more theory?
    - yanni kyriacos 19 May 2024 21:35 UTC
      1 point
      0
      Parent
      Thanks :) Uh, good question. Making some good links? Have you done much nondual practice? I highly recommend Loch Kelly :)
foodforthought 11 Oct 2025 17:45 UTC
5 points
0
My immediate thought is McClintock’s transposable elements. AFAICT, this has only been mentioned by AI-generated lists in this thread, so to fill in a bit more for anyone who doesn’t know the story: in the 1940s, McClintock observed genetic and cytological evidence from crosses of corn plants, which she argued could best be explained by assuming certain genetic elements routinely change their position in the genetic map, often breaking other genes when they insert, and restoring those genes again when they excise. For context, the discovery that genes had fixed positions on linear genetic maps that were collinear with chromosomes was still relatively new (1913), and the field of genetics was largely consumed by the job of determining these maps. Her interpretation was therefore very much against the current, and it was mostly dismissed and derided. But she was right. It wasn’t until molecular biology confirmed their existence in the 60s-70s that transposable elements (“jumping genes”) became widely accepted. She got the Nobel Prize for her discovery over four decades after she made it.

I take it the reason for asking for such case studies is that singular discoveries can be exceptionally impactful, so it would be good to enrich for them. Therefore it’s of interest to ask what happened to McClintock in the intervening decades. My understanding is that she was able to continue her work the entire time, despite the skepticism of the field, due entirely to the Carnegie Institution. The Carnegie Institution created a permanent position at Cold Spring Harbor Lab specifically for her, freeing her from teaching and administrative obligations, but more importantly, shielding her from the need for peer acceptance of her ideas (peer-reviewed grants, peer-reviewed papers). Importantly they backed her permanently and unconditionally, so that she was completely free to pursue whatever drove her curiosity, regardless of anyone else’s opinion, even theirs.
This highlights the huge impact a private benefactor (individual or institution) can have by backing individual innovators. The trick is how to figure out who is worth backing. It’s only impactful if one ignores or even actively anti-correlates with the usual metrics that academia rewards; but some or even most marginalized mavericks are in fact crackpots, so anticorrelating isn’t enough. One has to be confident in positively judging people or ideas to be worthwhile, without relying on evaluations by leaders and experts.
Niclas Kupper 24 Apr 2024 11:18 UTC
5 points
0
Grothendieck seems to have been an extremely singular researcher; various of his discoveries would likely have been significantly delayed without him. His work on sheaves is mind-bending the first time you see it and was seemingly ahead of its time.
- Alexander Gietelink Oldenziel 25 Apr 2024 10:06 UTC
  6 points
  2
  Parent
  Here are some reflections I wrote on the work of Grothendieck and relations with his contemporaries & predecessors.
  Take it with a grain of salt: it is probably too deflationary of Grothendieck’s work, pushing back on mythical narratives common in certain mathematical circles where Grothendieck is held to be a Christ-like figure. Nevertheless, it would probably not be an exaggeration to say that Grothendieck’s purely scientific contributions [as opposed to real-life consequences] were comparable to those of Einstein.
EniScien 5 Feb 2026 19:43 UTC
4 points
−1
I am surprised that nobody wrote about it, on LessWrong of all places, but… Bayes’ theorem. I remember the story of how it was lying in a drawer and was found only after Bayes’ death.

Per Grok:

Thomas Bayes developed the core idea in the 1740s (published posthumously in 1763), framing inverse probability to update beliefs given evidence. Pierre-Simon Laplace independently rediscovered and significantly extended it starting in 1774, giving it much of its modern form and broad applications—without apparently knowing of Bayes’ work. This is a classic case of independent rediscovery, but with a key caveat: the gap was substantial (roughly 30+ years from Bayes’ work to Laplace’s publication).
- johnswentworth 5 Feb 2026 21:25 UTC
  3 points
  1
  Parent
  In this case, I’m mildly skeptical, because probability before Laplace bore a lot less resemblance to today’s probability IIUC (though I have not personally read source texts, so don’t update too hard on my understanding). Bayes did discover the theorem, but I don’t know if he conceptually thought of it like we do today or would have used it like we do today; I view that as largely coming from Laplace. On the flip side, that means Laplace’s work on probability theory was maybe highly counterfactual.
martinkunev 9 May 2024 4:52 UTC
4 points
0
I have previously used special relativity as an example of the opposite. It seems to me that the Michelson-Morley experiment laid the groundwork and all alternatives were more or less rejected by the time special relativity was formulated. This could be hindsight bias though.
If Nobel prizes are any indicator, then the photoelectric effect is probably more counterfactually impactful than special relativity.
AnthonyC 29 Apr 2024 14:35 UTC
4 points
0
I think it’s worth noting that small delays in discovering new things would, in aggregate, be very impactful. On average, how far apart are the duplicate discoveries? If we pushed all the important discoveries back a couple of years by eliminating whoever was in fact historically first, then the result is a world that is perpetually several years behind our own in everything. This world is plausibly 5-10% poorer for centuries, maybe more if a few key hard steps have longer delays, or if the most critical delays happened a long time ago and were measured in decades or centuries instead.
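A back-of-the-envelope check on that 5-10% figure, assuming steady exponential growth at 2% per year (the growth rate is my assumption, not something measured here) and a hypothetical helper name:

```python
def shortfall(growth_rate, delay_years):
    """Fraction by which a world delayed by `delay_years` trails ours,
    assuming both grow at the same constant annual `growth_rate`."""
    return 1.0 - (1.0 + growth_rate) ** (-delay_years)


for years in (3, 5):
    # Prints "3 0.058" and then "5 0.094".
    print(years, round(shortfall(0.02, years), 3))
```

So under that assumption a delay of three to five years across the board corresponds to a world roughly 6-9% poorer at any given date.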
XelaP 20 Feb 2026 4:31 UTC
3 points
0
Piotr Wozniak, the creator of spaced repetition software (aka, the smart way to remember stuff) SuperMemo (henceforth SM) claims (and I believe him) that he was the one that really got spaced repetition going (for example, the famous graph of the serrated decay curve is allegedly actually due to him, not Ebbinghaus, who didn’t look at spaced reviews).
Though the effect was seen in earlier research, he popularized it and also added his nifty algorithm + software. Here’s a quote:
In 1984, my reasoning about memory was based on two simple intuitions that probably all students have:
- if we review something twice, we remember it better. That’s pretty obvious, isn’t it? If we review it 3 times, we probably remember it even better
- if we remember a set of notes, they will gradually disappear from memory, i.e. not all at once. This is easy to observe in life. Memories have different lifetimes
These two intuitions should make everyone wonder: how fast and how many notes we lose and when we should review next?
To this day, I am amazed that very few people ever bothered to measure that “optimum interval”. When I measured it myself, I was sure I would find more accurate results in books on psychology. I did not. See: Why spaced repetition research kept failing?
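A toy sketch of what an “optimum interval” could mean, to make the two intuitions concrete. This is not Wozniak’s algorithm: it assumes an exponential forgetting curve, a fixed target retention, and a memory “stability” that multiplies by a constant on each review. All names and constants are hypothetical.

```python
import math


def optimum_interval(stability_days, target_retention=0.9):
    """Days until recall probability exp(-t / stability) falls to the target."""
    return -stability_days * math.log(target_retention)


def review_schedule(initial_stability=1.0, growth=2.0, reviews=6,
                    target_retention=0.9):
    """Successive review intervals, assuming each review multiplies stability."""
    intervals = []
    stability = initial_stability
    for _ in range(reviews):
        intervals.append(optimum_interval(stability, target_retention))
        stability *= growth  # intuition 1: each review makes the memory last longer
    return intervals
```

Under these assumptions the intervals grow geometrically, which is the expanding schedule spaced repetition software is known for. Measuring the actual decay and growth parameters per person and per item is the part the quote says nobody had bothered to do.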
Impact:
It looks like basically all spaced repetition software traces back to SM. For example, Anki (surveys say 30-80% of med students use it, and AnkiMobile for iOS is as of writing the 4th top paid app on the US Apple App Store) started with SM’s algorithm, and it’s possible that Duolingo’s and Quizlet’s use was caused by the greater awareness SM created or by trying to beat competitors (but this is just speculation).
Counterfactuality:
SM started in 1985 and followed an exponential growth curve since. I think the main enablers were computers^[1] and a nerd disgruntled with school. I don’t think the psych research was that important here, Wozniak seems to have done the experiments before knowing the literature. From a brief check on the history of personal computing, it sounds like normal people had them since 1970, idk how common or good enough. This is some evidence against counterfactuality, but I think it’s dwarfed by the arguments in the next paragraph.
Looking forwards, **it seems plausible there’d simply not have been a Wozniak**. Most people working on education focus on schools and they either work in academia (and then do not apply the research) or work in schools (which still refuse to use good research or famous well replicated results). Maybe Duolingo and Quizlet^[2] (both around 2010) would’ve done it but it also sounds like they’re still inferior at it. In an alternate universe… we might just not have spaced repetition software at all. The most plausible alternate candidate I’ve heard of^[3] is Math Academy (see their book-length google doc about their methods), as they seem to have put similarly^[4] smart thought into edu-software drawing on good research. They started in 2016, 31 years after SuperMemo.
My takeaways:
Look for applications, and try to optimize. You can sometimes push small effects far enough to unlock new capabilities, if you’re smart about it. Do experiments. Originally it was just some intuitions about how memory worked, but experiments and optimization let you beat self help books (and the latter lets you beat academia).
Control and scale are your friends:
In this case, control via personalized curve fitting, and scale via having computers do it (tens of thousands of cards isn’t even unusual, though it sounds like early on only partial records were kept by the database).
I’m reminded of an article about a paper that demonstrated nuclear fission and claimed fears of big bombs were unfounded because the reaction was subcritical (and iirc the Manhattan project was already underway) - what changed was control (e.g. neutron reflectors, combining originally separated cores) and scale (“just” get more uranium).
Lastly, you’re more likely to find something where there’s massive civilizational inadequacy (like education), but you probably knew that already.
1. ^
  Quote: > ZX Spectrum 8-bit microcomputer. SuperMemo could not be implemented on ZX Spectrum as the computer lacked disk storage. All programs and data had to be loaded in from a cassette tape. In an overnight simulation, on Feb 22, 1986, I figured out that the buildup of knowledge in spaced repetition is nearly linear, which stands against a popular intuition that backlogs must keep increasing
2. ^
  Maybe Khan Academy? idk if they even use spaced rep.
3. ^
  I haven’t actually searched, I just happened to get linked to Math Academy’s stuff somehow.
4. ^
  Read: competent nerd vibes.
XelaP 5 Feb 2026 12:19 UTC
3 points
0
Ebbinghaus’s work on memory, maybe? For some reason it looks like nobody had plotted memory decay curves despite the experimental apparatus consisting only of yourself, flashcards, a metronome, and either a strong work ethic or a masochistic desire to memorize nonsense as if trapped in a satire of education. He discovered some of the early famous results but more importantly was relatively early in doing empiricism in psychology (and like the first to do so for memory?). Wikipedia states:

With very few works published on memory in the previous two millennia, Ebbinghaus’s works spurred memory research in the United States in the 1890s, with 32 papers published in 1894 alone.

But also, the fact that this was the 1890s makes me think it may not have been that long before someone found it anyways. But also also, the world wars could’ve delayed it in this alternate timeline. So, maybe?
XelaP 20 Jul 2025 22:39 UTC
3 points
0
Possible example: Laennec’s invention of the stethoscope in 1816. Of course we would’ve come up with it eventually. But note that Laennec got his inspiration from kids playing with sticks and from his prudishness about putting his ear to a woman’s chest.

Consider that people have been using sound in diagnosis for millennia. But even something as simple as tapping a finger on another (to e.g. feel and hear the liquid in e.g. your lungs (which you don’t want)) was introduced in the mid 1700s by Auenbrugger (though some medieval guy had it too? Not going to count it since it seemed to not be advanced further) and the method also influenced Laennec. Auenbrugger was inspired by his father’s wine business: you tap the barrel to see how much fluid is in it!

So, consider: anyone ‘could have’ come up with either of these for… literal millennia? But they didn’t? And the main inspiration was stuff most medical practitioners weren’t looking at? Note that Laennec had some experience in flute making that helped him make his stethoscopes.

Lastly, [Corvisart](https://en.wikipedia.org/wiki/Jean-Nicolas_Corvisart) appears to have helped keep the percussion technique of Auenbrugger alive. Laennec learned of percussion from Corvisart’s translation of Auenbrugger, and Corvisart expanded on his findings of how to use the sound info. This isn’t a fundamental discovery, but it looks like he did have significant impact.
Thane Ruthenis 6 Mar 2025 18:56 UTC
3 points
−2
I’ve tried pointing Deep Research at this, doing a three-turn search. Here’s the result, including a GPT-4.5 summary at the end.
- Repeat mentions^[1]: Einstein, Mendel, Ignaz Semmelweis’ antiseptic handwashing, Wegener’s continental drift, Cantor’s set theory, Emmy Noether, McClintock’s transposons.
- New candidates: Charles Babbage’s analytical engine, Barry Marshall and Robin Warren on Helicobacter pylori causing ulcers, Ada Lovelace’s idea of a general-purpose computer, Lev Vygotsky’s sociocultural cognition theory, Joseph Altman’s adult neurogenesis, Dan Shechtman’s quasicrystals.
- Purported common threads: Unconventional background, polymathic skillsets, working in relative isolation.
Hasn’t really been insightful for me, but dropping it here in case it’d be useful for someone else.
1. ^
  As in, those already mentioned by people in this post’s answers.
- XelaP 20 Jul 2025 22:21 UTC
  5 points
  0
  Parent
  Semmelweis, Lister, and Pasteur are great examples: early adopters of germ theory and related ideas like sanitation and antiseptics, disbelieved by everyone around them. But you can’t say that they didn’t have impact because of the disbelief: Pasteur was definitely influenced by Lister and Semmelweis, and Pasteur really got purposefully made vaccines down (whereas with smallpox we lucked out with cowpox happening to already exist). So unlike others whose ideas were strange enough to be rejected (which gives good evidence of counterfactual discovery, e.g. Mendel, whose work was only rediscovered around the time its content was being figured out again), they managed to create huge counterfactual impact.
  
  So I guess, if you can’t convince most people, at least manage to convince a handful of early adopters well positioned to reap the rewards of your ideas?
Shmi 26 Apr 2024 16:48 UTC
−2 points
8
First, your non-standard use of the term “counterfactual” is jarring, though, as I understand, it is somewhat normalized in your circles. “Counterfactual”, unlike “factual”, means something that could have happened, given your limited knowledge of the world, but did not. What you probably mean is “completely unexpected”, “surprising” or something similar. I suspect you have gotten this feedback before.
Sticking with physics. Galilean relativity was completely against the Aristotelian grain. More recently, the singularity theorems of Penrose and Hawking unexpectedly showed that black holes are not just a mathematical artifact, but a generic feature of the world. A whole slew of discoveries, experimental and theoretical, in quantum mechanics were almost all against the grain. Probably the simplest and yet the hardest to conceptualize was Bell’s theorem.
Not my field, but in economics, Adam Smith’s discovery of what Scott Alexander later named Moloch was a complete surprise, as I understand it.
- kave 26 Apr 2024 18:11 UTC
  10 points
  6
  Parent
  “What you probably mean is “completely unexpected”, “surprising” or something similar”
  I think it means the more specific “a discovery that if it counterfactually hadn’t happened, wouldn’t have happened another way for a long time”. I think this is roughly the “counterfactual” in “counterfactual impact”, but I agree not the more widespread one.
  It would be great to have a single word for this that was clearer.
  - kave 26 Apr 2024 23:45 UTC
    2 points
    0
    Parent
    Maybe “counterfactually robust” is an OK phrase?

Templarrr 24 Apr 2024 8:18 UTC
17 points
2
“Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.”
Gemini is telling you a popular urban-legend-level understanding of what happened. The story of penicillin being created as a random event, “by mistake”, has at most a tangential connection to reality. But it is a great story, so it spread like wildfire.
In most cases, when we read “nobody investigated”, it actually means “nobody had succeeded yet, so they weren’t in a hurry to make it known”, which isn’t a very informative data point. No one ever succeeds, until they do. And in this case it’s not even that: the antibiotic properties of some molds had been known and applied for centuries before that (well, obviously, before germ theory they weren’t known as “antibiotic”, just that they helped...). The great work of Fleming and later scientists was about finding a particularly effective type of mold, extracting the exact effective chemical, and finding a way to produce it at scale.
Wei Dai 25 Apr 2024 3:17 UTC
11 points
1
Even if someone made a discovery decades earlier than it otherwise would have been, the long term consequences of that may be small or unpredictable. If your goal is to “achieve high counterfactual impact in your own research” (presumably predictably positive ones) you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I’m a bit confused why you’re asking people to think in the direction outlined in the OP.
Niclas Kupper 24 Apr 2024 11:19 UTC
11 points
0
It would be interesting for people to post current research that they think has some small chance of outputting highly singular results!
niplav 24 Apr 2024 9:28 UTC
9 points
0
I think the Diesel engine would’ve taken 10 years $_{60\%}$ or 20 years $_{45\%}$ longer to be invented: from the Wikipedia article it sounds like it was fairly unintuitive to the people at the time.
MiguelDev 24 Apr 2024 6:52 UTC
9 points
0
“But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: “singular” discoveries, i.e. discoveries which nobody else was anywhere close to figuring out.”

This idea reminds me of the concepts in this post: Focus on the places where you feel shocked everyone’s dropping the ball.
NaiveTortoise 24 Apr 2024 2:21 UTC
4 points
1
Gemini may just be wrong about the mold claim. According to Wikipedia, Ernest Duchesne was curing guinea pigs of typhoid in 1897.
ClareChiaraVincent 14 May 2024 17:47 UTC
3 points
0
I don’t know for sure about Pasteur (not my specialty) but from reading some primary sources from around the end of the spontaneous generation debate (Tyndall I think, can’t quite remember!) I was struck by how much effort it took. I think it was just a lot harder to get from “first idea” to “compelling empirical results” than might immediately be clear!
Review Bot 28 Apr 2024 8:06 UTC
1 point
0
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
Johannes C. Mayer 26 Apr 2024 20:52 UTC
1 point
0
A few adjacent thoughts:
- Haskell is powerful in the sense that when your program compiles, you get the program that you actually want with a much higher probability than in most other languages. Many stupid mistakes that are runtime errors in other languages are now compile-time errors. Why is almost nobody using Haskell?
- Why is there basically no widely used homoiconic language, i.e. a language in which you can use the language itself directly to manipulate programs written in the language?
Here we have some technologies that are basically ready to use (Haskell or Clojure), but people decide to mostly not use them. And by people, I mean professional programmers and companies that make software.
- Why did nobody invent Rust earlier, by which I mean a system-level programming language that prevents you from making really dumb mistakes by having the computer check whether you made them?
- Why did it take like 40 years to get a LaTeX replacement, even though LaTeX is terrible in very obvious ways?
These things have in common that there is a big engineering challenge. It feels like maybe this explains it, together with the fact that the people who would benefit from these technologies were in a position where the cost of creating them would have exceeded the benefit they expected from them.

For Haskell and Clojure we can also consider this point. Certainly, these two technologies have their flaws and could be improved. But then again we would have a massive engineering challenge.
- Radford Neal 27 Apr 2024 17:00 UTC
  1 point
  0
  Parent
  “Why is there basically no widely used homoiconic language”
  Well, there’s Lisp, in its many variants. And there’s R. Probably several others.
  The thing is, while homoiconicity can be useful, it’s not close to being a determinant of how useful the language is in practice. As evidence, I’d point out that probably 90% of R users don’t realize that it’s homoiconic.
  - Johannes C. Mayer 28 Apr 2024 17:02 UTC
    1 point
    0
    Parent
    I am also not sure how useful it is, but I would be very careful with saying that R programmers not using it is strong evidence that it is not that useful. Basically, that was a bit the point I wanted to make with the original comment. Homoiconicity might be hard to learn and use compared to learning a for loop in Python. That might be the reason that people don’t learn it: they don’t understand how it could be useful. Probably most R users have never even heard of homoiconicity. And if they had, they would say “Well, I don’t know how this is useful”. But again, that does not mean that it is not useful.
    
    Probably many people at least vaguely know the concept of a pure function. But probably most don’t actually use it in situations where it would be advantageous to use pure functions because they can’t identify these situations.
    
    They probably don’t even know the basic arguments for why one would care about making functions pure, because they’ve never heard them. By your line of argument, we would now be able to conclude that pure functions are clearly not very useful in practice. Which I think is, at minimum, an overstatement. Clearly, they can be useful. My current model says that they are actually very useful.
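
For readers who have not met the term, here is a minimal Python sketch of the pure/impure distinction being discussed. It is an illustration added for clarity, not something from the comment itself; the names `running_total`, `add_to_total` and `add` are made up for the example.

```python
# Impure: the result depends on, and modifies, state outside the function.
running_total = 0


def add_to_total(x):
    global running_total
    running_total += x
    return running_total


# Pure: the result depends only on the arguments, and nothing else changes.
def add(total, x):
    return total + x


# Same argument, different answers, because hidden state changed in between.
first_call = add_to_total(5)
second_call = add_to_total(5)

# Same arguments, same answer, every time.
pure_first = add(0, 5)
pure_second = add(0, 5)
```

The impure version is harder to test and reason about because each call's result depends on the history of earlier calls, which is one of the usual arguments for preferring pure functions where they fit.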
    
    [Edit:] Also R is not homoiconic lol. At least not in a strong sense like Lisp. At least that’s what this guy on GitHub says. Also, I would guess this is correct from remembering how R looks, and looking at a few code samples now. In Lisp your program is a bunch of lists. In R it is not. What is the data structure instance that is equivalent to this expression: %sumx2y2% <- function(e1, e2) {e1 ^ 2 + e2 ^ 2}?
    - Radford Neal 28 Apr 2024 19:08 UTC
      2 points
      0
      Parent
      R is definitely homoiconic. For your example (putting the %sumx2y2% in backquotes to make it syntactically valid), we can examine it like this:
      > x <- quote(`%sumx2y2%` <- function(e1, e2) {e1 ^ 2 + e2 ^ 2})
      > x
      `%sumx2y2%` <- function(e1, e2) {
          e1^2 + e2^2
      }
      > typeof(x)
      [1] "language"
      > x[[1]]
      `<-`
      > x[[2]]
      `%sumx2y2%`
      > x[[3]]
      function(e1, e2) {
          e1^2 + e2^2
      }
      > typeof(x[[3]])
      [1] "language"
      > x[[3]][[1]]
      `function`
      > x[[3]][[2]]
      $e1
      
      $e2
      
      > x[[3]][[3]]
      {
          e1^2 + e2^2
      }
      And so forth. And of course you can construct that expression bit by bit if you like as well. And if you like, you can construct such expressions and use them just as data structures, never evaluating them, though this would be a bit of a strange thing to do. The only difference from Lisp is that R has a variety of composite data types, including “language”, whereas Lisp just has S-expressions and atoms.
      - Johannes C. Mayer 29 Apr 2024 20:05 UTC
        3 points
        0
        Parent
        Ok, I was confused before. I think Homoiconicity is sort of several things. Here are some examples:
        
        - In basically any programming language L, you can have a program A that writes a file containing valid L source code, which is then run by A.
        - In some sense, Python is homoiconic, because you can have a string and then exec it. Before you exec (or in between execs) you can manipulate the string with normal string manipulation.
        - In R you have the quote operator, which allows you to take in code and return an object that represents this code and that can be manipulated.
        - In Lisp, when you write an S-expression, the same S-expression can be interpreted as a program or a list. It is actually always a (possibly nested) list. If we interpret the list as a program, we say that the first element in the list is the symbol of the function, and the remaining entries in the list are the arguments to the function.
        
        Although I can’t put my finger on it exactly, to me it feels like homoiconicity increases the further down this list of examples we go.
        
        The basic idea though seems to always be that we have a program that can manipulate the representation of another program. This is actually more general than homoiconicity, as we could have a Python program manipulating Haskell code for example. It seems that the further we go down the list, the easier it gets to do this kind of program manipulation.
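
To make the middle of that spectrum concrete, here is a minimal Python sketch. It is not from the thread; the toy `square_sum` function and all variable names are made up for illustration. It contrasts rewriting code as a plain string before `exec` with rewriting it as a syntax tree through the standard-library `ast` module, which is roughly closer in spirit to what `quote` gives you in R.

```python
import ast

# Toy source code held as a plain string (hypothetical example function).
source = "def square_sum(e1, e2):\n    return e1 ** 2 + e2 ** 2\n"

# 1) String-level manipulation: edit the text, then exec it.
cubed_source = source.replace("** 2", "** 3")
string_ns = {}
exec(cubed_source, string_ns)
string_result = string_ns["square_sum"](2, 3)


# 2) Tree-level manipulation: parse into an AST, rewrite nodes, compile.
class SquareToCube(ast.NodeTransformer):
    """Rewrite every `x ** 2` into `x ** 3`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        is_square = (
            isinstance(node.op, ast.Pow)
            and isinstance(node.right, ast.Constant)
            and node.right.value == 2
        )
        if is_square:
            node.right = ast.Constant(value=3)
        return node


tree = ast.parse(source)
tree = SquareToCube().visit(tree)
ast.fix_missing_locations(tree)  # newly created nodes need position info

tree_ns = {}
exec(compile(tree, filename="<ast>", mode="exec"), tree_ns)
tree_result = tree_ns["square_sum"](2, 3)

# The tree can also be inspected as data, a bit like x[[1]] in the R session.
top_level_kind = type(tree.body[0]).__name__
```

Both routes produce the same rewritten function here, but the string version only works because the text happens to contain `** 2` in exactly that spelling, while the tree version operates on structure. In Lisp the second route needs no separate parsing step, since the program already is the list.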
segfault 25 Apr 2024 4:16 UTC
0 points
0
Could you define what you mean here by counterfactual impact?

My knowledge of the word counterfactual comes mainly from the blockchain world, where we use it in the form of “a person could do x at any time, and we wouldn’t be able to stop them, therefore x is counterfactually already true or has counterfactually already occurred”
- ChristianKl 26 Apr 2024 13:40 UTC
  6 points
  1
  Parent
  Counterfactual means that if something had not happened, something else would have happened instead. It’s a key concept in Judea Pearl’s work on causality.
- segfault 3 Oct 2024 4:02 UTC
  −15 points
  0
  Parent
  Hmmmm.
