Arguing against consistency itself. “I was trying to be consistent when I was younger, but now I’m more wise than that.”
Ian Televan
I’m not sure whether the explanation at the end was right, but this is a very powerful technique nonetheless. I observed a similar problem many times, but couldn’t quite put my finger on it.
(Of course I don’t know how the authors actually came up with the hypothesis and I could be wrong, and the conclusions seem very plausible anyway, but..) The study seems to be susceptible to stopping bias.
If the correlation was very strong right away, they could’ve said “Parental grief directly correlates with reproductive potential, Q.E.D!”
It wasn’t, but they found a group resembling early hunter-gatherers; with the conclusion “Parental grief directly correlates with reproductive potential from back then, Q.E.D!”
If this hadn’t worked out either, and the correlation had peaked for some values in the middle, they could’ve said “Parental grief correlates with reproductive potential from back then, and it is also influenced by the specifics of the current society, Q.E.D!”
I tried to reason through the riddles before reading the rest, and I made the same mistake as the jester did. It is really obvious in hindsight; I thought about this concept earlier and I really thought I had understood it. Did not expect to make this mistake at all, damn.
I even invented some examples on my own, like in the programming language Python a statement like print(“Hello, World!”) is an instruction to print “Hello, World!” on the screen, but “print(\”Hello, World!\”)” is merely a string that represents the statement; it’s completely inert. (In an interactive environment it would display “print(\”Hello, World!\”)” on the screen, but still not “Hello, World!”.)
Edit: I think I understand what went wrong with my reasoning. Usually, distinguishing a statement from a representation of a statement is not difficult. To get a statement from a representation of a statement you must interpret the representation once. And this is rather easy, for example, when I’m reading these essays, I am well aware that the universe doesn’t just place these statements of truth into my mind, but instead, I’m reading what Eliezer wrote down and I must interpret it. It is always “Eliezer writes ‘X’”, and not just “X”.
But in this example, there were 2 different levels of representation. To get to the jester and the king I need to interpret the words once. But to get to the inscriptions, I must interpret the words twice. This is what went wrong. If I correctly understood the root of my mistake, then, if I were in the jester’s shoes, I wouldn’t have made this mistake. Therefore, I think, my mistake is not the same as the jester’s. Simultaneous interpretation of different levels of representation is something to be vigilant about.
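A small sketch of the "interpret once" versus "interpret twice" distinction, using the Python example from above. The helper `run_and_capture` and the names `level_1` and `level_2` are made up for illustration; the only assumption is that "interpreting" a representation means handing it to `exec` or `eval`.

```python
import contextlib
import io

def run_and_capture(source):
    """Interpret a piece of Python source once and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source)
    return buffer.getvalue()

# Level 1: a representation of a statement. Left alone, it is an inert string.
level_1 = 'print("Hello, World!")'

# Level 2: a representation of the representation (a string literal in quotes).
level_2 = repr(level_1)

# One interpretation of level 1 actually runs the statement.
once = run_and_capture(level_1)

# One interpretation of level 2 runs nothing; it is just a string expression.
inert = run_and_capture(level_2)

# Level 2 has to be interpreted twice: first back to level 1, then executed.
unwrapped = eval(level_2)
twice = run_and_capture(unwrapped)

print(repr(once), repr(inert), repr(twice))
```

Mixing up `level_1` and `level_2` is the code analogue of reading the inscriptions as if they were one level closer to the world than they are.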
Ceci n’est pas une pipe. This is not a picture of a pipe either, this is a picture of a picture of a pipe. Or is this a piece of text, saying “this is a picture of a picture of a pipe”? Or is this a piece of text, saying “This is a piece of text, saying \”this is a picture… \””… :-)
Fascinating subject indeed!
I wonder how one would need to modify this principle to take into account risk-benefit analysis. What if quickly identifying wiggins meant gaining a great benefit or avoiding great harm? Then you would still need a nice short word for them. This seems obvious; the question is only how much shorter the word would need to be.
Labels that are both short and phonetically consistent with a given language are in short supply, therefore we would predict that sometimes even unrelated things share labels, provided they occupy sufficiently different contexts such that there is no risk of confusing them. This is what we see in the case of professional jargon, for example. I also wonder whether one could actually quantify such a prediction.
If labels that are both short and phonetically consistent with a given language are really in such short supply, why aren’t they all already occupied? Why were you able to come up with a word like ‘wiggin’, that seems to be consistent with English phonetics, that doesn’t already mean something? -- This introduces the concept of phonetic redundancy in languages. It would actually be impractical to occupy all shortest syllable combinations, because it would make it impossible or require too much effort to correct errors. People in radiocommunications recognized this phenomenon and devised a number of spelling alphabets, the most commonly known being the NATO phonetic alphabet.
Fixing my predictions now, before going to investigate this issue further (I have MacKay’s book within hand’s reach and would also like to run some Monte-Carlo simulations to check the results; going to post the resolution later):
a) It seems that we ought to treat the results differently, because the second researcher in effect admits to p-hacking his results. b) But on the other hand, what if we modify the scenario slightly: suppose we get the results from both researchers 1 patient at a time. Surely we ought to update the priors by the same amount each time? And so by the time we get the 100th individual result from each researcher, the priors should be the same, even if we then find out that they had different stopping criteria.
My prediction is that argument a) turns out to be right and argument b) contains some subtle mistake.
Update: a) is just wrong and b) is right, but unsatisfying because it doesn’t address the underlying intuition which says that the stopping criterion ought to matter. I’m very glad that I decided to investigate this issue in full detail and run my own simulations instead of just accepting some general principle from either side.
MacKay presents it as a conflict between frequentism vs bayesianism and argues why frequentism is wrong. But I started out with a bayesian model and still felt that motivated stopping would have some influence. I’m going to try to articulate the best argument why the stopping criterion must matter and then explain why it fails.
First of all the scenario doesn’t describe exactly what the stopping criterion was. So I made one up: The (second) researcher treats patients and gets the results one at a time. He has some particular threshold $T$ for the probability that the treatment is >60% effective and he is going to stop and report the results the moment the probability reaches the threshold. He derives this probability by calculating a beta distribution for the data and integrating it from 0.6 to 1. (for those who are unfamiliar with the beta distribution, I recommend this excellent video by 3Blue1Brown) In this case the likelihood of seeing the data ($k$ successes among $n$ patients) given underlying probability $p$ is given by $L(p) = \binom{n}{k} p^{k} (1-p)^{n-k}$, and the probability that treatment is >60% effective is $P(p > 0.6) = \int_{0.6}^{1} L(p)\,dp \,\big/ \int_{0}^{1} L(p)\,dp$.
Now the argument: motivated stopping ensures that we don’t just get 70 successes and 30 failures. We have an additional constraint that after each of the first 99 outcomes the probability $P(p > 0.6)$ is strictly below $T$, and only after the 100th patient does it reach $T$. Surely then, we must modify $L(p)$ to reflect this constraint. And if the true probability was really >60%, then surely there are many Everett branches where the probability reaches $T$ before we ever get to the 100th patient. If it really took so long, then it must be because it’s actually less likely that the true probability is >60%.
And indeed, the likelihood of seeing 70 successes and 30 failures with such a stopping criterion is less than is initially given by $L(p)$. BUT! The constraint is independent of the probability $p$! It is purely about the order in which the outcomes appear. In other words, it changes the constant $\binom{100}{70}$, which originally indicated the total number of all different ways to order 70 positive and 30 negative instances. And this constant reduces the likelihood for every probability $p$ equally! It doesn’t reduce it more in universes where $p > 0.6$ compared to where $p \le 0.6$. This means that the shape of the original distribution stays the same, only the amplitude changes. But because we condition on seeing 70 successes and 30 failures anyway, this means that the area under the curve must be equal to 1. So we have to re-normalize the modified likelihood, and it comes out as the same beta distribution again!
Another way to think about it is that the stopping criterion is not entangled with the actual underlying probability in a given universe. There is zero mutual information between the stopping criterion and $p$. And yes, if this was not the case, if for example, the researcher had decided that he would also treat one more patient after reaching the threshold and only publish the results if this patient recovered (but not mention them in the report), then it would absolutely affect the results, because a positive outcome for the patient is more likely in universes where $p > 0.6$. But then it also wouldn’t be purely about his state of mind, we would have an additional data point.
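Here is a minimal sketch of the "only the constant changes" argument, on a toy version of the stopping rule so that it can be computed exactly rather than simulated. Everything specific in it is an assumption for illustration: a uniform prior, a threshold of 0.9, and stopping at 8 patients with 7 successes instead of 100 with 70. The function names are made up too.

```python
from math import comb, isclose

def prob_above(k, n, cutoff=0.6):
    """Posterior P(p > cutoff) after k successes in n trials, uniform prior.

    For integer parameters the incomplete beta integral equals a binomial
    tail sum, so the standard library is enough here.
    """
    return sum(comb(n + 1, j) * cutoff**j * (1 - cutoff)**(n + 1 - j)
               for j in range(k + 1))

def stop_probability(p, n_stop, k_stop, threshold=0.9):
    """Probability of stopping exactly at (n_stop, k_stop) when the true
    success rate is p and the researcher stops the first time the
    posterior P(p > 0.6) reaches the threshold."""
    alive = {0: 1.0}  # successes so far -> probability of being here, not yet stopped
    for n in range(1, n_stop + 1):
        nxt = {}
        for k, mass in alive.items():
            nxt[k + 1] = nxt.get(k + 1, 0.0) + mass * p    # next patient recovers
            nxt[k] = nxt.get(k, 0.0) + mass * (1 - p)      # next patient does not
        if n == n_stop:
            reached = prob_above(k_stop, n) >= threshold
            return nxt.get(k_stop, 0.0) if reached else 0.0
        # Paths that already reached the threshold were reported earlier; drop them.
        alive = {k: m for k, m in nxt.items() if prob_above(k, n) < threshold}
    return 0.0

# Compare with the plain binomial likelihood (no stopping rule) for several p.
for p in (0.3, 0.6, 0.9):
    with_rule = stop_probability(p, 8, 7)
    without_rule = comb(8, 7) * p**7 * (1 - p)
    print(p, with_rule / without_rule)
```

The printed ratio is the same for every `p`: the stopping rule removes some orderings of the outcomes, which rescales the likelihood by a constant and leaves the normalized posterior untouched.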
As Sam Harris points out, the illusion of free will is itself an illusion. It doesn’t actually feel like you have free will if you look closely enough. So then why are we mistaken about things when we don’t examine them closely enough? Seems like a too-open-ended question.
That seems unlikely. There is already a certain difficulty in showing that illusion of free will is an illusion. “It seems like you have free will, but actually, it doesn’t seem.”—The seeming is self-evident, so what does it mean to say that something actually doesn’t seem if it feels like it seems. As far as I understand it, it’s not like it doesn’t really seem so, but you’re mistaken about it and think that it actually seems so, and then mindfulness meditation clears up that mistake for you and you stop thinking that it seems that you have free will. Instead, you observe that seeming itself just disappears. It stops seeming that you have free will.
So now we come to your suggestion: “It seems(level 2.) like the seeming(lvl 1.) disappears, but actually, it doesn’t seem(lvl 2.) like the seeming(lvl 1.) disappears.”—but once again, the seeming(lvl 2.) is self-evident. So you’d need to come up with some extraordinary circumstances which are associated with more mental clarity to show that that seeming(lvl 2.) also disappears. But this is unlikely, because the concept of free will is already incoherent, so more mental clarity shouldn’t point you towards it.
Conservation laws or not, you ought to believe in the existence of the photon because you continue having the evidence of its existence—it’s your memory of having fired the photon! Your memory is entangled with the state of the universe, not perfectly, but still, it’s Bayesian evidence. And if your memory got erased, then indeed, you’d better stop believing that the photon exists.
If reductionism was wrong then I would expect reductionist approaches to be ineffective. Every attempt at gaining knowledge using a reductionist framework would fail to discover anything new, except by accident on very rare occasions. Or experiments would fail to replicate because the conservation of energy was routinely violated in unpredictable ways.
Of course it doesn’t work for problems where the objects in question are already fundamental and cannot be reduced any further. But that’s what I meant in the original post: reductionist frameworks would fail to produce any new insights if we were already at the fundamental level.
Care to elaborate? Also, that’s not really an exception, but a boundary—it’s exactly what you would expect if there are finitely many layers of composition i.e. the world is not like an infinite fractal.
Something felt off about this example and I think I can put my finger on it now.
My model of the world gives the event with the blue tentacle probability ~0. So when you ask me to imagine it, and I do so, it feels to me like I’m coming up with a new model to explain it, which gives a higher probability to that outcome than my current model does. This seems to be the root of the apparent contradiction: it appears that I’m violating the invariant. But I don’t think that that’s what’s actually happening. Consider this fictional exchange:
EY: Imagine that you have this particular gaussian model. Now suppose that you find yourself in a situation that is 50 SD’s away from the median. How do you explain it?
Me: Well, my hypothesis is that...
EY: Wrong! That scenario is too unlikely, if the model has something to say about it, then it must be wrong and irrational.
Me: No! You asked me to suppose this incredibly unlikely scenario, which is exactly what I did. I didn’t conclude “EY is asking me to consider something that’s too unlikely, ah, he’s trying to trick me, therefore I am not going to imagine the scenario on the grounds that it’s impossible!” because this is an impossible conclusion from inside the model.
I have limited resources, so I just don’t bother pre-computing all details of my model that are too unlikely to matter. But if this scenario actually came up in real life, I would be able to fill in the missing details retroactively. That doesn’t mean that my model assumes more than 100% total probability, because I’m already reserving a bit of probability mass for unknown unknowns. And I needn’t worry about such scenarios now, because they’re too unlikely and there are too many similarly unlikely scenarios. I just can’t be meaningfully concerned about them all.
Originally I thought of an exception where the thing that we don’t know was a constructive question, e.g. given more or less complete knowledge of materials science, how do we construct a decent bridge? But it’s an obvious limitation, no self-proclaimed reductionist would actually try to apply reductionism in such a situation.
It seems to me that you’re describing a reverse scenario: suppose we have an already constructed object, and want to figure out how it works. Can reductionism still be used? I’d still say yes.
Take an airplane, for example. Knowing relevant laws of physics and looking at just the airplane, you can’t actually predict whether it’s going to fly to New York or Chicago. You need to incorporate the pilot into the model. And the pilot is influenced by human psychology, economics, etc. So on one hand you have the airplane as a concrete physical object, and on the other hand you have the role that airplanes of that type play in human society. BUT! By looking at just the physical properties, you can still infer a great deal about how it’s used.
This too applies to money. Physical manifestations are not actually completely arbitrary—they are either valuable in themselves—hides, grain, salt etc. or they have properties which make them suitable as value tokens—relatively durable and difficult to counterfeit either through scarcity of raw materials or difficulty in manufacturing. There is not as much to say about the physical properties of money compared to airplanes, but the difference is quantitative, not qualitative.
So we’re left with questions about human society. How do humans actually use these objects? Well, it’s often impractical to apply reductionism but it’s still possible in principle. We just don’t know enough yet, or it would be computationally intractable, or it would be unethical etc. And of course, a lot has already been learned through application of reductionism to human psychology.
But is Occam’s Razor really circular? The hypothesis “there is no pattern” is strictly simpler than “there is this particular pattern”, for any value of ‘this particular’. Occam’s Razor may expect simplicity in the world, but it is not the simplest strategy itself.
Edit: I’m talking about the hypothesis itself, as a logical sequence of some kind, not about what the hypothesis asserts. It asserts maxentropy, i.e. the most complex world.
It seems that the mistake that people commit is imagining that the second scenario is a choice between 0.34*24000 = 8160 and 0.33*27000 = 8910. Yes, if that was the case, then you could imagine a utility function that is approximately linear in the region 8160 to 8910, but sufficiently concave in the region 24000 to 27000 s.t. the difference between 8160 and 8910 feels greater than between 24000 and 27000… But that’s not the actual scenario with which we are presented. We don’t actually get to see 8160 or 8910. The slopes of the utility function in the first and second scenarios are identical.
“Oh, these silly economists are back at it again, asserting that my utility function ought to be linear, lest I’m irrational. Ugh, how annoying! I have to explain again, for the n-th time, that my function actually changes the slope in such a way that my intuitions make sense. So there!” ← No, that’s not what they’re saying! If you actually think this through carefully enough, you’ll realize that there is no monotonically increasing utility function, no matter the shape, that justifies 1A > 1B and 2A < 2B simultaneously.
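A quick numerical sketch of that last claim. The payoffs for 2A and 2B come from the figures above (34% chance of 24000, 33% chance of 27000); the first scenario is not spelled out in this comment, so 1A as a certain 24000 and 1B as a 33/34 chance of 27000 are assumed here from the usual statement of the problem. The utility functions are arbitrary examples picked for illustration.

```python
from math import exp, isclose, log1p, sqrt

def expected_utility(lottery, u):
    """Expected utility of a lottery given as (probability, payoff) pairs."""
    return sum(prob * u(payoff) for prob, payoff in lottery)

# Scenario 1 (assumed from the usual statement of the problem).
gamble_1a = [(1.0, 24_000)]
gamble_1b = [(33 / 34, 27_000), (1 / 34, 0)]
# Scenario 2: the same choice, entered with only 34% probability.
gamble_2a = [(0.34, 24_000), (0.66, 0)]
gamble_2b = [(0.33, 27_000), (0.67, 0)]

# A few increasing utility functions with very different curvature.
utilities = {
    "linear": lambda x: x,
    "sqrt": sqrt,
    "log": log1p,
    "saturating": lambda x: 1 - exp(-x / 10_000),
}

def preference_gaps(u):
    """Return (EU(1A) - EU(1B), EU(2A) - EU(2B)) for utility function u."""
    gap_1 = expected_utility(gamble_1a, u) - expected_utility(gamble_1b, u)
    gap_2 = expected_utility(gamble_2a, u) - expected_utility(gamble_2b, u)
    return gap_1, gap_2

for name, u in utilities.items():
    gap_1, gap_2 = preference_gaps(u)
    print(f"{name:10s} prefers 1A: {gap_1 > 0}, prefers 2A: {gap_2 > 0}")
```

Some of these functions prefer the A gambles and some prefer the B gambles, but none of them splits across the two scenarios, because the second gap is just 0.34 times the first one. Curvature changes which side you land on, not whether the two choices agree.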
Richard Feynman once said that if you really understand something in physics you should be able to explain it to your grandmother. I believed him.
Curiously enough, there is a recording of an interview with him where he argues almost exactly the opposite, namely that he can’t explain something in sufficient detail to laypeople because of the long inferential distance.
I thought of a slightly different exception for the use of “rational”: when we talk about conclusions that someone else would draw from their experiences, which are different from ours. “It’s rational for Truman Burbank to believe that he has a normal life.”
Or if I had an extraordinary experience which I couldn’t communicate with enough fidelity to you, then it might be rational for you not to believe me. Conversely, if you had the experience and tried to tell me, I might answer with “Based only on the information that I received from you, which is possibly different from what you meant to communicate, it’s rational for me not to believe the conclusion.” There I might want to highlight the issue with fidelity of communication as a possible explanation for the discrepancy (the alternative being, for example, that the conclusion is unwarranted even if the account of the event is true and complete).
This feels very important.
Suppose that something *was* deleted. What was it? What am I failing to notice?
Maybe learning to ‘regenerate’ the knowledge that I currently possess is going to help me ‘regenerate’ the knowledge that ‘was deleted’.