For comparison: imagine some medical researchers are interested in whether a particular medicine helps with a particular medical condition, so they set up a placebo-controlled trial. A bunch of people with the medical condition all get their symptoms tested, then a coin flip assigns half of them to get pills with the medicine and the other half to get sugar pills, and the patients don’t know which kind they got. Then, some time later, they all get their symptoms tested again.
Now, imagine that I’m interested in “placebo effects”—I want to see if the ritual of taking sugar pills which you think might be medicine improves people’s health, or causes side effects, and I want to piggyback on this medical trial. I could just look at the pre vs post results for the set of people who got the sugar pills, but unfortunately this medical condition varies over time so I can’t disentangle effects of the pill-taking ritual from changes over time. I wish the study had a third “no-pill” group who (knowingly) didn’t get any treatment, in addition to the medical pill group and the inert pill group. Then I could just compare the results of the sugar pill group to the no pill group. But it doesn’t.
So I have the clever idea of getting the researchers to add a question to the tests at the end of the study, where they ask the patients whether they think they got the medicine pill or the sugar pill. That gives me a nice 2x2 design, where patients differ both in whether they got the medicine pill or the sugar pill, and separately in whether they believe they got the medicine pill or the sugar pill. So I can look separately at each of the 4 groups to see how much their condition improved, and what side effects they got. Changes that are associated with beliefs, I can claim, are based on the psychological effects of this pill taking ritual rather than the physiological effects of the substance they ingested.
This is a terrible study design. Who’s going to believe they got the real medicine? Well, people whose condition improved will tend to think they must’ve gotten the real medicine. And people who noticed physiological states like nausea or dry mouth will tend to think they’ve gotten the real medicine. This study design will say that improved condition & nausea are caused by people’s beliefs about whether they got the medicine, when in reality it’s the reverse: the beliefs are caused by these physical changes.
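Here’s a toy simulation (my own illustration, not anything from the hypothetical study) of that reverse causation: belief has zero causal effect on anything, yet splitting the sugar-pill group by belief makes belief look like it drives improvement.

```python
# Toy simulation: among sugar-pill recipients, belief about getting the real
# medicine is *caused by* noticing improvement or side effects, and has no
# causal effect on anything. Splitting by belief still makes belief look good.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                    # hypothetical sugar-pill recipients
improvement = rng.normal(size=n)               # symptom change, unrelated to any pill
nausea = rng.normal(size=n)                    # incidental physiological state
belief = (improvement + nausea + rng.normal(size=n)) > 0  # belief caused by the above

print("mean improvement | believes got medicine:   ", round(improvement[belief].mean(), 2))
print("mean improvement | believes got sugar pill: ", round(improvement[~belief].mean(), 2))
# The 'believers' show substantially more improvement even though belief did nothing.
```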
If I’m especially meddlesome, I might even tell the original researchers that they should use this 2x2 design to evaluate the original study. Instead of just comparing the outcomes for the medicine pill group and the sugar pill group, they should compare the outcomes while controlling for people’s beliefs about whether they got the medicine. That would mess up their study. It would be asking how effective the medicine is, after removing any effects that allowed patients to realize that they’d gotten the medicine (as if belief-entangled effects couldn’t be physiological effects of the substance).
It’s tricky to run studies with beliefs as a variable, because beliefs have causes, so you’re setting yourself up to have confounds. I haven’t looked that closely at this study, but here are some possibilities:
Severity: people who had covid but believed that they didn’t will tend to have had mild symptoms. So ‘severe cases have more long-term symptoms than mild/asymptomatic cases’ would look like ‘covid+belief leads to more reported long-term symptoms than covid without belief’.
Other illnesses: people who didn’t have covid but thought they did may have had some other illness like flu or pneumonia. If there’s long flu, then the long-term symptoms could be from that.
Long-term symptoms: a person who thinks that they probably just have a cold and not covid, but then is still fatigued a month later, might conclude that actually it probably was covid. So medium-to-long-term symptoms can cause belief, rather than belief causing long-term symptoms.
Testing inaccuracy: if the test that they’re using to establish the ground truth of whether a person had covid isn’t that accurate, then people who they’re counting as ‘covid but no belief’ might actually be false positives, and people who they’re counting as ‘no covid but yes belief’ might be false negatives.
Hypochondria: people who are prone to imagining that their health is worse than it actually is might mistakenly believe that they had covid (when they didn’t) and also imagine that they have long-term symptoms like fatigue or difficulty breathing. If people who did get covid have similar reported long-term symptoms, that means that the actual long-term symptoms of people who had covid are as bad as the imagined level of symptoms among people who imagined they had covid.
Denial: the reverse of hypochondria—people who say they’re fine even when they have some health symptoms might say that they didn’t have covid even though they did, and then downplay their long-term symptoms.
Trolling: if data slipped into the study from any people who find it funny to give extreme answers, they would look like hypochondriacs, claiming to have covid & long-term symptoms even if they didn’t have covid.
The first few of these possibilities are cases where facts about the world influence beliefs, and those facts also influence long-term symptoms. The last few of these possibilities are where the person’s traits influence their beliefs (or stated beliefs), and those traits also influence their reports of what long-term symptoms they’ve had.
If you wanted to independently assess the effects of getting covid and believing that you had covid, ideally (for scientific rigor unconstrained by ethics or practicality) you’d randomly assign some people to get covid or not and also randomly assign some people to believe they had covid or not (e.g. by lying to them). If you couldn’t have perfect random assignment + blinding, then you’d want to measure a whole bunch of other variables to account for them statistically. In reality, without anything like random assignment, who gets covid is maybe close enough to random for an observational study to work, especially if you control for some high-level variables like age. Beliefs about whether you had covid are heavily entangled with relevant stuff, in a way that makes it really hard to study them as an independent variable.
Is there good reason to think that this study overcomes these problems?
Chance that Omicron has a 100% or bigger transmission advantage in practice versus Delta: 65% → 70%.
The new study says 161% in vaccinated people, 266% in the boosted, 17% in the unvaccinated. If you average that out, it’s higher than 100% in the populations we care about, but it’s somewhat close. Thus I’m creeping back up a bit.
Can you show your work? My quick BOTEC (making up plausible numbers for the other inputs) came out to a bit under 100%.
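Here’s roughly the shape of one such back-of-the-envelope calculation, as a sketch: the per-group advantages are from the study quoted above, while the shares of current transmission are placeholder numbers I’m making up for illustration (not the actual inputs behind either estimate), and the answer moves a lot if you change them.

```python
# Rough sketch: weight each group's transmission advantage (from the quoted
# study) by that group's share of current transmission. The shares below are
# made-up placeholders, not anyone's actual inputs.
advantage = {"vaccinated": 1.61, "boosted": 2.66, "unvaccinated": 0.17}
share_of_transmission = {"vaccinated": 0.40, "boosted": 0.10, "unvaccinated": 0.50}

overall = sum(advantage[g] * share_of_transmission[g] for g in advantage)
print(f"overall transmission advantage ≈ {overall:.1%}")
# ≈ 99.5% with these particular shares; it shifts a lot as the shares change.
```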
Me too, except my P100 was wearing a cloth mask rather than a surgical mask.
1. Perform in a context with low standards, such that even your current skill level generates a somewhat positive reaction.
Examples:
a. be 9 years old
b. do your magic trick for a 2-year-old
c. mention to your friends that you’ve been working on learning magic, they say they want to see a trick, you tell them it’s not very good yet, they cajole you into showing them a trick
Challenges: finding these contexts, having the feedback still be sufficiently correlated with the quality of your performance, finding enough of these contexts to have repeated feedback loops
2. Perform in a context with richer feedback, which tells you how particular aspects of your performance went, rather than just giving a single overall rating of how the performance went on the whole.
Examples:
a. Talk to audience members after the show who can tell you more about their experience (“the trick got me, but I didn’t really feel much tension in the buildup to it”)
b. Perform for a more skilled magician who also has some skill at training new magicians, and get feedback from them
c. Videorecord yourself performing and watch it to study how different aspects of your performance went
3. Have some models of different subskills or aspects of performing, and some training approaches to work on different ones.
Examples:
a. Read or watch a guide to becoming a magician which breaks things down into subskills & provides training exercises
b. Think about what different subskills are involved, try to build your own models & practice the things that seem relevant
c. Pay attention to your performance as you’re performing / as you’re practicing / as you’re watching videos of yourself do magic, try to notice different subskills or moves within your performance
RE Robin Hanson’s post on the health insurance study, ‘Most of the p-values weren’t statistically significant, therefore there’s no effect’ is not reasoning that I generally find very convincing.
I’d rather have them combine all the health outcome measures into a single estimate of overall health. Then we can look at the point estimate of the difference in health between getting insurance vs. not, test whether it’s statistically significant, look at whether the effect size is too tiny to care much about or is meaningful, and look at whether all of the effect sizes in the 95% confidence interval are too tiny to care about or whether some of them are meaningful. Maybe even do a cost-benefit estimate, trying to get the units into something more like DALYs per dollar.
(IIRC, the Oregon study that Hanson mentions in his post did something kinda like this: the point estimate was that people who got insurance had better health, and while this was not statistically significant, the point estimate was large enough to care about.)
Looking at the new paper, it does do a few different analyses of effects on health outcomes, but doesn’t say much about them. They call health outcomes a secondary measure and give just these two paragraphs of results in the text of the paper:
Access to insurance had few significant effects on health in either survey (Table 5). Having measured (a) 3 parameters (direct/indirect/total) for (b) 3 ITT and one TOT effect for (c) 82 specified outcomes over 2 surveys, only 3 (0.46% of all estimated coefficients concerning health outcomes) were significant after multiple-testing adjustments. (As Table A8 shows, 55 parameters (8.38%) are significant if we do not adjust for multiple-testing.) We cannot reject the hypothesis that the distribution of p-values from these estimates is consistent with no differences (P=0.31). We also find no effect of access on our summary index of health outcomes (Table A6 and Table A7).

Care should be taken in interpreting the insignificant health effects observed. Perhaps the effect of hospital care on measured outcomes is too small to translate into health improvements that we have power to detect despite our substantial sample size (Das, Hammer et al. 2008). Moreover, confidence intervals reported in Table A6 and Table A7 suggest that medically significant effects for many outcomes cannot be ruled out. On average, the absolute value of an estimated ITT (TOT) effect for an outcome equals 11% (8.8%) the standard deviation of the outcome. Finally, given the low premiums for RSBY insurance, it would require a rather precise nearly zero estimate of health effects to rule out that government spending on freely provided insurance was not cost-effective.
So they did calculate a single overall effect on health, which they say elsewhere was “the average of z-scores for individual health outcomes” (that seems like an adequate way to do it but not the best, since it ignores the importance & noisiness of each measure, and the correlations between measures). They say that the effect on that overall health index was not statistically significant, but they don’t say anything about the effect size or confidence interval (maybe there’s something in the appendix but I can’t find anything about the index in Table A6 or A7).
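For concreteness, here’s a minimal sketch of that kind of analysis, using simulated placeholder data rather than the paper’s code or data: an “average of z-scores” index, with a point estimate, p-value, and 95% confidence interval for the insured-vs-uninsured difference.

```python
# Minimal sketch (not the paper's code) of an "average of z-scores" health index,
# with a point estimate, p-value, and 95% CI for the treatment-control gap.
# All data here is simulated placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 2000, 10                                # hypothetical sample size, number of outcomes
insured = rng.integers(0, 2, n).astype(bool)   # treatment indicator
outcomes = rng.normal(size=(n, k))             # health outcomes, oriented so higher = healthier

# Standardize each outcome against the control group, then average into one index.
ctrl = outcomes[~insured]
z = (outcomes - ctrl.mean(axis=0)) / ctrl.std(axis=0, ddof=1)
index = z.mean(axis=1)

diff = index[insured].mean() - index[~insured].mean()        # point estimate of the index difference
_, p = stats.ttest_ind(index[insured], index[~insured], equal_var=False)
se = np.sqrt(index[insured].var(ddof=1) / insured.sum()
             + index[~insured].var(ddof=1) / (~insured).sum())
print(f"effect = {diff:.3f}, p = {p:.3f}, 95% CI = ({diff - 1.96*se:.3f}, {diff + 1.96*se:.3f})")
# With real data, the interesting question is whether the whole CI lies below
# "effect sizes worth caring about", not just whether p < 0.05.
```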
This seems related to philosophy of science stuff, where updating is about pitting hypotheses against each other. In order to do that you have to locate the leading alternative hypotheses. It doesn’t work well to just pit a hypothesis against “everything else” (it’s hard to say what p(E|not-H) is, and it can change as you collect more data). You need to find data that distinguishes your hypothesis from leading alternatives. An experiment that favors Newtonian mechanics over Aristotelian mechanics won’t favor Newtonian mechanics over general relativity.
I think I’ve followed the basic argument here? Let me try a couple examples, first a toy problem and then a more realistic one.
Example 1: Dice. A person rolls some fair 20-sided dice and then tells you the highest number that appeared on any of the dice. They either rolled 1 die (and told you the number on it), or 5 dice (and told you the highest of the 5 numbers), or 6 dice (and told you the highest of the 6 numbers).
For some reason you care a lot about whether there were exactly 5 dice, so you could break this down into two hypotheses:
H1: They rolled 5 dice
H2: They rolled 1 or 6 dice
Let’s say they roll and tell you that the highest number rolled was 20. This favors 5 dice over 1 die, and to a lesser degree it favors 6 dice over 5 dice. So if you started with equal (1/3) probabilities on the 3 possibilities, you’ll update in favor of H1. Someone who also started with a 1⁄3 chance on H1, but who thought that 1 die was more likely than 6 dice, would update even more in favor of H1. And someone whose prior was that 6 dice was more likely than 1 die would update less in favor of H1, or even in the other direction if it was lopsided enough.
Relatedly, if you repeated this experiment many times and got lots of 20s, that would eventually become evidence against H1. If the 100th roll is 20, then that favors 6 dice over 5, and by that point the possibility of there being only 1 die is negligible (if the first 99 rolls were large enough) so it basically doesn’t matter that the 20 also favors 5 dice over 1. This seems like another angle on the same phenomenon, since your posterior after 99 rolls is your prior for the 100th roll (and the evidence from the first 99 rolls has made it lopsided enough so that the 20 counts as evidence against H1).
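A quick numerical check of the single-roll update, assuming fair d20s and the uniform 1/3 prior over the three possibilities:

```python
# Check the single-roll update with exact fractions.
from fractions import Fraction

def p_max_is_20(n_dice):
    # P(max of n fair d20s is 20) = 1 - (19/20)^n
    return 1 - Fraction(19, 20) ** n_dice

prior = {1: Fraction(1, 3), 5: Fraction(1, 3), 6: Fraction(1, 3)}
likelihood = {n: p_max_is_20(n) for n in prior}
evidence = sum(prior[n] * likelihood[n] for n in prior)
posterior = {n: prior[n] * likelihood[n] / evidence for n in prior}

for n in (1, 5, 6):
    print(f"{n} dice: P(max=20) = {float(likelihood[n]):.3f}, posterior = {float(posterior[n]):.3f}")
# H1 (5 dice) goes from 1/3 to about 0.42: the 20 favors 6 dice over 5 pairwise,
# but it favors 5 dice over 1 die by a lot more, so H1 gains overall.
```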
Example 2: College choice. A high school freshman hopes & expects to attend Harvard for college in a few years. One observer thinks that’s unlikely, because Harvard admissions is very selective even for very good students. Another observer thinks that’s unlikely because the student is into STEM and will probably wind up going to a more technical university like MIT; they haven’t thought much yet about choosing a college and Harvard is probably just serving as a default stand-in for a really good school.
The two observers might give the same p(Harvard), but for very different reasons. And because their models are so different, they could even update in opposite directions on the same new data. For instance, perhaps the student does really well on a math contest, and the first observer updates in favor of the student attending Harvard (that’s an impressive accomplishment, maybe they will make it past the admissions filter) while the second observer updates a bit against the student attending Harvard (yep, they’re a STEM person).
You could fit this into the “three outcomes” framing of this post, if you split “not attending Harvard” into “being rejected by Harvard” and “choosing not to attend Harvard”.
The paper is Hypnotic Disgust Makes Moral Judgments More Severe.
Additionally, many of the optimizations that lead to more wins make games more boring, which ultimately costs the entire league money.
This is true of some but not all optimizations. NFL teams punt too often on 4th down, and punting is boring; (in a large set of cases where teams have conventionally decided to punt) keeping your offense on the field to run a play increases your chances of winning and also makes the game more interesting for fans. (Teams have gradually been getting better at these decisions, over the years.)
Another complication is that various people judge the coach (or team) based on process and not just on results, using their own views about which process is best. So there’s a cost to making a decision that other people consider to be a bad decision, even if it maximizes your team’s chances of winning.
If the fans think the coach made a bad decision, they might like the team a bit less, spend less money on the team, or want the coach to be fired.
If the players think the coach made a bad decision, they might be a bit less on board with the what the team is doing or less eager to sign a contract with the team.
If the owner/GM thinks that the coach made a bad decision, or that the fans or players don’t support the coach as much, they might be a bit more likely to fire the coach.
So if we start in a situation where the fans, players, owner/GM, and coach all believe the conventional wisdom about what decisions are good ones, then the coach doesn’t necessarily have much incentive to search for unconventional approaches which are widely seen as bad ideas but actually increase the team’s chances of winning.
The Marine Exchange Facebook page has graphs of “container ships at anchor or loitering” for Los Angeles + Long Beach combined. It has been relatively flat in the 70s since mid-October, with an early November dip followed by a bounceback. The peak of 80 was on Oct 24; the current count (Nov 8) is 77.
The Port of Los Angeles has its own data (current pdf) on POLA Vessels at Anchor; it’s the “historical container vessel activity”, which you can get from this page by clicking on the picture that says “Working Container Vessels”. It shows a peak of 40ish from Oct 19-29, which then dropped to 30ish, but is back up to 40 now (Nov 8).
Port of Long Beach has this page listing the container vessels at anchor there. It currently (Nov 9) lists 43 vessels. I can’t find historical data except through internet archive; the most recent archived page shows 31 vessels as of Oct 8.
Some people like Alex Tabarrok and Zeynep Tufekci were writing on covid in a similar style to Ryan Petersen. The US government did eventually wind up adopting some of their sensible recommendations, but it’s hard to track causality since there were more people talking & longer delays before government action.
53% of virtue ethicists one-box (out of those who picked a side).
Seems plausible that it’s for kinda-FDT-like reasons, since virtue ethics is about ‘be the kind of person who’ and that’s basically what matters when other agents are modeling you. It also fits with Eliezer’s semi-joking(?) tweet “The rules say we must use consequentialism, but good people are deontologists, and virtue ethics is what actually works.”
Whereas people who give the pragmatic response to external-world skepticism seem more likely to have “join the millionaires club” reasons for one-boxing.
The survey results page also lists “Strongest correlations” with other questions. If I’m reading the tables correctly for the Newcomb’s Problem results, there were 17 groups (in the target population who gave a particular answer to one of the other survey questions) in which one-boxing was at least as common as two-boxing. In order (of one-boxers minus two-boxers):
Political philosophy: communitarianism (84 vs 67)
Semantic content: radical contextualism (most or all) (49 vs 34)
Analysis of knowledge: justified true belief (49 vs 34)
Response to external-world skepticism: pragmatic (51 vs 37)
Normative ethics: virtue ethics (112 vs 100)
Philosopher: Quine (33 vs 22)
Arguments for theism: moral (22 vs 12)
Hume: skeptic (72 vs 63)
Aim of philosophy: wisdom (96 vs 88)
Philosophical knowledge: none (12 vs 5)
Philosopher: Marx (8 vs 2)
Aim of philosophy: goodness/justice (73 vs 69)
A priori knowledge: [no] (62 vs 58)
Consciousness: panpsychism (16 vs 13)
External world: skepticism (20 vs 18)
Eating animals and animal products: omnivorism (yes and yes) (168 vs 168)
Truth: epistemic (26 vs 26)
200 people (100 for each forum)
Minor mathematical correction: in this case, 100+100<200
I suggest renaming the “Incidental anchoring” section to something else, such as “irrelevant anchors” or “transparently random anchors”, since the term “incidental anchoring” is used to refer to something else.
Also, one of the classic 1970s Kahneman & Tversky anchoring studies used an (apparently) random wheel of fortune to generate a transparently irrelevant anchor value—the one on African countries in the UN. When this came up on LW previously, it turned out that Andrew Gelman used it as an in-class demo and (said that he) generally found effects in the predicted direction (though instead of spinning a viscerally random wheel they just handed each student a piece of paper that included the sentences “We chose (by computer) a random number between 0 and 100. The number selected and assigned to you is X = ___”).
A 2008 paper found anchoring effects from these kinds of “incidental environmental anchors”, but then a replication of one of its studies with a much larger sample size found no effect (see “9. Influence of incidental anchors on judgment (Critcher & Gilovich, 2008, Study 2)”).
So that at least says something about why the people running your forecasting workshop thought this would have an effect, and provides some entry points into the published research which someone could look into in more depth, but it still leaves it surprising/confusing that there was such a large difference.
Anna & Val taught goal factoring at the first CFAR workshop (May 2012). I’m not sure if they used the term “goal factoring” at the workshop (the title on the schedule was “Microeconomics 1: How to have goals”), but that’s what they were calling it before the workshop including in passing on LW. Geoff attended the third CFAR workshop as a participant and first taught goal factoring at the fourth workshop (November 2012), which was also the first time the class was called “Goal Factoring”. Geoff was working on similar stuff before 2012, but I don’t know enough of the pre-2012 history to know if there was earlier cross-pollination between Geoff & CFAR folks.
Critch developed aversion factoring.