Someone with no personal experience of suffering should also be moved by that consideration.
That sounds like a fantastic reason for someone with that experience to post it, as occurred here, as a way to explain what it is like to others.

In fact, only the existence of suffering for some concrete individual justifies the abstract conclusion of altruism. Without that concrete level, the abstraction is hypothetical, and should not provide the same level of reason to be altruistic.
Update: we did this, I bought shares, we’ll see how it goes.
Extreme, in this context, meant far from the consensus expectation. That implies both “seen as radical” and “involving very high [consensus] confidence [against the belief].” Contra your first paragraph, I claim that this “extremeness” is valid Bayesian evidence for it being false, in the sense that you identify in your third paragraph: it has low prior odds. Given that, I agree that it would be incorrect to double-count the evidence of being extreme. But my claim was that, holding “extremeness” constant, the newness of a claim is independent reason to consider it more worthy of examination (rather than more likely), since the VoI is higher and the consensus against it is less informative. And that’s why it doesn’t create a loop in the way you suggested. So I wasn’t clear in my explanation, and thanks for trying to clarify what I meant. I hope this explains it better / refines my thinking to a point where it doesn’t have the problem you identified—but if I’m still not understanding your criticism, feel free to try again.
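As a toy numeric illustration of the asymmetry claimed above (all numbers invented for the sketch): if each independent expert evaluation of a false claim multiplies its odds by some likelihood ratio below one, then a long-scrutinized claim and a brand-new claim with the same prior odds end up in very different places, so the consensus against the new claim carries much less information:

```python
# Toy model (invented numbers): treat each independent expert evaluation
# that rejects a claim as a likelihood ratio of 1:2, i.e. it halves the odds.
def posterior_odds(prior_odds, n_evaluations, lr_per_eval=0.5):
    """Odds after n independent rejections, each with ratio lr_per_eval."""
    return prior_odds * (lr_per_eval ** n_evaluations)

prior = 0.05  # same "extremeness" (low prior odds) for both claims

old_claim = posterior_odds(prior, n_evaluations=10)  # heavily scrutinized
new_claim = posterior_odds(prior, n_evaluations=1)   # barely scrutinized

# The consensus against the new claim has moved its odds far less,
# so examining it has a higher value of information.
print(old_claim, new_claim)
```

This is only a sketch of the structure of the argument, not a model of any real consensus process; the point is that holding the prior fixed, fewer independent evaluations means a weaker update.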
Want to sell me USDC on there in exchange for paypal, so I can bet? (I’ll gladly pay a 2% “commission” for, say, $200 in USDC.)
It’s a pain to redo, but can someone add Ought embedded predictions to all of these?
https://forecast.elicit.org/binary

(Alternatively/additionally, can they all be on Metaculus?)
Relatedly and perhaps even more fundamentally, the basic discipline of thinking about a system and implementing a mathematical model or simulation to explore these topics, which drove the insights you mention. And in many ways, it’s easier to test without worrying about people gaming the system, because you can give new examples and require them to actually explore the question.
That’s fine, but choosing the question set on which to give self-motivated children instant, computer-driven feedback is the same type of question: what is it that we want the child interested in X to learn?
Concretely, my 8 year old son likes math. He’s fine with multiplication and division, but enjoys thinking about math. If I want him to be successful applying math later in life, should I start him on knot theory, pre-algebra equation solving, adding and subtracting unlike fractions, or coding in python? I see real advantages to each of these: proof-based thinking and abstraction from concrete to theoretical ideas, more abstract thinking about and manipulation of numbers, getting ahead of what he’ll need next to continue at the math he will need to learn, or giving him other tools that will expand his ability to think and apply ideas, respectively.
I’d love feedback about which of these (or which combination of these) is most likely to ensure he’s learning the things that are useful in helping him apply math in a decade, but I can’t get useful feedback without trying it on large samples over the course of decades. Or, since I don’t live in Dath Ilan, I can use my best judgement and ask others for feedback in an ad-hoc fashion.
Partly agree with your criticism of the quoted claim, but there are two things I think you should consider.
First, evaluating tests for long-term outcomes is fundamentally hard. The extent to which a 5th grade civics or math test predicts performance in policy or engineering is negligible. In fact, I would expect that the feedback from test scores, in determining what a child focuses on, has a far larger impact on a child’s trajectory than the object-level prediction allows.

Second, standardizing tests greatly reduces the cost of development, and allows larger sample sizes for validation. For either reason alone, it makes sense to use standardized tests as much as possible.
12. Netanyahu is still Israeli PM: 40%
This is the PredictIt line for him on 6⁄30, and Scott’s predicting this out to January 1. I’m guessing that he didn’t notice? Otherwise, given how many things can go wrong, it’s a rather large disagreement – those wacky Israelis have elections constantly. I’m going to sell this down to 30% even though I have system 1 intuitions he’s not going anywhere. Math is math.
I would buy at this price, probably up to 50%, but there are some wrinkles to how it gets resolved. At least 45% of the population really, really wants him as PM; the other 55% doesn’t have a favorite, but two-thirds of them are very strongly opposed to Netanyahu. If he is temporarily no longer PM due to the power-sharing agreement during the run-up to another election, but then wins, does that resolve yes, or no? (This seems remarkably plausible.)

But as a disclaimer, I’m bad (and somewhat poorly calibrated) at predicting things I’m emotionally invested in, and, well...
Also just requested on reddit: https://www.reddit.com/r/Scholar/comments/mtwl4d/chapter_k_hoskin_1996_the_awful_idea_of/
Request: “K. Hoskin (1996) The ‘awful idea of accountability’: inscribing people into the measurement of objects. In Accountability: Power, Ethos and the Technologies of Managing, R. Munro and J. Mouritsen (Eds). London, International Thomson Business Press, and references therein.”

(Cited by: Strathern, Marilyn (1997). “‘Improving ratings’: audit in the British University system”. European Review. John Wiley & Sons. 5 (3): 305–321. doi:10.1002/(SICI)1234-981X(199707)5:3<05::AID-EURO184>3.0.CO;2-4.)
See Google Books, and Worldcat (Available in many UK universities, incl. Cambridge & Oxford, and in the NYPL, at MIT, etc.)
Context: Looking for sources about the history of Goodhart’s law, esp. as “quoted”/paraphrased, seemingly by Strathern.

From Strathern’s paper: “When a measure becomes a target, it ceases to be a good measure. The more a 2.1 examination performance becomes an expectation, the poorer it becomes as a discriminator of individual performances. Hoskin describes this as ‘Goodhart’s law’, after the latter’s observation on instruments for monetary control which lead to other devices for monetary flexibility having to be invented. However, targets that seem measurable become enticing tools for improvement.”
Noting the obvious connection to Goodhart’s law—and elsewhere I’ve described the mistake of pushing to maximize easy-to-measure / cognitively available items rather than true goals.
Yeah, that’s true. I don’t recall exactly what I was thinking. Perhaps it was regarding time-weighting, and the difficulty of seeing what your score will be based on what you predict—though the Metaculus interface handles this well, modulo early closings, which screw lots of things up. Also, log-scoring is tricky when you have both continuous and binary outcomes, since they don’t give similar measures—being well calibrated for binary events isn’t “worth” as much, which seems perverse in many ways.
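To see the scale mismatch concretely (a toy sketch with illustrative numbers, not Metaculus’s actual scoring rule): a binary log score is the log of a probability, so it is at most 0, while a continuous log score is a log density, which depends on the units and can be arbitrarily large when the forecast distribution is sharp:

```python
import math

# Binary log score: even a confident, correct 80% forecast scores
# log(0.8) ≈ -0.22, and the best possible score (p = 1) is exactly 0.
binary_score = math.log(0.8)

# Continuous log score: the log density of the forecast at the outcome.
def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

sharp = normal_logpdf(0.0, mu=0.0, sigma=0.1)   # ≈ +1.38: beats any binary score
vague = normal_logpdf(0.0, mu=0.0, sigma=10.0)  # ≈ -3.22: far below a calibrated binary score

print(binary_score, sharp, vague)
```

So any scheme that averages or ranks the two question types on raw log scores won’t reward binary calibration comparably, which is the perversity noted above.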
In many cases, yes. But for some events, the “obvious” answers are not fully clear until well after the event in question takes place—elections, for example.
About 20% of Americans develop skin cancer during their lifetime, and the 5-year overall survival rate for melanoma is over 90 percent. Taking this as the mortality risk, i.e. ignoring timing and varied risk levels, it’s a 2% risk of (eventual) death.

But risk of skin cancer depends on far more than sun exposure—and the more important determinant is frequency of sunbathing below age 30. Other factors that seem to matter are skin color, skin response (how much you burn), weight, and family history of cancers.
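The back-of-the-envelope arithmetic above, made explicit (figures from the comment; deliberately ignoring timing, cancer type, and heterogeneous risk):

```python
# ~20% of Americans develop skin cancer; 5-year melanoma survival is > 90%,
# so take mortality given skin cancer as roughly 10% (a crude proxy that
# ignores timing, cancer type, and individual risk factors).
lifetime_incidence = 0.20
mortality_given_cancer = 1 - 0.90

lifetime_mortality_risk = lifetime_incidence * mortality_given_cancer
print(f"{lifetime_mortality_risk:.0%}")  # 2%
```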
Re: “get this wrong” versus “the balance should be better,” there are two different things being discussed. The first is defining individual questions via clear resolution criteria, which I think is done well, and the second is defining clear principles that provide context and inform what types of questions and resolution criteria are considered good form.

A question like “will Democrats pass H.R.2280 and receive 51 votes in the Senate” is very well defined, but super-narrow, and easily resolved “incorrectly” if the bill is incorporated into another bill, or if an adapted bill is proposed by a moderate Republican and passes instead, or passed via some other method, or if it passes but gets vetoed by Biden. But it isn’t an unclear question, and given the current way that Metaculus is run, it would probably be the best way of phrasing the question. Still, it’s a sub-par question, given the principles I mentioned. A better one would be “Will a bill such as H.R.2280 limiting or banning straw purchases of firearms be passed by the current Congress and enacted?” It’s much less well defined, but the boundaries are very different. It also uses “passed” and “enacted”, which have gray areas. At the same time, the failure modes are closer to the ones we care about near the boundary of the question. However, given the current system, this question is obviously worse—it’s harder to resolve, and it’s more likely to be ambiguous because a bill that does only some of the thing we care about could be passed, etc.
Still, I agree that the boundaries here are tricky, and I’d love to think more about how to do this better.
I haven’t said, and I don’t think, that the majority of markets and prediction sites get this wrong. I think they navigate this without a clear framework, which I think the post begins providing. And I strongly agree that there isn’t a slam-dunk, no-questions case for principles overriding rules, which the intro might have implied too strongly. I also agree with your point about the downsides of ambiguity potentially overriding the benefits of greater fidelity to the intent of a question, and brought it up in the post. Still, excessive focus on making rules on the front end, especially for longer-term questions and ones where the contours are unclear, rather than explicitly being adaptive, is not universally helpful. And clarifications that need to change the resolution criteria mid-way are due to either bad questions, or badly handled resolutions. At the same time, while there are times when avoiding ambiguity is beneficial, there are also times when explicitly addressing corner cases to make them unambiguous (“if the data is discontinued or the method is changed, the final value posted using the current method will be used”) makes the question worse, rather than better. Lastly, I agree with a general point I didn’t state explicitly: “where the spirit and letter of a question conflict, the question should be resolved based on the spirit.” I mostly didn’t make an explicit case for this because I think it’s under-specified as a claim. Instead, the more specific claims I would make are:

1) When the wording of a question seems ambiguous, the intent should be an overriding reason to choose an interpretation.
2) When the wording of a question is clear, the intent shouldn’t change the resolution.
As an aside, I find it bizarre that Economics gets put at 9 - I think a review of what gets done in top econ journals would cause you to update that number down by at least 1. (It’s not usually very bad, but it’s often mostly useless.) And I think it’s clear that lots of Econ does, in fact, have a replication crisis. (But we’ll see if that is true as some of the newer replication projects actually come out with results.)