# rossry

Karma: 449
• 11 Aug 2022 6:22 UTC
2 points

I don’t know. (As above, “When [users] tell you exactly what they think is wrong and how to fix it, they are almost always wrong.”)

A scoring rule that’s proper in linear space (as you say, “scored on how close their number is to the actual number”) would accomplish this—either for scoring point estimates or for distributions. I don’t think it’s possible to extract an expected value from a confidence interval that covers orders of magnitude, so I expect that would work less well.
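As a toy illustration of that last point (numbers are mine, not from the thread): two distributions consistent with roughly the same 90% interval spanning three orders of magnitude can have means that differ by a factor of a thousand, because the interval says nothing about the far tail.

```python
import math

# Hypothetical question whose 90% CI is [1e3, 1e6] (three orders of magnitude).

# (a) A lognormal with median 10^4.5 and 5th/95th percentiles at 1e3/1e6.
log10_mu, log10_sigma = 4.5, 1.5 / 1.645        # z_{0.95} ≈ 1.645
mu = log10_mu * math.log(10)
sigma = log10_sigma * math.log(10)
mean_a = math.exp(mu + sigma ** 2 / 2)          # lognormal mean ≈ 2.9e5

# (b) The same lognormal, but with 4% of its mass moved to a far tail at 1e10.
# The 5th/95th percentiles barely move (approximately; mass is removed
# proportionally), but the mean explodes.
mean_b = 0.96 * mean_a + 0.04 * 1e10

print(f"mean (a) ≈ {mean_a:.3g}, mean (b) ≈ {mean_b:.3g}, "
      f"ratio ≈ {mean_b / mean_a:.0f}x")
```

Both distributions are legitimate answers to “give a 90% CI of [1e3, 1e6]”, so the interval alone can’t pin down an expected value.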

• I think this argument doesn’t follow:

> There is hardly any difference between taking a life and not preventing a death. The end result is mostly the same. Thus, I should save the lives of as many humans as I can.

While “the end result is mostly the same” is natural to argue in terms of moral-consequentialist motivations, this AI only cares about [not killing humans] instrumentally. So what matters is what humans will think about [taking a life] versus [not preventing a death]. And there, there’s a huge difference!

1. Agree that causing deaths that are attributable to the AI’s actions is bad and should be avoided.

2. But if the death was not already attributable to the AI, then preventing it is instrumentally worse than not preventing it, since it risks being found out and raising the alarm (whereas doing nothing is exactly what the hypothetical evaluators are hoping to see).

3. If the world is a box for evaluation, I’d expect the evaluators to be roughly equally concerned with [AI takes agentic actions that cause people to unexpectedly not die] and [AI takes agentic actions that cause people to unexpectedly die]. Either case is a sign of misalignment (unless the AI thinks that its evaluators tried to make it a save-and-upload-people maximizer, which seems unlikely given the evidence).

4. If the world is not a box for evaluation, then [AI action causes someone to suspiciously die] is more plausibly the result of “oops it was an accident” than is [AI action causes someone to suspiciously not die]. The former is more likely to make the hostilities start, but the latter should raise suspicions faster, in terms of Bayesian evidence. So again, better not to save people from dying, if there’s any chance at all of being found out.

Thoughts? What am I missing here?

• 7 Aug 2022 20:48 UTC
1 point

Agreed on all points!

In particular, I don’t have any disagreement with the way the epistemic aggregation is being done; I just think there’s something suboptimal in the way the headline number (in this case, for a count-the-number-of-humans domain) is chosen and reported. And I worry that the median-ing leads to easily misinterpreted data.

For example, if a question asked “How many people are going to die from unaligned AI?”, and the community’s true belief was “40% to be everyone and 60% to be one person”, and that was reported as “the Metaculus community predicts 9,200 people will die from unaligned AI, 10% as many as die in fires per year”, that would...not be a helpful number at all.
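A quick check of the arithmetic in this hypothetical (assuming a world population of 8 billion): the mean, the median, and a log-space aggregate of that two-outcome belief come apart by many orders of magnitude, and the log-space aggregate is plausibly where a headline figure like ~9,200 comes from.

```python
import math

POP = 8e9  # world population, assumed for the sake of the example

# Community's true belief: 40% that everyone dies, 60% that one person dies.
outcomes = [(0.40, POP), (0.60, 1)]

mean = sum(p * x for p, x in outcomes)          # 0.4 * 8e9 + 0.6 ≈ 3.2e9
median = 1                                      # >50% of the mass sits at 1
# Aggregating in log space (roughly what median-ing log-scale forecasts does):
log_agg = math.exp(sum(p * math.log(x) for p, x in outcomes))  # ≈ 9.1e3

print(f"mean ≈ {mean:.3g}, median = {median}, log-space aggregate ≈ {log_agg:.0f}")
```

So a single reported number in the thousands manages to misrepresent a belief whose plausible outcomes are “one person” and “everyone”.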

You’re right that dates have their own nuance—whether it’s AGI or my food delivery, I care about the median arrival a lot more than the mean (but also, a lot about the tails!).

And so, in accordance with the ancient wisdom, I know that there’s something wrong here, and I don’t presume to be able to find the exact right fix. It seems most likely that there will have to be different handling for qualitatively different types of questions—a separation between “uncertainty in linear space, aggregated in linear space” (ex: Net migration to UK in 2021), “uncertainty in log space, importance in quantiles” (ex: AGI), “uncertainty in log space, importance in linear space” (ex: Monkeypox). The first two categories are already treated differently, so it seems possible for the third category to be minted as a new species of question.

Alternatively, much of the value could come from reporting means in addition to medians on every log question, so that the predictor and the consumer can each choose the numbers that they find most important to orient towards, and ignore the ones that are nonsensical. This doesn’t really solve the question of the incentives for predictors, but at least it makes the implications of their predictions explicit instead of obscured.

# Metaculus and medians

6 Aug 2022 3:34 UTC
18 points
• Not taking extrapolation far enough!

4 hours ago, your expected value of a point was $0. Within an hour, it increased to $0.2, implying a ~20% chance it pays $1 (plus some other possibilities). By midnight, extrapolated expected value will be $4.19, implying a ~100% chance to pay $1, plus ~45% chance that they’ll make good on the suggestion of paying $10 instead, plus some other possibilities...

• I’m confused why these would be described as “challenge” RCTs, and worry that the term will create broader confusion in the movement to support challenge trials for disease. In the usual clinical context, the word “challenge” in “human challenge trial” refers to the step of introducing the “challenge” of a bad thing (e.g., an infectious agent) to the subject, to see if the treatment protects them from it. I don’t know what a “challenge” trial testing the effects of veganism looks like?

(I’m generally positive on the idea of trialing more things; my confusion+comment is just restricted to the naming being proposed here.)

• Oh, yeah, I can’t vouch for / walk through the operations side (not having done it myself). I have had the misfortune of looking at ways to get a Medallion certification outside the US, and it’s not pretty (I failed).

• I don’t know.

It’s worth noting that the terms are (intentionally?) designed to be generous and to pay over the market rate. I suspect this is a feature, not a bug, and the terms are intended to be a mild subsidy to the buyer. The US has many policies that subsidize US citizens and exclude non-Americans; this one doesn’t stand out to me as being particularly unusual. (I don’t particularly endorse this policy posture, but I note that it exists.)

Brainstorming a bit, it seems plausible that the program could include non-citizen taxpayers. If it were truly open to non-taxpayers, then it would amount to a subsidy of non-residents with citizen+resident tax dollars, which the US government is mostly opposed to, as policy.

• I don’t know.

It’s worth noting that the terms are (intentionally?) designed to be generous and to pay over the market rate, and this would be harder to do if there were a $10mln/person/year cap, since most of the benefit would flow to wealthy Americans who can finance their debt investments with secured borrowing at scale.

If I had to guess, I’d note that this is pretty similar to contribution/person/yr caps on other savings methods the government subsidizes, e.g., with tax advantages -- 401(k) accounts and IRA accounts. Insofar as it’s primarily a social scheme to incentivize inflation-protected saving (and probably should not be a primary investment for most people in most conditions), it seems plausible that the cap is intended to spread the program’s benefit-per-unit-cost across a wider set of people.

# What on Earth is a Series I savings bond?

11 Dec 2021 12:18 UTC
8 points
• I’m sorry—I don’t understand how your comment responds to mine. I pointed to the fact that Omicron outcompeting Delta without being descended from Delta indicated that a successor to Omicron could perhaps not be descended from Omicron. In particular, I agree with you that Omicron will become the dominant variant almost everywhere.

One minor detail: It is implausible that Omicron’s competitive advantage is primarily derived from an increased R0 (that would give it a higher R0 than measles); rather, its observed fitness against the competition is more easily explained by some measure of immunity evasion (which won’t be measured in increased R0).
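A toy calculation of that detail (illustrative numbers only, not actual epidemiological estimates): with substantial prior immunity in the population, a variant that evades immunity can have a much higher effective reproduction number than a competitor with the same intrinsic R0.

```python
# Effective reproduction number under partial population immunity.
# All numbers below are invented for illustration.
def r_eff(r0: float, immune_frac: float, evasion: float) -> float:
    """evasion = fraction of prior immunity the variant escapes."""
    susceptible = 1 - immune_frac * (1 - evasion)
    return r0 * susceptible

# Same intrinsic R0 of 6, population 70% immune:
delta = r_eff(r0=6.0, immune_frac=0.7, evasion=0.1)    # little escape
omicron = r_eff(r0=6.0, immune_frac=0.7, evasion=0.7)  # substantial escape

print(f"R_eff: low-evasion ≈ {delta:.2f}, high-evasion ≈ {omicron:.2f}")
```

The high-evasion variant spreads roughly twice as fast here without any change in intrinsic R0, which is the sense in which observed fitness doesn’t pin down R0.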

• > Omicron will soon become the dominant variant almost everywhere, so subsequent variants will probably branch off it.

I don’t think you’re wrong, but it is worth noting that Omicron itself violated this guess; it is descended from the original strain, not any other Greek-lettered variant.

• I give the WHO kudos for picking omicron instead of nu. (Actually, I’m pretty shocked that they did something this common-sensical, and notice that I am surprised.) I spent Friday morning (= Thursday evening, US time) talking out loud with colleagues about the new nu variant and after like two attempts to clarify what the f—I actually meant, multiple people independently joked that it was so bad we should just skip nu and go to omicron.

If you’ve only ever discussed it in text, you’re underestimating how bad it is to use “nu” as an adjective in verbal conversation.

• 27 Nov 2021 1:58 UTC
12 points

> Both were sent to the hospital but it is unclear whether this was part of a standard procedure or if they were ill enough to need to go.

Testing positive was sufficient to get them sent to the hospital, and they had mandatory PCR testing every ~3 days; this is no evidence about their symptoms.

(I recently went through HK arrival quarantine—in the same hotel, no less—and researched the operating procedure runbook out of personal interest.)

• I think I was unclear. I meant that if you did correctly estimate the number of cases, you’d need many times that many courses of medicine “in the system” to make sure that no one worried about running out in their part of the system, so that no one started hoarding where they were. I estimated that about ten times as many courses as you naïvely needed would about do it.

If our standard is non-scarcity for prophylactic prescription for close contacts, then 10x the expected number of close contacts in your “part of the system”...

(To be clear, this is just a statement about hoarding/availability dynamics, not about when “things should go back to normal”.)

• Right, I agree that for the update aggregation is better than (but still lossy). And the thing that affects is the weighting in the average—so if then the s don’t matter! (which is a possible answer to your question of “how much aggregation/​disaggregation can you do?”)

But yeah if is very different from then I don’t think there’s any way around it, because the effective could be one or the other depending on what the are.

• The framing of this issue that makes the most sense to me is ” is a function of ”.

When I look at it this way, I disagree with the claim (in “Mennen’s ABC example”) that “[Bayesian updating] is not invariant when we aggregate outcomes”—I think it’s clearer to say the Bayesian updating is not well-defined when we aggregate outcomes.

Additionally, in “Interpreting Bayesian Networks”, the framing seems to make it clearer that the problem is that you used for -- but they’re not the same thing! In essence, you’re taking the sum where you should be taking the average...

With this focus on (mis)calculating , the issue seems to me more like “a common error in applying Bayesian updates”, rather than a fundamental paradox in Bayesian updating itself. I agree with the takeaway “be careful when grouping together outcomes of a variable”—because grouping exposes one to committing this error—but I’m not sure I’m seeing the thing that makes you describe it as unintuitive?
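A minimal numeric check of the sum-vs-average point (the likelihood numbers are invented, not taken from the post): when disjoint outcomes B1, B2 are aggregated into B, the likelihoods add, and the posterior computed on the aggregate equals the prior-weighted average of the individual posteriors, not their sum.

```python
# Disjoint outcomes B1, B2 aggregated into B = B1 ∪ B2. Illustrative numbers.
p_a = 0.5                                     # prior P(A)
lik = {"B1": (0.2, 0.1), "B2": (0.3, 0.1)}    # (P(Bi|A), P(Bi|~A))

def posterior(p_b_given_a, p_b_given_not_a, prior=p_a):
    num = p_b_given_a * prior
    return num / (num + p_b_given_not_a * (1 - prior))

# Correct update on the aggregate: likelihoods of disjoint outcomes add...
p_b_a = sum(v[0] for v in lik.values())       # P(B|A)  = 0.5
p_b_na = sum(v[1] for v in lik.values())      # P(B|~A) = 0.2
direct = posterior(p_b_a, p_b_na)             # ≈ 0.714

# ...which matches the P(Bi)-weighted AVERAGE of the individual posteriors:
p_bi = {k: v[0] * p_a + v[1] * (1 - p_a) for k, v in lik.items()}
weighted = sum(p_bi[k] * posterior(*lik[k]) for k in lik) / sum(p_bi.values())

print(f"direct ≈ {direct:.4f}, weighted average ≈ {weighted:.4f}")
```

Summing the individual posteriors instead (here 0.667 + 0.75 > 1) gives something that isn’t even a probability, which is the “sum where you should be taking the average” error in concrete form.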

• I would guess that having that many courses of Paxlovid “in the system” would be about an order of magnitude too low for true non-scarcity. (See: how many vaccine doses needed to be in the system before you could assume that there was going to be adequate supply anywhere you might try to look?)

• What’s the breakdown of fields by whether they have a pre-print server or not? (Which of the ones most important to human progress are in the good state?)

I’m most familiar with economics, where there’s no server, but there’s a universally-journal-respected right to publish the pre-print on your personal site, which ends up in the “it’s free if you Google for it” equilibrium in practice.