https://www.elilifland.com/. You can give me anonymous feedback here.
My forecast is based on historical data from Zillow. I explained my reasoning in the notes. The summary is that housing prices haven’t changed very much in Seattle since April 2019 (on the whole they’ve risen about 1%). On the other hand, prices in more expensive areas have stayed the same or declined slightly. I settled on a boring median of the price staying the same. Given how stable prices have been recently, I think most of the variation will come from the individual house and which neighborhood it’s in, with an outside chance of large Seattle-wide home value fluctuations.
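To give a feel for that decomposition, here's a minimal Monte Carlo sketch (the drift, tail, and noise parameters below are made up for illustration, not my actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Roughly flat city-wide trend, a small chance of a large city-level move,
# plus house/neighborhood-specific noise (all as fractions of current price).
city_drift = rng.normal(0.00, 0.02, size=n)
large_move = rng.binomial(1, 0.05, size=n) * rng.normal(0.0, 0.10, size=n)
house_noise = rng.normal(0.00, 0.05, size=n)

total_change = city_drift + large_move + house_noise
print(np.round(np.percentile(total_change, [5, 25, 50, 75, 95]), 3))
```

Most of the spread here comes from the house-level term, matching the intuition above.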
My forecast is based on:
The past trend of increasing qubit counts in quantum devices (see the rough extrapolation sketch after this list)
The current leading commercially available device having 20 qubits
Google’s plans to make Sycamore, which has 54 qubits, available
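The kind of trend extrapolation I have in mind looks roughly like this (the year/qubit pairs below are illustrative placeholders, not a vetted dataset):

```python
import numpy as np

# Hypothetical (year, qubit count) data for the leading commercially
# available device -- for illustration only.
years = np.array([2016, 2017, 2018, 2019])
qubits = np.array([5, 16, 20, 20])

# Fit an exponential trend by regressing log(qubits) on year, then extrapolate.
slope, intercept = np.polyfit(years, np.log(qubits), 1)
for year in (2020, 2021):
    print(year, round(float(np.exp(intercept + slope * year)), 1))
```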
I don’t have a background in quantum computing, so there’s a chance I’m misinterpreting the question in some way, but I learned a lot doing the research for the forecast (like that there’s a lot of controversy regarding whether quantum supremacy has been achieved yet).
Amusingly, during my research I stumbled upon this Metaculus question about when a >49 qubit quantum computer would be created which resolved ambiguously due to the issue of how well-controlled the qubits are. For the purposes of this forecast I assumed it would resolve based on the raw number of qubits, without adjusting for control.
I must admit I haven’t followed the discussions you’re referring to but if I were to spend more time forecasting this question I would look into them.
I didn’t include effects of COVID in my forecast, as it looks like the Zillow Home Value Index for Seattle has remained relatively steady since March (a 2% drop). I’m skeptical that there are likely to be large effects from COVID in the future when there hasn’t been a large effect thus far.
A few reasons I could be wrong:
Zillow data is inaccurate or incomplete, or I’m interpreting it incorrectly.
COVID affects variation of individual housing prices much more than the trend of the city as a whole.
The COVID effects will be much bigger in the next half year than in the previous one. Perhaps there will be a second wave, much worse than the market is pricing in, which produces large effects.
It looks like people can change their predictions after they initially submit them. Is this history recorded somewhere, or just the current distribution?
We do store the history. You can view it by going to https://elicit.org/binary and then searching for the question, e.g. https://elicit.org/binary?binaryQuestions.search=Will%20there%20be%20more%20than%2050. Although, as noted by Oli, we currently only display predictions that haven’t been withdrawn.
Is there an option to have people “lock in” their answer? (Maybe they can still edit/delete for a short time after they submit or before a cutoff date/time)
Not planning on supporting this on our end in the near future, but could be a cool feature down the line.
Is there a way to see in one place all the predictions I’ve submitted an answer to?
As of right now, not if you make the predictions via LW. You can view questions that you’ve submitted a prediction on via Elicit at https://elicit.org/binary?binaryQuestions.hasPredicted=true if you’re logged in, and we’re working on allowing for account linking so your LW predictions would show up in the same place.
The first version of account linking will be contacting someone at Ought then us manually running a script.
Edit: the first version of account linking is ready, email elifland@ought.org with your LW username and Elicit email and I can link them.
There’s also a Metaculus question about this:
If the user is interested in getting into the top ranks, this strategy won’t be anything like enough.
I think this isn’t true empirically for a reasonable interpretation of top ranks. For example, I’m ranked 5th on questions that have resolved in the past 3 months due to predicting on almost every question.
Looking at my track record, for questions resolved in the last 3 months, evaluated at all times, here’s how my log score looks compared to the community:
Binary questions (N=19): me: -.072 vs. community: -.045
Continuous questions (N=20): me: 2.35 vs. community: 2.33
So if anything, I’ve done a bit worse than the community overall, and am in 5th by virtue of predicting on all questions. It’s likely that the predictors significantly in front of me are that far ahead in part due to having predicted on (a) questions that have resolved recently but closed before I was active and (b) a longer portion of the lifespan for questions that were open before I became active.
Edit:
I discovered that the question set changes when I evaluate at “resolve time” and filter for the past 3 months, not sure why exactly. Numbers at resolve time:
Binary questions (N=102): me: .598 vs. community: .566
Continuous questions (N=92): me: 2.95 vs. community: 2.86
I think this weakens my case substantially, though I still think a bot that just predicts the community as soon as it becomes visible and updates every day would currently be at least top 10.
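For anyone unfamiliar with where these numbers come from, here's a rough sketch of the basic binary log-score comparison (Metaculus's actual scoring rules are more involved, and the probabilities below are made up):

```python
import numpy as np

def binary_log_score(prob_yes: float, outcome: int) -> float:
    """Natural log of the probability assigned to the realized outcome."""
    p = prob_yes if outcome == 1 else 1 - prob_yes
    return float(np.log(p))

# Hypothetical resolved questions: (my probability, community probability, outcome).
questions = [(0.90, 0.85, 1), (0.30, 0.40, 0), (0.70, 0.75, 1)]

mine = np.mean([binary_log_score(p_me, o) for p_me, _, o in questions])
crowd = np.mean([binary_log_score(p_c, o) for _, p_c, o in questions])
print(f"me: {mine:.3f} vs. community: {crowd:.3f}")
```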
Anything much worse than that, yes, people could have negative overall scores—which, if they’ve predicted on a decent number of questions, is pretty strong evidence that they really suck at forecasting
I agree that this would have some effect of making things less welcoming to newcomers, but I’m curious to what extent. I have seen plenty of people with worse Brier scores than the median continue predicting on GJO rather than getting demoralized and quitting (disclaimer: survivorship bias).
Someone who is near the top of the leaderboard is both accurate and highly experienced
I think this unfortunately isn’t true right now, and just copying the community prediction would place very highly (I’m guessing if made as soon as the community prediction appeared and updated every day, easily top 3 (edit: top 10)). See my comment below for more details.
You can look at someone’s track record in detail, but we’re also planning to roll out more ways to compare people with each other.
I’m very glad to hear this. I really enjoy Metaculus but my main gripe with it has always been (as others have pointed out) a lack of way to distinguish between quality and quantity. I’m looking forward to a more comprehensive selection of metrics to help with this!
It’s very likely that when the US intelligence community reports on August 25 on their data about the origins of COVID-19, they will conclude that it was a lab leak.
Are you open to betting on this? The GJOpen community is at 9% that the report will conclude that lab leak is more likely than not; I’m at 12%.
In particular, my actual credence in lab leak is higher (~45%) but I’m guessing the most likely outcome of the report is that it’s inconclusive, and that political pressures will play a large role in the outcome.
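For concreteness, here's the arithmetic behind why a bet at intermediate odds can look positive-EV to both sides when credences differ (the stakes, odds, and counterparty credence below are hypothetical):

```python
def bet_ev(credence: float, implied_prob: float, stake: float = 100) -> float:
    """Expected value of betting `stake` on YES at odds implying `implied_prob`,
    evaluated under your own credence in YES."""
    payout_if_yes = stake * (1 - implied_prob) / implied_prob
    return credence * payout_if_yes - (1 - credence) * stake

# If my counterparty were at ~60% and we settled on 30% implied odds:
print(bet_ev(0.60, 0.30))   # counterparty's EV betting YES (positive)
print(-bet_ev(0.12, 0.30))  # my EV taking the NO side at 12% credence (also positive)
```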
My Hypermind Arising Intelligence Forecasts and Reflections
Your prior is for discontinuities throughout the entire development of a technology, so shouldn’t your prior be for discontinuity at any point during the development of AI, rather than discontinuity at or around the specific point when AI becomes AGI? It seems this would be much lower, though we could then adjust upward based on the particulars of why we think a discontinuity is more likely at AGI.
(epistemic status: exploratory)
I think more people who are into LessWrong and in high school or college should consider trying Battlecode. It’s somewhat similar to The Darwin Game, which was pretty popular on here, and I think generally the type of people who like LessWrong will both enjoy and be good at Battlecode. (edited to add: A short description of Battlecode is that you write a bot to beat other bots at a turn-based strategy game. Each unit executes its own code, so communication/coordination is often one of the most interesting parts.)
I did it with friends for 6 years (junior year of high school through the end of undergrad), and I think it at least helped me gain legible expertise in strategizing and coding quickly, but plausibly also helped me pick up skills in these areas as well as teamwork.
If any students are interested (I believe PhD students can qualify as well, though it may not be worth their time), there are still 2-3 weeks left in this year’s game, which is plenty of time. If you’re curious to learn more about my experiences with Battlecode, see the README and postmortem here.
Feel free to comment or DM me if you have any questions.
[crossposted from EA Forum]
Reflecting a little on my shortform from a few years ago, I think I wasn’t ambitious enough in trying to actually move this forward.
I want there to be an org that does “human challenge”-style RCTs across lots of important questions that are extremely hard to get at otherwise, including (top 2 are repeated from previous shortform):
Health effects of veganism
Health effects of restricting sleep
Productivity of remote vs. in-person work
Productivity effects of blocking out focused/deep work
Edited to add: I no longer think “human challenge” is really the best way to refer to this idea (see comment that convinced me); I mean to say something like “large scale RCTs of important things on volunteers who sign up on an app to randomly try or not try an intervention.” I’m open to suggestions on succinct ways to refer to this.
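To be concrete about the core analysis such an org would run, here's a minimal sketch of random assignment plus a difference-in-means estimate on simulated data (real analyses would have to deal with attrition, non-compliance, pre-registration of endpoints, etc.):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Volunteers are randomly assigned to try (1) or not try (0) the intervention.
assigned = rng.binomial(1, 0.5, size=n)
# Fake outcome data with a true effect of +0.2, for illustration.
outcome = rng.normal(0, 1, size=n) + 0.2 * assigned

effect = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
se = np.sqrt(outcome[assigned == 1].var(ddof=1) / (assigned == 1).sum()
             + outcome[assigned == 0].var(ddof=1) / (assigned == 0).sum())
print(f"estimated effect: {effect:.3f} +/- {1.96 * se:.3f}")
```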
I’d be very excited about such an org existing. I think it could even grow to become an effective megaproject, pending further analysis on how much it could increase wisdom relative to power. But, I don’t think it’s a good personal fit for me to found given my current interests and skills.
However, I think I could plausibly provide some useful advice/help to anyone who is interested in founding a many-domain human-challenge org. If you are interested in founding such an org or know someone who might be and want my advice, let me know. (I will also be linking this shortform to some people who might be able to help set this up.)
--
Some further inspiration I’m drawing on to be excited about this org:
Freakonomics’ RCT on measuring the effects of big life changes like quitting your job or breaking up with your partner. This makes me optimistic about the feasibility of getting lots of people to sign up.
Holden’s note on doing these type of experiments with digital people. He mentions some difficulties with running these types of RCTs today, but I think an org specializing in them could help.
Votes/considerations on why this is a good or bad idea are also appreciated!
Thanks, I agree with this and it’s probably not good branding anyway.
I was thinking the “challenge” was just doing the intervention (e.g. being vegan), but agree that the framing is confusing since it refers to something different in the clinical context. I will edit my shortforms to reflect this updated view.
Impactful Forecasting Prize for forecast writeups on curated Metaculus questions
Given the success of this experiment, we should propose a modified version of futarchy where laws are similarly written letter by letter!
What are your thoughts on This can’t go on?
Yeah I’ve been sporadically making progress on a personal forecasting retrospective, will include reflections and updated forecasts if/when I get around to finishing that.
Overall agree that progress was very surprising and I’ll be thinking about how it affects my big picture views on AI risk and timelines; a few relatively minor nitpicks/clarifications below.
For instance, superforecaster Eli Lifland posted predictions for these forecasts on his blog.
I’m not a superforecaster (TM) though I think some now use the phrase to describe any forecasters with good ~generalist track records?
While he notes that the Hypermind interface limited his ability to provide wide intervals on some questions, he doesn’t make that complaint for the MATH 2022 forecast and posted the following prediction, for which the true answer of 50.3% was even more of an outlier than Hypermind’s aggregate:
[image]
The image in the post is for another question; below is my prediction for MATH, though it’s not really more flattering. I do think my prediction was quite poor.
I didn’t run up to the maximum standard deviation here, but I probably would have given more weight to larger values if I had been able to forecast a mixture of components like on Metaculus. The resolution of 50.3% would very likely (90%) still have been above my 95th percentile though.
Hypermind’s interface has some limitations that prevent outputting arbitrary probability distributions. In particular, in some cases there is an artificial limit on the possible standard deviations, which could lead credible intervals to be too narrow.
I think this maybe (40% for my forecast) would have flipped the MMLU forecast to be inside the 90% credible interval, at least for mine and perhaps for the crowd.
In my notes on the MMLU forecast I wrote “Why is the max SD so low???”
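To illustrate what I mean by a mixture of components, here's a sketch comparing a single capped-SD component to a two-component mixture with a fatter right tail (all numbers made up, not my actual forecast):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# One normal component with a capped standard deviation.
single = rng.normal(20, 5, size=n)
# 80/20 mixture: main guess plus a wide component for surprising outcomes.
mixture = np.where(rng.random(n) < 0.8,
                   rng.normal(20, 5, size=n),
                   rng.normal(40, 15, size=n))

for name, samples in [("single", single), ("mixture", mixture)]:
    print(name, "95th percentile:", round(float(np.percentile(samples, 95)), 1))
```

The mixture pushes the 95th percentile far higher, which is the kind of weight on larger values I would have wanted here.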
Steelmanning might be particularly useful in cases where we have reason to believe those who have engaged most with the arguments are biased toward one side of the debate.
As described in But Have They Engaged with the Arguments?, perhaps a reason many who dismiss AI risk haven’t engaged much with the arguments is the selection effect of engaging more if the first arguments one hears seem true. Therefore it might be useful to steelman arguments against AI risk made by generally reasonable people, arguments that might seem off due to lack of engagement with existing counterarguments, in order to extract potentially relevant insights (though perhaps an alternative is funding more reasonable skeptics to engage with the arguments much more deeply?).
I think it’s >1% likely that one of the first few surveys Rohin conducts would result in a fraction of >0.5.
Evidence from When Will AI Exceed Human Performance?, in the form of median survey responses of researchers who published at ICML and NIPS in 2015:
5% chance given to Human Level Machine Intelligence (HLMI) having an extremely bad long run impact (e.g. human extinction)
Does Stuart Russell’s argument for why highly advanced AI might pose a risk point at an important problem? 39% say at least important, 70% at least moderately important.
But on the other hand, only 8.4% said working on this problem now is more valuable than other problems in the field. 28% said it is as valuable as other problems.
47% agreed that society should prioritize “AI Safety Research” more than it currently was.
These seem like fairly safe lower bounds compared to the population of researchers Rohin would evaluate, since concern regarding safety has increased since 2015 and the survey included all AI researchers rather than only those whose work is related to AGI.
These responses are more directly related to the answer to Question 3 (“Does X agree that there is at least one concern such that we have not yet solved it and we should not build superintelligent AGI until we do solve it?”) than Question 2 (“Does X broadly understand the main concerns of the safety community?”). I feel very uncertain about the percentage that would pass Question 2, but think it is more likely to be the “bottleneck” than Question 3.
Given these considerations, I increased the probability before 2023 to 10%, with 8% below the lower bound. I moved the median | not never up to 2035 as a higher probability pretty soon also means a sooner median. I decreased the probability of “never” to 20%, since the “not enough people update on it / consensus building takes forever / the population I chose just doesn’t pay attention to safety for some reason” condition seems not as likely.
I also added an extra bin to ensure that the probability continues to decrease on the right side of the distribution.
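Roughly, the structure of the resulting distribution looks like this (the lognormal shape is only loosely tuned to the numbers above and is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

p_never = 0.20
never = rng.random(n) < p_never
# Conditional on "not never": a skewed distribution of years with median ~2035.
years = 2021 + rng.lognormal(mean=np.log(14), sigma=1.7, size=n)

print("P(never):", round(never.mean(), 3))
print("P(before 2023):", round(((~never) & (years < 2023)).mean(), 3))
print("median year | not never:", int(np.median(years[~never])))
```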
My snapshot
Note: I’m interning at Ought and thus am ineligible for prizes.