Instead of writing a long comment, we wrote a separate post that, like @habryka and Daniel Halawi did, looks into this carefully. We re-read all four papers that made these misleading claims this year and show where each falls short.
https://www.lesswrong.com/posts/uGkRcHqatmPkvpGLq/contra-papers-claiming-superhuman-ai-forecasting
dschwarz
Contra papers claiming superhuman AI forecasting
Unit economics of LLM APIs
Good point. For this public report, we manually checked all the data points that were included here. FutureSearch threw out many other unreliable data points it couldn’t corroborate; that’s a core part of what it does.
Due to a bug, the sources linked here are low-quality data brokers. A higher-quality data source corroborates this, but FutureSearch doesn’t cite that one.
We’re working on fixing this, and identifying all primary vs. secondary sources.
All of the research was done by FutureSearch (that is, by AI), with a few exceptions, such as https://app.futuresearch.ai/reports/3Li1?nodeId=MIw9, where it couldn’t infer good team/enterprise ratios from analogous products with reliable numbers. Estimating ChatGPT Teams subscribers was the hardest part and required the most judgment.
Most of the final words in the report were written or revised by humans. We set a high quality bar for publishing this publicly, so there was more human intervention than normal.
[EAForum xpost] A breakdown of OpenAI’s revenue
(Responded to the version of this on the EA Forum post.)
[EA xpost] The Rationale-Shaped Hole At The Heart Of Forecasting
Great post!
| Manifold markets that were resolved after GPT-4’s current knowledge cutoff of Jan 1, 2022
Were you able to verify that newer knowledge didn’t bleed in? Anecdotally, GPT-4 can report different cutoff dates depending on the API. And there is anecdotal evidence that GPT-4-0314 occasionally knows about major world events after its training window, presumably from RLHF?
This could explain the better scores on politics than science.
Nice post! I’ll throw another signal boost for the Metaculus hackathon that OP links, since this is the first time Metaculus is sharing their whole 1M db of individual forecasts (not just the db of questions & resolutions which is already available). You have to apply to get access though. I’ll link it again even though OP already did: https://metaculus.medium.com/announcing-metaculuss-million-predictions-hackathon-91c2dfa3f39
There are nice cash prizes too. As the OP writes, I think most of the ideas here would be valid entries in the hackathon, though the emphasis is on forecast aggregation and methods for scoring individuals. I’m particularly interested in the decay-of-predictions idea. I don’t think we know how well predictions age, or what the right strategy is for updating your predictions on long-running questions.
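One rough way to look at how predictions age is to score each forecast with the Brier score and average by how long before resolution it was made. The sketch below is only an illustration: the `(probability, days_before_resolution, outcome)` tuple layout, the `brier_by_age` helper, and the sample numbers are all made up here and are not the schema or contents of the Metaculus dataset.

```python
from collections import defaultdict


def brier(p, outcome):
    """Brier score for one binary forecast: squared error, lower is better."""
    return (p - outcome) ** 2


def brier_by_age(forecasts, bucket_days=30):
    """Average Brier score per age bucket.

    forecasts: iterable of (probability, days_before_resolution, outcome),
    a hypothetical layout chosen for this example.
    Returns a dict mapping the bucket's lower bound in days to the mean score.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, days, outcome in forecasts:
        bucket = (days // bucket_days) * bucket_days
        sums[bucket] += brier(p, outcome)
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Made-up forecasts on questions that all resolved "yes" (outcome = 1).
sample = [
    (0.9, 5, 1),
    (0.7, 20, 1),
    (0.6, 45, 1),
    (0.4, 70, 1),
]
scores = brier_by_age(sample)
for bucket, score in scores.items():
    print(f"{bucket}-{bucket + 29} days before resolution: {score:.3f}")
```

If the mean score rises steadily with age on real data, that would suggest older forecasts should be down-weighted in aggregation or refreshed more often, though a proper analysis would also need to control for which questions attract early versus late forecasts.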
Metaculus is seeking Software Engineers
I have to respectfully disagree with your position. Kant’s point, and the point of similar people who make the sweeping universalizations that you dislike, is that it is only in such idealized circumstances that we can make rational decisions. What makes a decision good or bad is whether it would be the decision rational people would endorse in a perfect society.
The trouble is not moving from our flawed world to an ideal world. The trouble is taking the lesson we’ve learned from considering the ideal world and applying it to the flawed world. Kant’s program is widely considered to be a failure because it fails to provide real guidelines for the real world.
Basically, my point is that asking the Rawlsian “Would you prefer to live in a society where people do X” is valid. However, one may answer that question with “yes” and still rationally refrain from doing X. So your general point, that local and concrete decisions rule the day, still stands. Personally, though, I try to approach local and concrete decisions the way that Rawls does.
Thank you for the careful look into data leakage in the other thread! Some of your findings were subtle, and these are very important details.