Agree, whether a world-modeling technique makes money does seem like a load-bearing part of one’s mental model of it.
I do want to emphasize, though, that directly testing forecast accuracy via Brier scores is often easier and more informative.
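For concreteness, here’s a minimal sketch of what scoring with Brier scores looks like for binary questions. The function name and the example forecasts are made up for illustration, not real data or anyone’s actual pipeline.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    forecasts: probabilities in [0, 1] that each question resolves YES.
    outcomes:  1 if the question resolved YES, 0 if it resolved NO.
    Lower is better; always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one resolved question")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical forecasts on four resolved questions (illustrative only).
forecasts = [0.9, 0.2, 0.7, 0.5]
outcomes = [1, 0, 0, 1]

score = brier_score(forecasts, outcomes)
baseline = brier_score([0.5] * len(outcomes), outcomes)  # the "know nothing" forecaster
print(f"Brier score: {score:.4f} (always-50% baseline: {baseline:.4f})")
```

The same function works for live forecasting and for past-casting, since all it needs is the stated probability and the eventual resolution.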
It’s a fair point. I don’t think prediction market profit is the best eval of these methods. Pure forecast accuracy is easy to measure, either in live forecasting or past-casting. But I agree prediction market signal is one type of eval, so we do use it at FutureSearch to learn from our mistakes and improve.
Our portfolios and our P&L on Kalshi and Polymarket are at https://markets.futuresearch.ai/. It’s mostly synthetic trading, though real trading at Kalshi has 2 weeks of data now.
@mabramov, I’d be curious whether you’d respond to @Josh Rosenberg’s points about comparative evaluation of impact for other grant-based non-profits generating research and public information.
How valuable do you think the four examples Josh gave (Epoch, Our World In Data, GovAI, and IAPS) have been? Do you think the grants for these orgs have a good ROI compared to forecasting research?
I have coined Schwarz’s First Law, which is “Everyone is only good at one thing.”
Comes up a lot. Scott Alexander made this point in his recent https://www.astralcodexten.com/p/the-dilbert-afterlife:
> Michael Jordan was the world’s best basketball player, and insisted on testing himself against baseball, where he failed. Herbert Hoover was one of the world’s best businessmen, and insisted on testing himself against politics, where he crashed and burned. We’re all inmates in prisons of different names. Most of us accept it and get on with our lives. Adams couldn’t stop rattling the bars.
This expanded list is great, but is still conspicuously missing white-collar work. Software was already the basis for the trend, so the only new one here that seems to give clear information on human labor impacts would be tesla_fsd.
(And even there replacing human drivers with AI drivers doesn’t seem like it would change much for humanity, compared to lawyers/doctors/accountants/sales/etc.)
Is it the case that for most non-software white-collar work, agents can only do ~10-20 human-minute tasks with any reliability, so the doubling time is hard to measure?
9 years since the last comment—I’m interested in how this argument interacts with GPT-4 class LLMs, and “scale is all you need”.
Sure, LLMs are not evolved in the same way as biological systems, so the path towards smarter LLMs isn’t fragile in the way brains are described in this article, where maybe the first augmentation works, but the second leads to psychosis.
But LLMs are trained on writing done by biological systems with intelligence that was evolved with constraints.
So what does this say about the ability to scale up training on this human data in an attempt to reach superhuman intelligence?
Thank you for the careful look into data leakage in the other thread! Some of your findings were subtle, and these are very important details.
Instead of writing a long comment, we wrote a separate post that, like @habryka and Daniel Halawi did, looks into this carefully. We re-read all 4 papers making these misleading claims this year and show our findings on how they’re falling short.
https://www.lesswrong.com/posts/uGkRcHqatmPkvpGLq/contra-papers-claiming-superhuman-ai-forecasting
Good point. For this public report, we manually checked all the data points that were included here. FutureSearch threw out many other unreliable data points it couldn’t corroborate; that’s a core part of what it does.
The sources linked here are low-quality data brokers due to a bug: there is a higher-quality data source corroborating it, but FutureSearch doesn’t cite the higher-quality one.
We’re working on fixing this, and identifying all primary vs. secondary sources.
All of the research was done by FutureSearch (so, by AI), with a few exceptions, such as https://app.futuresearch.ai/reports/3Li1?nodeId=MIw9, where it couldn’t infer good team/enterprise ratios from analogous products where numbers were reliable. Estimating ChatGPT Teams subscribers was the hardest part, requiring the most judgment.
Most of the final words in the report were written or revised by humans. We put a high quality bar on this to publish it publicly, and did more human intervention than normal.
(Responded to the version of this on the EA Forum post.)
Let’s say you were an Anthropic negotiator in DC, and you had to choose between, say, offering a public apology, handing over Glasswing information, weakening the model, etc.
Would you try a causal model like this, based on forecasts of the outcomes conditioned on each of your choices? Since you’re the decision maker in this futarchy scenario, there should be no entanglement between the current world, and the worlds where you’ve made each choice, right?