We’d probably try something along the lines you’re suggesting, but there are some interesting technical challenges to think through.
For example, we’d want to train the model to be good at predicting the future, not just at recalling what already happened. Under a naive implementation, weight updates would probably go partly towards better judgment and forecasting ability, but also partly towards memorising how the world played out after the initial training cutoff.
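To make that concrete, here’s a minimal sketch of one possible mitigation (not something we’ve built; the data structures and field names are hypothetical): censor each training example’s context so the model only sees documents dated before the question’s knowledge cutoff, so the gradient signal has to reward judgment rather than memorised outcomes.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical data structures for a forecasting training example.
@dataclass
class Document:
    text: str
    published: date

@dataclass
class ForecastExample:
    question: str
    knowledge_cutoff: date   # the model may only "know" things up to here
    resolution: float        # how the question actually resolved
    context: list[Document]  # retrieved/background documents

def censor_context(example: ForecastExample) -> ForecastExample:
    """Drop any document published after the knowledge cutoff, so training
    can't reward the model for simply recalling how events played out."""
    allowed = [d for d in example.context
               if d.published <= example.knowledge_cutoff]
    return ForecastExample(example.question, example.knowledge_cutoff,
                           example.resolution, allowed)

# Usage: apply to every example before computing the forecasting loss.
# train_set = [censor_context(ex) for ex in raw_examples]
```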
There are also questions around information retrieval (IR): it seems likely that models will need external retrieval mechanisms to forecast well for at least the next few years, and we’d want to train something that’s natively good at using retrieval tools to forecast, rather than relying purely on its crystallised knowledge.
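As a rough illustration of what “natively good at using retrieval tools” might look like at inference time, here’s a minimal tool-use loop; the `llm` and `search` callables are placeholders standing in for whatever model and retrieval backend you’d actually use, not our setup.

```python
from typing import Callable

def forecast_with_retrieval(
    question: str,
    llm: Callable[[str], str],     # placeholder: prompt -> model reply
    search: Callable[[str], str],  # placeholder: query -> retrieved snippets
    max_searches: int = 5,
) -> str:
    """Let the model interleave retrieval calls with reasoning before it
    commits to a forecast. Replies starting with 'SEARCH:' trigger the
    retrieval tool; anything else is treated as the final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_searches):
        reply = llm(transcript +
                    "Reply with 'SEARCH: <query>' or a final probability.\n")
        if reply.startswith("SEARCH:"):
            query = reply[len("SEARCH:"):].strip()
            transcript += f"SEARCH: {query}\nRESULTS: {search(query)}\n"
        else:
            return reply  # final forecast
    # Out of search budget: force a final answer from what's been gathered.
    return llm(transcript + "Give your final probability now.\n")
```

The point of training against a loop like this, rather than plain question-answer pairs, is that the model gets credit for knowing when and what to look up, not just for what it happens to remember.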
Good question. We don’t explicitly break this out in our analysis, but we do give models the chance to give up, and some of our task instances actually require the model to give up, because the target number can’t be found.
Anyway, from eyeballing results and traces, I get the sense that 70-80% of failures on the “find number” task are incorrect assertions rather than refusals to answer.
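If we wanted to break this out properly rather than eyeball it, the bookkeeping is simple; a sketch, assuming each transcript record carries a `correct` flag and a `gave_up` flag (both field names made up for illustration):

```python
from collections import Counter

def classify_failure(record: dict) -> str:
    """Label a failed attempt as a refusal or a wrong assertion.
    Assumes each record has 'gave_up' (bool) and 'correct' (bool) fields."""
    return "refusal" if record["gave_up"] else "incorrect_assertion"

def failure_breakdown(records: list[dict]) -> Counter:
    """Count failure modes over all records that didn't get the right answer."""
    failures = [r for r in records if not r["correct"]]
    return Counter(classify_failure(r) for r in failures)

# Example:
# failure_breakdown([{"correct": False, "gave_up": False},
#                    {"correct": False, "gave_up": True}])
# -> Counter({'incorrect_assertion': 1, 'refusal': 1})
```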