How might we quantify size in our definitions above?
A random Kolmogorov-complexity-inspired measure of size for a context / property / pattern:
Least number of squares you need to turn on, starting from an empty board, so that the grid eventually evolves into the context.
It doesn’t work for infinite contexts though.
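For concreteness, here is a brute-force sketch of this measure for a finite target pattern (the helper names are hypothetical, "eventually" is approximated by a cap of max_steps generations, and the search is exponential, so it is only feasible for tiny boards):

import numpy as np
from itertools import combinations
from scipy.signal import convolve2d

KERNEL = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])

def step(board):
    # One Game of Life generation: a cell is alive with 3 neighbours, or 2 if already alive
    neighbours = convolve2d(board, KERNEL, mode="same")
    return ((neighbours == 3) | ((board == 1) & (neighbours == 2))).astype(int)

def min_seed_size(target, size=6, max_steps=32):
    cells = [(i, j) for i in range(size) for j in range(size)]
    for k in range(1, len(cells) + 1):  # try seeds of increasing size
        for seed in combinations(cells, k):
            board = np.zeros((size, size), dtype=int)
            for i, j in seed:
                board[i, j] = 1
            for _ in range(max_steps):
                board = step(board)
                if np.array_equal(board, target):
                    return k  # smallest seed that evolves into the target
    return None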
Here is your estimate in the context of other big AI models:
An interactive visualization is available here.
I skimmed this paper and liked it. My personal takeaway is that historical fit to a sigmoid model is not predictive of future fit. If you want a good sigmoid forecast, you need good priors on the mechanisms that accelerate and dampen the curve.
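As a toy illustration of that takeaway (synthetic data with made-up parameters, not from the paper): fitting a logistic curve to only the pre-inflection portion of a series typically pins down the ceiling very poorly, even when the historical fit looks good.

import numpy as np
from scipy.optimize import curve_fit

sigmoid = lambda t, K, r, t0: K / (1 + np.exp(-r * (t - t0)))

t = np.arange(40)
y = sigmoid(t, 100, 0.3, 25) + np.random.normal(0, 1, 40)  # true ceiling K = 100

# Fit on the first half (before the inflection point) vs the full series
popt_half, _ = curve_fit(sigmoid, t[:20], y[:20], p0=[50, 0.1, 10], maxfev=10000)
popt_full, _ = curve_fit(sigmoid, t, y, p0=[50, 0.1, 10], maxfev=10000)
print(popt_half[0], popt_full[0])  # the pre-inflection estimate of K is usually way off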
Thank you for sharing this!
My user experience
When I first load the page, I am greeted by an empty space.
From here I didn’t know what to look for, since I didn’t remember what kind of things were in the database.
I tried clicking on the table to see what content is there.
Ok, too much information, hard to navigate.
I remember that one of my manuscripts made it into the database, so I look up my surname.
That was easy! (and it loaded very fast)
The interface is very neat too. I want to see more papers, so I click on one of the tags.
I get what I wanted.
Now I want to find a list of all the tags. Hmmm I cannot find this anywhere.
I give up and look at another paper:
Oh cool! The Alignment Newsletter summary is really great. Whenever I read something in Google Scholar it is really hard to find commentary on any particular piece.
Now I try searching for my current research topic to find related work
Meh, not really anything interesting for my research.
Ok, now I want to see if OpenAI’s “AI and compute” post is in the dataset:
Huh, it is not here. The Bitter Lesson is definitely relevant, but I am not sure about the other articles.
Can I search for work specific to open ai?
Hmm, that didn’t quite work. The top result is from OpenAI, but the rest are not.
Maybe I should spell it differently?
Oh cool that worked! So apparently the blogpost is not in the dataset.
Anyway, enough browsing for today.
This is a very cool tool. The interface is neatly designed.
Discovering new content seems hard. Some things that could help include a) adding recommended content on load (perhaps the most cited items, or even ~10 random papers) and b) providing a list of tags somewhere
The reviewer blurbs are very nice. However, I do not expect to use this tool. Or rather, I cannot think right now of what exactly I would use it for. It has made me consider reaching out to the database maintainers to suggest the inclusion of an article of mine. So maybe like that, to promote my work?
Deep or shallow version?
One more question: for the BigGAN which model do your calculations refer to?
Could it be the 256x256 deep version?
Some discussion from the OP here
Update: I tried regressing on the ordinal position of the world records and found a much better fit, and better (above baseline!) forecasts of the last WR of each category.
This makes me update further towards the hypothesis that date is a bad predictive variable. Sadly, this would mean that we really need to track whatever the WR index is correlated with (presumably the cumulative number of runs put in overall by the speedrunning community).
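For reference, a minimal sketch of the regression I mean (with made-up numbers):

import numpy as np

# Hypothetical world-record times in seconds, in chronological order
times = np.array([3600.0, 3450.0, 3300.0, 3210.0, 3150.0, 3100.0, 3080.0])
idx = np.arange(len(times))  # ordinal position of each WR, used as the regressor

# Fit log(time) on the ordinal index, holding out the last WR as a test point
slope, intercept = np.polyfit(idx[:-1], np.log(times[:-1]), 1)
print(np.exp(slope * idx[-1] + intercept), times[-1])  # forecast vs held-out WR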
This is so cool!
It seems like the learning curves are reasonably close to the diagonal, which means that:
Given the logarithmic X-axis, it seems like improvements become increasingly harder over time. You need to invest exponentially more time to get a linear improvement.
The rate of logarithmic improvement is overall relatively constant.
On the other hand, despite all curves being close to the diagonal, they seem to mostly undershoot it. This might imply that the rate of improvement is slightly decreasing over time.
One thing about this graph that tripped me up, noted here for other readers: the relative attempt is relative to the number of WR improvements. That means that if there are 100 WRs, the point with relative attempt = 0.5 is the 50th WR improvement, not the WR whose date is closest to the midpoint between the first and last attempt.
So this graph is giving information about “conditional on you putting enough effort to beat the record, by how much should you expect to beat it?” rather than on “conditional on spending X amount of effort on the margin, by how much should you expect to improve the record?”.
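To make the two normalizations concrete (made-up numbers):

import numpy as np

# Days since the first run, and the record time set on each of those days
dates = np.array([0, 3, 5, 40, 200, 210, 600])
times = np.array([100, 95, 92, 90, 88, 87, 86])

# Ordinal normalization (what the graph above uses): the k-th WR out of n
rel_attempt = np.arange(len(times)) / (len(times) - 1)

# Date normalization (the alternative): position of each WR within the time span
rel_date = (dates - dates[0]) / (dates[-1] - dates[0])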
Here is the plot that corresponds to the other question, where the x-axis value is proportional not to the ordinal index of the WR improvement but to the date when the WR was submitted.
It shows a far weaker correlation. This suggests that a) the best predictor of new WRs is the overall number of runs being put into the game, and b) the number of new WRs around a given time is a good estimate of the overall number of runs being put into the game.
This has made me update a bit against plotting WR vs time, and in favor of plotting WR vs cumulative number of runs. Here are some suggestions on how one could go about estimating the number of runs being put into a game, if somebody wants to look into this!
PS: the code for the graph above, and the code to replicate Andy’s graph, are now here
Is there any way to estimate how many cumulative games speedrunners have run at a given point?
One should be able to use the Speedrun.com API to query the number of runs submitted by a certain date, as a proxy for the cumulative number of games (though it will not reflect all attempts, since AFAIK many runners only submit their personal bests to speedrun.com).
Additionally, speedrun.com provides some stats on the number of runs and players for each game; for example, the current stats for Super Metroid can be found here: https://www.speedrun.com/supermetroid/gamestats
There are some problems with this approach too.
These stats are aggregated by game, not by category, so one would need to somehow split the runs among the popular categories of the same game.
Only current data is available through the webpage. There might be a way to access historical data through the API; if not, one would need to use archived versions of the pages and interpolate the scraped stats.
I’d be excited to learn about the results of either approach if anybody ends up scraping this data!
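For what it’s worth, a rough sketch of the API-counting approach (untested; the exact field names and pagination limits should be double-checked against the API docs):

import requests

def count_runs(game_id, cutoff="2021-01-01"):
    # Count runs submitted for a game up to a cutoff date, paginating 200 at a time
    url = "https://www.speedrun.com/api/v1/runs"
    params = {"game": game_id, "orderby": "submitted", "direction": "asc", "max": 200}
    count, offset = 0, 0
    while True:
        page = requests.get(url, params={**params, "offset": offset}).json()["data"]
        count += sum(1 for run in page if (run.get("submitted") or "") <= cutoff)
        if len(page) < 200:
            break
        offset += 200
    return count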
Those are good suggestions!
Here is what happens when we align the start dates and plot the improvements relative to the time of the first run.
I am slightly nervous about using the first run as the reference, since early data in a category is quite unreliable and basically reflects when the first person thought to submit a run. But I think it should not create any problems.
Interestingly, plotting the relative improvement reveals some S-curve patterns, with phases of increasing returns followed by phases of diminishing returns.
I also did not manage to beat the baseline by extrapolating the relative improvement times. Interestingly, using a grid to count non-improvements as observations made the extrapolation worse, so this time the best fit was achieved with log-linear regression over the last 8 weeks of data in each category.
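For concreteness, the extrapolation step looks roughly like this (made-up numbers; 8 weeks = 56 days):

import numpy as np

# Days since the first run, and the record times set on those days
days = np.array([0, 10, 30, 80, 120, 130, 150, 160])
times = np.array([100, 95, 92, 90, 88, 87, 86, 85.5])

recent = days >= days[-1] - 56                      # keep only the last 8 weeks
slope, intercept = np.polyfit(days[recent], np.log(times[recent]), 1)
forecast = lambda d: np.exp(slope * d + intercept)  # predicted record at day d
print(forecast(days[-1] + 28))                      # e.g. forecast 4 weeks ahead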
As before, the code to replicate my analysis is available here.
Haven’t had time yet to include logistic models or to analyze the derivative of the improvements. If you feel so inclined, feel free to reuse my code to perform the analysis yourself; if you share the results here, we can comment on them!
PS: there is a sentence missing an ending in your comment
Do you mind sharing your guesstimate of the number of parameters?
Also, do you happen to have guesstimates of the number of parameters / compute for other systems?
Ah, I just realized this is the norm with curated posts. FWIW I feel a bit awkward about having the curation comments displayed so prominently, since it alters the author’s intended reading experience in a way I find a bit weird / off-putting.
If it were up to me, I would remove the curator’s words at the top of a post in favor of comments like this one, where the reasons for curation are explained without being the first thing readers see when opening the post.
Meta: it seems like you have accidentally added this comment at the beginning of the post, in addition to posting it as a comment?
What is the GShard dense transformer you are referring to in this post?
Very tangential to the discussion, so feel free to ignore, but given that you have put some thought into prize structures before, I am curious about your reasoning for awarding a different prize for something done in the past versus something done in the future.
I think this improper prior approach makes sense.
I am a bit confused by the step where you go from an improper prior to saying that the “expected” effort would land in the middle of these numbers. Since the continuous part of the total-effort-versus-doubling-factor curve is concave, I would expect the “expected” effort to be weighted more towards the lower bound.
I tried coding up a simple setup where I average the curves across a space of difficulties to approximate the “improper prior”, but it is very hard to draw a conclusion from it. I think the graph suggests that the asymptotic minimum is somewhere above 2.5, but I am not sure at all.
Also, I guess it is unclear to me whether a flat uninformative prior is best, versus an uninformative prior over the log-space of difficulties.
What do you think about both of these things?
Code for the graph:
import numpy as np
import matplotlib.pyplot as plt

# Closed-form geometric series 1 + b + ... + b^ceil(log_b(d)): total effort when scaling by b until difficulty d is cleared
effort_spent = lambda d, b: (b ** (np.ceil(np.log(d) / np.log(b)) + 1) - 1) / (b - 1)

ds = np.linspace(2, 1000000, 100000)  # flat prior over difficulties
bs = np.linspace(1.1, 5, 1000)        # candidate scaling factors
hist = np.zeros(shape=(1000,))
for d in ds:
    hist += effort_spent(d, bs) / len(ds)  # average effort curve across difficulties

plt.plot(bs, hist)
plt.xlabel("scaling factor b")
plt.ylabel("average total effort")
plt.show()
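To check the log-space alternative, one could swap the flat grid of difficulties for a logarithmic one and rerun the same loop:

# Log-uniform prior: equal weight per order of magnitude of difficulty
ds = np.logspace(np.log10(2), 6, 100000)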