What is the GShard dense transformer you are referring to in this post?
Very tangential to the discussion, so feel free to ignore, but given that you have put some thought into prize structures before, I am curious about the reasoning for why you would award a different prize for something done in the past versus something done in the future.
I think this improper prior approach makes sense.
I am a bit confused by the step where you go from an improper prior to saying that the “expected” effort would land in the middle of these numbers. The continuous part of the total-effort-spent vs doubling-factor curve is concave, so I would expect the “expected” effort to be weighted more in favor of the lower bound.
I tried coding up a simple setup where I average the graphs across a space of difficulties to approximate the “improper prior”, but it is very hard to draw a conclusion from it. I think the graph suggests that the asymptotic minimum is somewhere above 2.5, but I am not sure at all.
Also, I am unsure whether a flat uninformative prior is best, versus an uninformative prior over the log-space of difficulties.
What do you think about both of these things?
Code for the graph:
import numpy as np
import matplotlib.pyplot as plt

# Total effort spent for difficulty d when effort is scaled up by a factor b
# on each attempt (change of base: log_b(d) = log(d) / log(b)).
effort_spent = lambda d, b: (b ** (np.ceil(np.log(d) / np.log(b)) + 1) - 1) / (b - 1)

ds = np.linspace(2, 1_000_000, 100_000)  # space of difficulties (flat improper prior)
bs = np.linspace(1.1, 5, 1000)           # candidate doubling factors

# Average the effort curve over all difficulties.
hist = np.array([effort_spent(ds, b).mean() for b in bs])

plt.plot(bs, hist)
plt.xlabel("doubling factor b")
plt.ylabel("average effort spent")
plt.show()
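As an aside, to approximate the log-space prior I mentioned above, I believe it is enough to sample the difficulties geometrically instead (a one-line change to the setup above):

ds = np.logspace(np.log10(2), 6, 100_000)  # uniform over the log of difficulty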
I really like this article.
It has helped me appreciate how product rules (or additivity, if we apply a log transform) arise in many contexts. One thing I hadn’t appreciated when studying Cox’s theorem is that you do not need to respect “commutativity” to get a product rule (though obviously this restricts how you can group information). This was made very clear to me in example 3.
One thing that confused me on the first reading was that I misunderstood you as referring to the third requirement as associativity of F. Rereading, this is not the case; you just say that the third requirement implies that F is associative. But I wish you had spelled out the implication, i.e. that decomposing R of the conjunction of A, B and C in its two possible groupings gives F(F(R(A), R(B|A)), R(C|A,B)) = F(R(A), F(R(B|A), R(C|A,B))).
Good suggestion! Understanding the trend of record-setting would indeed be interesting, so that we avoid the pesky influence of systems below the trend, like CURL in the game domain.
The problem with the naive setup of just regressing on record-setters is that it is quite sensitive to noise: one early outlier in the trend can completely alter the result.
I explore a similar problem in my paper Forecasting timelines of quantum computing, where we try to extrapolate progress on some key metrics like qubit count and gate error rate. The method we use in the paper to address this issue is to bootstrap the input and predict a range of possible growth rates; that way, outliers do not completely dominate the result.
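For concreteness, here is a minimal sketch of that kind of bootstrap (with made-up numbers, not the data or code from the paper):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical record-setting data: (year, metric value), e.g. qubit counts.
years = np.array([2016, 2017, 2018, 2019, 2020, 2021])
values = np.array([5.0, 9.0, 20.0, 53.0, 65.0, 127.0])

slopes = []
for _ in range(10_000):
    idx = rng.choice(len(years), size=len(years), replace=True)
    if np.unique(years[idx]).size < 2:
        continue  # need at least two distinct years to fit a line
    slope, _ = np.polyfit(years[idx], np.log(values[idx]), 1)
    slopes.append(slope)

lo, hi = np.percentile(slopes, [5, 95])
print(f"90% bootstrap interval for the yearly log-growth rate: [{lo:.2f}, {hi:.2f}]")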
I will probably not do it right now for this dataset, though I’d be interested in having other people try that if they are so inclined!
This is now fixed; see the updated graphs. We have also updated the eyeball estimates accordingly.
Trying to think a bit harder about this: maybe companies are sort of like this? To manage my online shop I need someone to maintain the website, someone to handle marketing, etc. I need many people working for me to make it work, and I need all of them at once. To make it more obvious, let’s suppose that I pay my workers in direct proportion to the sales they make.
As I painted it, this is not about amortizing a fixed cost. And I cannot subdivide the task: if I tell my team I expect to make only 10 sales and will pay accordingly, they are going to tell me to go eff myself (though maybe this breaks down in the magical world where there are no task-switching costs).
Another try: maybe a fairness constraint can force a minimum. The government has given me the okay to sell my new cryonics procedure, but only if I can make enough for everyone.
You are quite right that 1 and 2 are related, but the way I was thinking about them, I didn’t have them as equivalent.
1 is about fixed costs: each additional sheet of paper I produce amortizes part of the initial, fixed cost.
2 is about a threshold of operation: even if there are no fixed costs, it would happen in a world where I can only produce in large batches, not individual units.
Then again, I am struggling to think of a real-life example of 2, so maybe it is not something that happens in our universe.
I’m confused. Why would diminishing marginal returns incentivize trade? If the first unit of everything were very cheap, then I would rather produce it myself than produce extra of one thing (which costs more) and then trade.
Other magical powers of trade:
Economies of scale. It is basically as easy for me to produce 20 sheets of paper as to produce 1; after paying the setup costs, the marginal costs are much smaller in comparison. So all in all I would rather specialize in paper-making, have somebody else specialize in pencil-making, and then trade.
Investment. Often I need A LOT of capital to get something started, more than I could reasonably accumulate over a lifetime. So I would rather trade the starting capital for IOUs paid out of future profits.
Insurance. I may have a particularly bad harvest this year and a very good one the next, while my neighbour might have the opposite problem. All in all, I would rather we pool our harvests each year, so that we both have food in both years. So we are “trading” part of our harvest for insurance.
Thank you! The shapes mean the same as the colors (i.e. the domain); they were meant to make the graph clearer. Ideally, both shape and color would be reflected in the legend, but whenever I tried adding shapes to the legend, a separate new legend was created instead, which was more confusing.
If somebody reading this knows how to make the code produce a correct legend, I’d be very keen on hearing it! EDIT: Now fixed.
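In case it helps anyone with the same problem: I believe Altair (really Vega-Lite underneath) merges the shape and color legends into a single one when both channels encode the same field. A minimal sketch with made-up data, not our actual plotting code:

import altair as alt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16],
                   "domain": ["games", "games", "vision", "vision"]})

# Encoding shape and color from the same field should produce one
# combined legend rather than two separate ones.
chart = alt.Chart(df).mark_point().encode(
    x="x", y="y",
    color=alt.Color("domain:N"),
    shape=alt.Shape("domain:N"),
)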
Thank you! I think you are right: by default, the Altair library (which we used to plot the regressions) does an OLS fit of an exponential instead of fitting a linear model over the log transform. We’ll look into this and report back.
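To illustrate why the two procedures disagree, here is a generic sketch with synthetic data (this is not our plotting code):

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0, 0.3, size=x.size)  # noisy exponential

# (a) OLS fit of an exponential on the original scale:
# the largest-y points dominate the squared error.
(a1, b1), _ = curve_fit(lambda t, a, b: a * np.exp(b * t), x, y, p0=(1.0, 0.1))

# (b) Linear fit over the log transform: multiplicative errors
# are weighted equally across the whole range.
b2, log_a2 = np.polyfit(x, np.log(y), 1)

print(f"growth rate from original-scale OLS: {b1:.3f}; from log-space fit: {b2:.3f}")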
Thank you! Now fixed :)
Thank you for the feedback, I think what you say makes sense.
I’d be interested in seeing whether we can pin down exactly in what sense Switch parameters are “weaker”. Is it because of the lower precision? Or model sparsity (is Switch sparse in its parameters, or just sparsely activated)?
What do you think? What typology of parameters would make sense or be useful to include?
re: “I’d expect experts to care more about the specific details than I would”
Good point. We tried to account for this by making it so that the experts do not have to agree or disagree directly with each sentence, but instead choose the least bad of two extreme positions.
But in practice one of the experts bypassed the system by refusing to answer Q1 and Q2 and leaving an answer in the space for comments.