I also found this thread of math topics on AI safety helpful.

# Jsevillamol

# The longest training run

# A time-invariant version of Laplace’s rule

# Announcing Epoch: A research organization investigating the road to Transformative AI

Ah sorry for the lack of clarity—let’s stick to my original submission for PVE

That would be:

[0,1,0,1,0,0,9,0,0,1,0,0]

Yes, I am looking at decks that appear in the dataset, and more particularly at decks that have faced a deck similar to the rival’s.

Good to know that one gets similar results using the different scoring functions.

I guess that maybe the approach does not work that well ¯\_(ツ)_/¯

Thank you for bringing this up!

I think you might be right, since the deck is quite undiverse and, according to the rest of the results, diversity is important. That being said, I could not find the mistake in the code at a glance :/

Do you have any opinions on [1, 1, 0, 1, 0, 1, 2, 1, 1, 3, 0, 1]? According to my code, this would be the worst deck amongst the decks that played against a deck similar to the rival’s.

Marius Hobbhahn has estimated the number of parameters here. His final estimate is

**3.5e6 parameters**.

Anson Ho has estimated the training compute (his reasoning is at the end of this answer). His final estimate is

**7.8e22 FLOPs**.

Below I made a visualization of the parameters vs training compute of n=108 important ML systems, so you can see how DeepMind’s system (labelled GOAT in the graph) compares to other systems.

[Final calculation]

(8 TPUs)(4.20e14 FLOP/s)(0.1 utilisation rate)(32 agents)(7.3e6 s/agent) = 7.8e22 FLOPs

==========================

NOTES BELOW

[Hardware]

- “Each agent is trained using 8 TPUv3s and consumes approximately 50,000 agent steps (observations) per second.”

- TPUv3 (half precision): 4.2e14 FLOP/s

- Number of TPUs: 8

- Utilisation rate: 0.1

[Timesteps]

- Figure 16 shows steps per generation and agent. In total there are 1.5e10 + 4.0e10 + 2.5e10 + 1.1e11 + 2e11 = 3.9e11 steps per agent.

- 3.9e11 / 5e4 = 8e6 s → ~93 days

- 100 million steps is equivalent to 30 minutes of wall-clock time in our setup. (pg 29, fig 27)

- 1e8 steps → 0.5h

- 3.9e11 steps → 1950h → 7.0e6 s → ~82 days

- Both of these seem like overestimates, because:

“Finally, on the largest timescale (days), generational training iteratively improves population performance by bootstrapping off previous generations, whilst also iteratively updating the validation normalised percentile metric itself.” (pg 16)

- Suggests that the above is an overestimate of the number of days needed, else they would have said (months) or (weeks)?

- Final choice (guesstimate): 85 days = 7.3e6 s

[Population size]

- 8 agents? (pg 21) → this is describing the case where they’re not using PBT, so ignore this number

- The original PBT paper uses 32 agents for one task https://arxiv.org/pdf/1711.09846.pdf (in general it uses between 10 and 80)

- (Guesstimate) Average population size: 32
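As a sanity check, the final calculation can be reproduced directly. All figures are taken from the notes above; the utilisation rate and population size are the guesstimates noted there:

```python
# Rough training-compute estimate for DeepMind's GOAT system.
# Utilisation rate and population size are guesstimates (see notes).
tpus_per_agent = 8
flops_per_tpu = 4.20e14    # TPUv3, half precision, FLOP/s
utilisation = 0.1
n_agents = 32              # guesstimated average population size
seconds_per_agent = 7.3e6  # ~85 days of training per agent

total_flops = (tpus_per_agent * flops_per_tpu * utilisation
               * n_agents * seconds_per_agent)
print(f"{total_flops:.2e}")  # prints 7.85e+22
```

The product rounds to the 7.8e22 FLOPs reported above.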

Fixed, thanks!

Here is my very bad approach after spending ~one hour playing around with the data

1. Filter the decks that fought against a deck similar to the rival’s, using a simple measure of distance (the sum of absolute differences between the deck components).

2. Compute a ‘score’ for each deck. The score is defined as the sum of 1/deck_distance(deck) * (1 or −1 depending on whether the deck won or lost against the challenger).

3. Report the deck with the maximum score.

So my submission would be: [0,1,0,1,0,0,9,0,0,1,0,0]
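A minimal sketch of the procedure, under my reading of the scoring rule (the match-record format, the `max_dist` cutoff, and the handling of zero distances are all illustrative assumptions, not the actual code):

```python
def deck_distance(a, b):
    """Sum of absolute differences between deck components."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_deck(matches, rival, max_dist=3):
    """Score decks by weighted wins against rival-like opponents.

    matches: iterable of (deck, opponent_deck, deck_won) records
             (hypothetical layout; the real dataset will differ).
    """
    scores = {}
    for deck, opponent, won in matches:
        # Keep only matches where the opponent resembles the rival.
        d = deck_distance(opponent, rival)
        if d > max_dist:
            continue
        # Closer opponents contribute more; wins add, losses subtract.
        # max(d, 1) is an assumption to avoid division by zero.
        scores.setdefault(tuple(deck), 0.0)
        scores[tuple(deck)] += (1 if won else -1) / max(d, 1)
    return max(scores, key=scores.get)
```

For example, with two recorded matches against an opponent identical to the rival, the deck that won is returned: `best_deck([([0, 1], [1, 1], True), ([2, 0], [1, 1], False)], rival=[1, 1])` gives `(0, 1)`.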

Seems like you want to include A, L, P, V, E in your decks, and avoid B, S, K. Here is the correlation between the quantity of each card and whether the deck won. The ordering is ~similar when computing the inclusion winrate for each card.
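The per-card correlation can be computed along these lines (the deck matrix and outcomes in the example are hypothetical stand-ins for the real dataset):

```python
import numpy as np

def card_win_correlations(decks, won):
    """Pearson correlation between each card's count and winning.

    decks: (n_games, n_cards) array of card counts
    won:   (n_games,) array of 0/1 win outcomes
    """
    decks = np.asarray(decks, dtype=float)
    won = np.asarray(won, dtype=float)
    corrs = []
    for col in decks.T:
        if col.std() == 0:
            # A constant column has undefined correlation; report 0.
            corrs.append(0.0)
        else:
            corrs.append(np.corrcoef(col, won)[0, 1])
    return np.array(corrs)
```

On toy data like `card_win_correlations([[2, 0], [1, 1], [0, 2], [2, 1]], [1, 1, 0, 1])`, the first card correlates positively with winning and the second negatively, which is the kind of ordering the comment above refers to.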

Thanks for the comment!

I am personally sympathetic to the view that AlphaGo Master and AlphaGo Zero are off-trend.

In the regression with all models the inclusion does not change the median slope, but drastically increases noise, as you can see for yourself in the visualization by selecting the option **big_alphago_action = remove** (see the table below for a comparison of regressing the large-model trend without vs with the big AlphaGo models).

In appendix B we study the effects of removing AlphaGo Zero and AlphaGo Master when studying record-setting models. The upper bound of the slope is affected dramatically, and the R2 fit is much better when we exclude them; see table 6, reproduced below.

Following up on this: we have updated appendix F of our paper with an analysis of different choices of the threshold that separates large-scale and regular-scale systems. Results are similar regardless of the threshold choice.

# Compute Trends — Comparison to OpenAI’s AI and Compute

Thanks for engaging!

To use this theorem, you need both an observation (your data / evidence) and a parameter (the quantity you are trying to infer).

Parameters are abstractions we use to simplify modelling. What we actually care about is the probability of unknown events given past observations.

You start out discussing what appears to be a combination of two forecasts

To clarify: this is not what I wanted to discuss. The expert is reporting how you should update your priors given the evidence, and remaining agnostic on what the priors should be.

A likelihood isn’t just something you multiply with your prior, it is a conditional pmf or pdf with a *different outcome* than your prior.

The whole point of Bayesianism is that it offers a precise, quantitative answer to how you should update your priors given some evidence—and that is multiplying by the likelihoods.

This is why it is often recommended in the social sciences and elsewhere to report your likelihoods.

I’m not sure we ever observe [the evidence vector] directly

I agree this is not common in judgemental forecasting, where the whole updating process is very illegible. I think it holds for most Bayesian-leaning scientific reporting.

it is pretty clear from your post that you’re talking about evidence in the sense used above, not likelihoods.

I am not, I am talking about evidence = likelihood vectors.

One way to think about this is that the expert is just informing us about how we should update our beliefs. “Given that the pandemic broke out in Wuhan, your subjective probability of a lab break should increase and it should increase by this amount”. But the final probability depends on your prior beliefs, that the expert cannot possibly know.
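A toy version of this point, with made-up numbers: the expert publishes a likelihood vector, and readers with different priors each multiply it in and normalise, reaching different posteriors from the same reported update.

```python
def posterior(prior, likelihood):
    """Multiply a prior by reported likelihoods and normalise."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypotheses: (lab leak, natural origin). The expert reports the
# likelihood of the observed evidence under each hypothesis.
# All numbers here are invented for illustration.
likelihood = [0.6, 0.2]

# Two readers with different priors get different posteriors
# from the very same reported likelihoods:
print(posterior([0.1, 0.9], likelihood))  # ≈ [0.25, 0.75]
print(posterior([0.5, 0.5], likelihood))  # ≈ [0.75, 0.25]
```

The expert only supplies the `likelihood` vector; the final probability still depends on each reader’s own prior.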

I don’t think there is a unique way to go from [...] to, let’s say, [...], where [...]

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is *lossy*, and necessarily loses some information.

Great sequence—it is a nice compendium of the theories and important thought experiments.

I will probably use this as a reference in the future, and refer other people here for an introduction.

Looking forward to future entries!

# Projecting compute trends in Machine Learning

I am glad Yair! Thanks for giving it a go :)

As is often the case, I just found out that Jaynes was already discussing a similar issue to the paradox here in his seminal book.

This Wikipedia article summarizes the gist of it.