Nice summary! I agree, this is an interesting paper :)
But learning to predict such random future states seems to fall subject to exactly the same problem as learning to predict future observations: you have no guarantee that EfficientZero is learning relevant information, which means it could be wasting network capacity on irrelevant information. There’s a just-so story where adding this extra predictive loss results in worse end-to-end behavior because of the wasted capacity, just as there’s a just-so story where it results in better end-to-end behavior because of faster training. I’m not sure why one turned out to be true rather than the other.
This mostly depends on the size of your dataset. For very small datasets (100k frames here), the network is overparameterized and can easily overfit; adding the consistency loss provides regularisation that helps prevent this.
For larger datasets (e.g. the standard 200-million-frame setting in Atari) you’ll see less overfitting, and I would expect the impact of the consistency loss to be much smaller, possibly negative. The paper doesn’t include ablations for this, but I might test it if I have time.
To phrase it differently: the less data you have for your real objective, the more you can benefit from auxiliary losses and regularisation.
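For concreteness, the consistency loss in question is a SimSiam-style objective: the latent state the dynamics network predicts for step t+1 should match the encoder's embedding of the actual observation at t+1. Here's a minimal numpy sketch of just the core term (the function name is mine; the real implementation adds projection/prediction heads and a stop-gradient on the target branch, which I'm only noting in comments):

```python
import numpy as np

def cosine_consistency_loss(predicted_latent, target_latent):
    """Negative cosine similarity between the dynamics model's predicted
    next latent state and the encoder's embedding of the real next
    observation. In the actual training loop the target branch is under
    a stop-gradient, so only the prediction side receives gradients."""
    p = predicted_latent / np.linalg.norm(predicted_latent)
    t = target_latent / np.linalg.norm(target_latent)
    # Loss is minimised (-1) when the two latents point the same way.
    return -float(np.dot(p, t))
```

Perfectly aligned latents give a loss of -1, orthogonal ones give 0, so minimising this pushes the learned dynamics to stay consistent with what the encoder actually sees — which is where the regularisation effect comes from.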
An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training then it has been trained for one “epoch”.
In my experience, scaling continues for quite a few epochs over the same dataset; overfitting only kicks in and scaling breaks down once the model has more parameters than the dataset has tokens and you train for >10 epochs.
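To spell out the tokens-vs-epochs arithmetic above (the numbers here are made up for illustration, not from any real training run):

```python
# One pass over every token in the dataset is one "epoch", so:
#   epochs = tokens_trained / dataset_tokens
dataset_tokens = 300_000_000_000  # hypothetical 300B-token dataset
tokens_trained = 900_000_000_000  # hypothetical training budget
epochs = tokens_trained / dataset_tokens
print(epochs)  # 3.0 -> each token was seen three times
```

So "trained on 900B tokens" can mean a 900B-token dataset seen once, or a 300B-token dataset seen three times — and per the comment above, those behave differently only once you push well past ~10 epochs on an overparameterized model.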