It’s working for me? I disabled the cache in devtools and am still seeing it. It looks like it’s hitting a LW-specific CDN also. (https://res.cloudinary.com/lesswrong-2-0/image/upload/v1674179321/mirroredImages/mRwJce3npmzbKfxws/kadwenfpnlvlswgldldd.png)
Thanks for this, this was a fun review of a topic that is both intrinsically and instrumentally interesting to me!
I remain pretty happy with most of this, looking back—I think this remains clear, accessible, and about as truthful as possible without getting too technical.
I do want to grade my conclusions / predictions, though.
(1). I predicted that this work would quickly be exceeded in sample efficiency. This was wrong—it’s been a bit over a year and EfficientZero is still SOTA on Atari. My 3-to-24-month timeframe hasn’t run out, but I said that I expected “at least a 25% gain” towards the start of that window, which hasn’t happened.
(2). There has been a shift to multitask domains, or to multi-benchmark papers. This wasn’t too hard of a prediction, but I think it was correct. (Although of course good evidence for such a shift would require a comprehensive lit review.)
To sample two—DreamerV3 is a very recently released model-based DeepMind algorithm. It does very well at Atari100k—it gets a better mean score than everything but EfficientZero—but it also does well at DMLab + 4 other benchmarks + even collecting a Minecraft diamond. The paper emphasizes the robustness of the algorithm, and is right to do so—once you get human-level sample efficiency on Atari100k, you really want to make sure you aren’t just overfitting to that!
And of course the infamous Gato is a multitask agent across a host of different domains, although the ultimate impact of it remains unclear at the moment.
(3). And finally—well, the last conclusion, that there is still a lot of space for big gains in performance in RL even without field-overturning new insights, is inevitably subjective. But I think the evidence still supports it.
Thermodynamics is the deep theory behind steam engine design (and many other things) -- it doesn’t tell you how to build a steam engine, but to design a good one you probably need to draw on it somewhat.
This post feels like a gesture at a deep theory behind truth-oriented forum / community design (and many other things)—it certainly doesn’t tell you how to build one, but to design a good one you have to think at least somewhere near what it talks about. Also applicable to many other things, of course.
It also has the virtue of being very short. Per word, it’s one of my favorite posts.
I like this post because it:

- Focuses on a machine which is usually non-central to accounts of the industrial revolution (at least in the others I’ve read), which makes it novel and interesting to those interested in the roots of progress
- Has a high ratio of specific empirical detail to speculation
- Separates speculation from historical claims pretty cleanly
This post is a good review of a book, an introduction to a space where small regulatory reform could result in great gains, and it also changed my mind about LNT. As an introduction to the topic, more focus on economic details would have been great, but you can’t be all things to all men.
There’s a scarcity of stories about how things could go wrong with AI which are not centered on the “single advanced misaligned research project” scenario. This post (and the mentioned RAAP post by Critch) helps partially fill that gap.
It definitely helped me picture / feel some of what some potential worlds look like, to the degree I currently think something like this—albeit probably slower, as mentioned in the story—is more likely than the misaligned research project disaster.
It also (1) is a pretty good / fun story and (2) mentions the elements within the story which the author feels are unlikely, which is virtuous and helps prevent greater detail from being mistaken for plausibility.
I like this post in part because of the dual nature of the conclusion, aimed at two different audiences. Focusing on the cost of implementing various coordination schemes seems… relatively unexamined on LW, I think. The list of life-lessons is intelligible, actionable, and short.
On the other hand, I think you could probably push it even further in “Secret of Our Success” tradition / culture direction. Because there’s… a somewhat false claim in it: “Once upon a time, someone had to be the first person to invent each of these concepts.”
This seems false about markets, for instance. Markets in goods can exist without any specific person understanding them or how they work, I think? (Far too much of history, after all, is people stumbling across markets, saying “This seems bad”, breaking them, and suffering the consequences.) And similarly, money-like things can certainly arise without anyone understanding them.
It’s also false (almost certainly?) about language, the o.g. coordination mechanism.
(And if you wanted to reach: Is monogamy a coordination scheme that makes men work harder, as some anthropologists think? If so, it’s doubtful it was conceived of as such by more than a tiny handful of people! Or maybe that’s just stretching “coordination scheme” way too far, I don’t know.)
I don’t really have a greater conclusion from this, though. These are all points in the same direction, moodwise, as the original article is pointing, I think.
That’s 100% true: the quote above is false for environments where the optimal strategy is stochastic. Very good catch. I’d expect naive action-value methods to have a lot of trouble in multi-agent scenarios.
The ease with which other optimization methods (i.e., policy optimization, which directly adjusts the likelihood of different actions, rather than using an estimate of the action-value function to choose actions) represent stochastic policies is one of their advantages over Q-learning, which can’t really do so. That’s probably one reason why extremely large-scale RL (e.g., StarCraft, Dota) tends to use more policy optimization (or some complicated mixture of both).
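To make that contrast concrete, here’s a minimal sketch of my own (not from any particular codebase): acting from Q-values means committing to the argmax, while a policy parameterized by logits outputs a full distribution and can represent any mixed strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based control: the policy is implicit in the Q-estimates, so
# (exploration noise aside) it always commits to a single best action.
q_values = np.array([1.0, 1.2, 0.9])
greedy_action = int(np.argmax(q_values))  # always action 1

# Policy-based control: logits parameterize a distribution over actions,
# so the learned policy can be genuinely stochastic (e.g. mixing ~50/50),
# which matchup-style games like rock-paper-scissors require.
logits = np.array([0.1, 0.1, -2.0])
probs = np.exp(logits) / np.exp(logits).sum()
sampled_action = int(rng.choice(len(probs), p=probs))
```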
Re. the bullet list, that’s a little too restrictive, at least in some places—for instance, even if an agent doesn’t know all (or even any) of the laws of physics, in the limit of infinite play action-value-based methods can (I think provably) converge to the true values. (After all, basic Q-learning never even tries to learn the transition function for the environment.)
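As a tiny illustration of that last point, here’s a generic tabular Q-learning update (my own sketch, nothing environment-specific); it bootstraps from sampled transitions and never estimates transition probabilities at all.

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step from a single sampled transition (s, a, r, s_next).
    Note that no model of P(s' | s, a) is learned or consulted anywhere."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```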
I think Sutton & Barto or Bertsekas & Tsitsiklis would cover the complete criteria for Q-learning to be guaranteed to converge? Although of course in practice, my understanding is that it’s quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow.
The two broad paths to general intelligence—RL and LLMs—both had started to stall by the beginning of 2023.
As Chinchilla had shown, data is just as important as compute for training smarter models. The massive increase in the performance of LLMs in prior years had occurred because of a one-time increase in data—namely, training on nearly everything interesting that humans have ever written. Unless the amount of high-quality human text could be increased by 10x, this leap in performance would never happen again. Attempts to improve the behavior of models by pulling text from YouTube with Whisper made the simulacra within the models much better YouTubers—but only marginally better agents. Given that even the largest language models struggle to learn long-tail knowledge reflected in fewer than 100 documents, the inability to 10x and 100x high-quality data proved an immense blocker.
Furthermore, text data ultimately failed to capture many relevant human skills, because many human skills are not captured by text. So in early 2024, GPT-4 was again an improvement over GPT-3, but was no closer to being a junior software developer than GPT-3 -- it could not read a Jira ticket, glance at the Figma files, ping the designer to resolve an ambiguity, make the changes and screenshot them for the PR, and so on. Massive bureaucracies and agglomerations with other models tried to accomplish all these tasks, but immediately introduced massive amounts of human engineering and didn’t work very well. So it remained good at copywriting and similar tasks but just wasn’t that useful broadly.
RL similarly stalled. Though RL over massive collections of policy-optimized agents could produce behavior of reasonable generality, such RL methods proved extremely inefficient at producing that generality. There were things such as the 2019 DeepMind StarCraft victory, but that had taken literal centuries of training time; and though more efficient algorithms could be applied to toy problems such as Atari, nothing seemed to work in circumstances where essentially infinite data could not be generated.
More to the point, RL remained limited to (1) episodic circumstances over (2) a fixed distribution, (3) using models that inevitably had no understanding of the domain of the real world, and (4) operating in distinct training / inference modes. All of this is different from what an AGI would require, and very little progress had been made on these missing ingredients.
In the following years, DL would of course make a lot of progress. It would take many artists’ jobs. DeepMind achieved superhuman mathematical performance in 2026, which had a host of applications and transformed the field of mathematics. The successful and apparently true physics “theory of everything” was developed by a research laboratory in China in 2028. But, as in the cases of drug discovery, image diffusion, and text generation, these successes happened in static domains, with carefully human-designed loss functions. DL which could deal with the real world remained absent.
Generally, I don’t think it’s good to gate “is subquestion X, related to great cause Y, true?” with questions about “does addressing this subquestion contribute to great cause Y?” Like I don’t think it’s good in general, and don’t think it’s good here.
I can’t justify this in a paragraph, but I’m basing this mostly off “Huh, that’s funny” being far more likely to lead to insight than “I must have insight!” Which means it’s a better way of contributing to great causes, generally.
(And honestly, at another level entirely, I think that saying true things, which break up uniform blocks of opinion on LW, is good for the health of the LW community.)
Edit: That being said, if the alternative to following your curiosity on one thing is like, super high value, ofc it’s better. But meh, I mean I’m glad that post is out there. It’s a good central source for a particular branch of criticism, and I think it helped me understand the world more.
Yes, and to expand only slightly: Coordinating against dishonest agents or practices is an extremely important part of coordination in general; if you cannot agree on removing dishonest agents or practices from your own group, the group will likely be worse at accomplishing goals; groups that cannot remove dishonest instances will be correctly distrusted by other groups and individuals.
All of these are important and worth coordinating on, which I think sometimes means “Let’s condemn X” makes sense even though the outside view suggests that many instances of “Let’s condemn X” are bad. Some inside view is allowed.
It’s not a counter-argument to the post in its entirety, though—it’s a counter-argument to the recommendation from the Twitter post that we de-escalate, no? Specifically, it’s not a counter-argument to the estimated odds of nuclear war if we don’t de-escalate.
Two things can be true at once:
1. Not seeking a complete Russian defeat runs a 1-in-6 chance of nuclear war—or say 1-in-N for the general case.
2. Not seeking a complete Russian defeat means that we’ve responded partially to blackmail in a game-theoretically nonoptimal fashion, which means we have M% increased odds of nuclear proliferation in the future and correspondingly O% increased odds of nuclear war in a 50-year time horizon.
But like—these can both be true! Doing the game-theoretic thing where you don’t respond to blackmail means that you might suffer the consequences of not responding to blackmail, especially if your opponent is feeling vindictive, or did not anticipate your not responding to his blackmail, or feels the need to show his resolve for further iterations of his blackmail game.
It’s possible for you to not respond to blackmail because you have a general principle of not doing so and then for nuclear war to happen as a result.
I don’t know if you’re intentionally recapitulating this line of argument, but C.S. Lewis makes this argument in Miracles. There’s a long history of the back-and-forth on Wikipedia.
I don’t think it works, mostly because the fact that a belief is the result of a physical process doesn’t tell me anything at all about the rationality / irrationality of the belief. Different physical processes should be judged differently; some are entangled with the resulting state of belief and others aren’t.
One slightly counterintuitive thing about this paper is how little it improves on the GSM8K dataset, given that it does very well on relatively advanced test sets.
GSM8K (Grade School Math 8K) is a bundle of problems suitable for middle-schoolers. It has problems like:
“Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?”
“Randy has 60 mango trees on his farm. He also has 5 less than half as many coconut trees as mango trees. How many trees does Randy have in all on his farm?”
Minerva improves the SOTA on this, but only moves it from 74.5% to 78.5%, which is not as big of a deal.
My innate / naive sense of how hard the MATH problems are would lead me to think you could get > 90% on GSM8K if you could get 50% on MATH. But obviously my gut sense is off.
I’d be really curious to know what’s going on here.
I’m curious what kind of blueprint / design docs / notes you have for the voluntarist global government. Do you have a website for this? Is there a governmental-design discord discussing this? What stage is this at? etc.
The article title here is hyperbolic.
The title is misleading in the same way that calling AlphaStar “a Western AI optimized for strategic warfare” would be misleading. Should we also say that the earlier Western work on Doom—see VizDoom—was about creating “agents optimized for killing”? That was work on an FPS as well. This is just more of the same—researchers trying to find interesting video games to work on.
This work transfers with just as much ease / difficulty to real-world scenarios as AI work on entirely non-military-skinned video games—that is, it would take enormous engineering effort, and any use in military robots would be several levels of further work removed, such that the foundation of a military system would be very different. (I.e., military robots can’t work with behavioral cloning based on absolutely unchanging + static environments / maps, with clean command / movement relations, for many reasons.) Many researchers’ work on navigating environments—though not military-themed—would be just as applicable.
For investigation of the kind of thing you suggest, take a look at Anthropic’s “A General Language Assistant as a Laboratory for Alignment” and more importantly “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback”.
They focus on training a helpful / harmless assistant rather than on good short stories, but using human-filtered model output to improve behavior is the basic paradigm.
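If it helps, here’s a rough sketch of the simplest version of that paradigm (best-of-n filtering, then fine-tuning on the kept samples); `generate`, `finetune`, and `human_score` below are hypothetical stand-ins for a real model API and a real labeling step, not anything taken from the Anthropic papers.

```python
import random

def human_score(text):
    # Stand-in for the expensive part: a human (or a learned preference model)
    # rating each candidate completion.
    return random.random()

def improve_with_human_filtering(generate, finetune, prompts, n_candidates=4):
    """One round of: sample several outputs per prompt, keep the human-preferred
    one, and fine-tune on the kept (prompt, output) pairs."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        best = max(candidates, key=human_score)
        kept.append((prompt, best))
    finetune(kept)  # the filtered outputs become the next round's training data
```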
I’m quite unsure as well.
On one hand, I have the same feeling that it has a lot of weirdly specific, surely-not-universalizing optimizations when I look at it.
But on the other—it does seem to do quite well on different envs, and if this wasn’t hyper-parameter-tuned then that performance seems like the ultimate arbiter. And I don’t trust my intuitions about what qualifies as robust engineering v. non-robust tweaks in this domain. (Supervised learning is easier than RL in many ways, but LR warm-up still seems like a weird hack to me, even though it’s vital for a whole bunch of standard Transformer architectures and I know there are explanations for why it works.)
Similarly—I dunno, human perceptions generally map to something like log-space, so maybe symlog on rewards (and on observations (?!)) makes deep sense? And maybe you need something like the gradient clipping and KL balancing to handle the not-iid data of RL? I might just stare at the paper for longer.
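(For what it’s worth, the transform itself is tiny: as I read the paper, it’s just sign(x) * ln(1 + |x|) applied elementwise, with the inverse used to map predictions back to the original scale.)

```python
import numpy as np

def symlog(x):
    # Squashes large magnitudes roughly logarithmically while staying
    # symmetric around zero and near-identity for small values.
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # Inverse of symlog; maps predictions back to the original scale.
    return np.sign(x) * np.expm1(np.abs(x))
```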