Gato as the Dawn of Early AGI
Written in a hurry today at the EA UCLA AI Timelines Workshop. Long and stream-of-thought, and a deliberate intellectual overreach as an epistemic exercise. My first foray into developing my own AGI timelines model without deferring! Please, I beg of you, tell me why I’m wrong in the comments!
Epistemic status: Small-N reasoning. Low confidence, but represents my standing understanding of AGI timelines as of now.
This exchange caught my eye a couple days ago:
Would it be fair to call this AGI, albeit not superintelligent yet?
Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.
Yes. Sub-human-level AGI.
If true, this is a huge milestone!
Here I’m combining thinking about this with thinking about AGI 10 years hence. The latter forecasting task is totally different if we have a form of AGI as of two days ago, even an admittedly weak form.
Do We Have “Subhuman AGI” as of Two Days Ago?
If I want to forecast AGI 10 years out, I first want to understand current-year AGI. Do we currently have AGI? What does it look like? What hyperparameters, up to and including overall architecture, might we push on and make progress on in the coming decade?
We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL) (p. 2).
Gato is small, parameters-wise. At 1.2 billion parameters, it’s roughly 1/100th the size of the largest GPT-3 model and 1/500th the size of PaLM. This is a deliberate choice driven by hardware constraints: larger models certainly could have been trained here, but they would not be able to operate real-world robotic limbs in real time. Additionally, the tasks on which Gato was trained were chosen somewhat incidentally: they could easily have been otherwise. So Gato is a miniaturized version of models to come in the near future.
After converting data into tokens, we use the following canonical sequence ordering.
Text tokens in the same order as the raw input text.
Image patch tokens in raster order.
Tensors in row-major order.
Nested structures in lexicographical order by key.
Agent timesteps as observation tokens followed by a separator, then action tokens.
Agent episodes as timesteps in time order (p. 3).
Gato is a large transformer with a single sense-modality: it receives diverse kinds of inputs pressed into a sequence of tokens, and outputs more tokens. This is just applying the current successful language-model architecture to varied-domain problem solving in the naïve way.
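For the agent-data parts of this canonical ordering, a toy serializer might look as follows. This is my illustration, not the paper's code; the token values and separator are placeholders, and text, image-patch, and tensor tokens would be flattened by analogous rules before concatenation:

```python
def serialize_timestep(obs_tokens, action_tokens, sep_token=-1):
    """One agent timestep: observation tokens, then a separator, then action tokens."""
    return obs_tokens + [sep_token] + action_tokens

def serialize_episode(timesteps):
    """An agent episode: its timesteps concatenated in time order."""
    sequence = []
    for obs_tokens, action_tokens in timesteps:
        sequence.extend(serialize_timestep(obs_tokens, action_tokens))
    return sequence
```

Once everything is a flat token sequence like this, the transformer itself never needs to know which modality a token came from.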
Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing e.g. one-hot task identifiers, we instead … use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.
…Because agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of 𝐿 tokens from the available episodes. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (p. 4).
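The subsequence-sampling and domain-mixing step just quoted might look roughly like this. A toy sketch under my own assumptions about data layout; the manual upweighting is represented by an explicit weights argument:

```python
import random

def sample_subsequence(episode_tokens, L):
    """Randomly sample a contiguous window of up to L tokens from one episode."""
    if len(episode_tokens) <= L:
        return episode_tokens
    start = random.randrange(len(episode_tokens) - L + 1)
    return episode_tokens[start:start + L]

def sample_batch(domains, weights, batch_size, L):
    """Build a batch: pick a domain (with manual upweighting), then an episode
    within it, then a length-L subsequence of that episode."""
    batch = []
    for _ in range(batch_size):
        domain = random.choices(list(domains), weights=weights)[0]
        episode = random.choice(domains[domain])
        batch.append(sample_subsequence(episode, L))
    return batch
```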
Ideally, the agent could potentially learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent’s parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model’s performance in the environment (p. 11).
A major part of the guts of the model is its use of prompt conditioning. Context-length limits don’t prevent training a high-performance Gato, but they do prevent us from fully testing the out-of-the-box model’s few-shot generalization abilities.
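The prompt-conditioning scheme quoted above (25% of training sequences get a prepended prompt from the same task, half drawn from episode ends as goal conditioning) can be sketched roughly as follows. This is an illustration, not the paper's implementation, and `prompt_len` is a made-up knob:

```python
import random

def maybe_prepend_prompt(sequence, episodes_same_task, p=0.25, prompt_len=32):
    """With probability p, prepend a prompt drawn from another episode of the
    same source agent on the same task. Half the time the prompt comes from
    the end of that episode (goal conditioning); otherwise from a uniformly
    sampled position."""
    if random.random() >= p or not episodes_same_task:
        return sequence
    episode = random.choice(episodes_same_task)
    if random.random() < 0.5:
        prompt = episode[-prompt_len:]                      # end of episode
    else:
        start = random.randrange(max(1, len(episode) - prompt_len + 1))
        prompt = episode[start:start + prompt_len]          # uniform sample
    return prompt + sequence
```

At evaluation time, the analogous move is prepending a successful demonstration of the desired task as the prompt.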
Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments. For each environment we record a subset of the experience the agent generates (states, actions, and rewards) while it is training (p. 5).
Gato is trained to do RL-style tasks by supervised learning on token sequences generated from state-of-the-art RL model performance. These tasks take place in both virtual and real-world robot arm environments.
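Concretely, “supervised learning on expert trajectories” here just means next-token negative log-likelihood over the recorded expert tokens, the same objective as a language model. A toy version (my illustration; the dict-of-log-probabilities representation is purely for exposition):

```python
import math

def next_token_nll(logprobs, targets):
    """Behavioral-cloning-style objective: average negative log-likelihood of
    each expert token given the preceding context. logprobs[t] maps a token
    id to its log-probability under the model at position t."""
    return -sum(lp[t] for lp, t in zip(logprobs, targets)) / len(targets)
```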
Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained on 35k episodes of all test triplets. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert’s performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert. After this point (at 5000), performance degrades slightly but does not drop far below the expert’s performance (p. 12).
A crucial datapoint for the subhuman AGI question! Gato, after a small amount of fine-tuning, catches up with SOTA RL expert models. It does this even with real-world colored-block stacking tasks; Gato is capable of interfacing with the physical world, albeit in a controlled environment.
Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), pushing the limits of generalist language models; and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. Chowdhery et al. (2022) developed the 540B parameter Pathways Language Model (PaLM) explicitly as a generalist few-shot learner for hundreds of text tasks. Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments (p. 15).
In this work we learn a single network with the same weights across a diverse set of tasks.
Recent position papers advocate for highly generalist models, notably Schmidhuber (2018) proposing one big net for everything, and Bommasani et al. (2021) on foundation models. However, to our knowledge there has not yet been reported a single generalist trained on hundreds of vision, language and control tasks using modern transformer networks at scale.
“Single-brain”-style models have interesting connections to neuroscience. Mountcastle (1978) famously stated that “the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence (Hawkins and Blakeslee, 2004).
Sensory substitution provides another argument for a single model (Bach-y Rita and Kercel, 2003). For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. Suggesting that, no matter the type of input signal, the same network can process it to useful effect (p. 16).
There has been great recent interest in data-driven robotics (Cabi et al., 2019; Chen et al., 2021a). However, Bommasani et al. (2021) note that in robotics “the key stumbling block is collecting the right data. Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments”. Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with few data (p. 17).
Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. In the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch (p. 18).
Finally, descriptions of the range of tasks Gato is trained on:
We collect two separate sets of Atari environments. The first (that we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (that we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.
For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.
Sokoban is a planning problem (Racanière et al., 2017), in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ahead of time is therefore necessary to succeed at this puzzle. We use a Muesli (Hessel et al., 2021) agent to collect training data.
BabyAI is a gridworld environment whose levels consist of instruction-following tasks that are described by a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions, see Section C in the appendix of (Chevalier-Boisvert et al., 2018) for more details about the bot. We collect 100,000 episodes for each level.
F.4. DeepMind Control Suite
The DeepMind Control Suite (Tassa et al., 2018; Tunyasuvunakool et al., 2020) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels. We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.
We also collect data for randomized versions of the control suite tasks with a D4PG agent. These versions randomize the actuator gear, joint range, stiffness, and damping, and geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].
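Since the two intervals within each difficulty setting have equal width, sampling uniformly over their union reduces to picking an interval at random and sampling uniformly within it. A minimal sketch (my code, not the paper's):

```python
import random

def sample_scale(difficulty="small"):
    """Sample a randomization scale factor from the union of two equal-width
    intervals, as in the randomized control-suite variants."""
    intervals = {
        "small": [(0.9, 0.95), (1.05, 1.1)],
        "large": [(0.6, 0.8), (1.2, 1.4)],
    }[difficulty]
    lo, hi = random.choice(intervals)  # equal widths, so this stays uniform
    return random.uniform(lo, hi)
```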
F.5. DeepMind Lab
DeepMind Lab (Beattie et al., 2016) is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, as well as navigation and planning.
We collect data for 255 tasks from DeepMind Lab, 254 of which are used during training; the held-out task is used for out-of-distribution evaluation. Data is collected using an IMPALA (Espeholt et al., 2018) agent that has been trained jointly on a set of 18 procedurally generated training tasks. Data is collected by executing this agent on each of our 255 tasks, without further training.
F.6. Procgen Benchmark
Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. Data collection was done while training a R2D2 (Kapturowski et al., 2018) agent on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.
F.7. Modular RL
Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies. We collected data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.
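The morphology-enumeration rule (keep every subset of limbs that contains the torso and still forms a connected graph) can be sketched directly. Illustrative code only; the limb names and edge-list graph representation are my assumptions:

```python
from itertools import combinations

def valid_morphologies(limbs, edges, torso="torso"):
    """Enumerate subsets of limbs that contain the torso and remain connected."""
    def connected(subset):
        subset = set(subset)
        seen, frontier = {torso}, [torso]
        while frontier:                      # flood-fill from the torso,
            node = frontier.pop()            # restricted to the chosen subset
            for a, b in edges:
                for u, v in ((a, b), (b, a)):
                    if u == node and v in subset and v not in seen:
                        seen.add(v)
                        frontier.append(v)
        return seen == subset
    result = []
    for r in range(1, len(limbs) + 1):
        for subset in combinations(limbs, r):
            if torso in subset and connected(subset):
                result.append(subset)
    return result
```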
F.8. DeepMind Manipulation Playground
The DeepMind Manipulation Playground (Zolna et al., 2021) is a suite of MuJoCo based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) agent (Wang et al., 2020) trained from images on human demonstrations. The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato.
Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training a MPO agent (Abdolmaleki et al., 2018) with unlimited environment seeds and with access to state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.
The specialist Meta-World agent described in Section 5.5 achieves 96.6% success rate averaged over all 50 Meta-World tasks. The detailed success rates are presented in Table 7. We evaluated the agent 500 times for each task (pp. 36-7, 39-40).
Where the bar for impressive-but-subhuman performance is set by other models (which may have architectures very unlike Gato’s), Gato is a subhuman AGI. Gato generalizes to previously held-out tasks, including real-world robotics tasks, after only 10 episodes of fine-tuning. (Fine-tuning is needed only because of context-window limits; with a larger window, we could test the model’s few-shot, in-context learning across these varied domains.)
As with GPT-3, one scary feature of Gato’s success is that its architecture and hyperparameters aren’t strongly optimized for what it does. It’s basically a (relatively small!) large language model pressed into service as a generalist, and that just works. AGI is here, and it wasn’t that hard to engineer. Quoth Gwern, “Scaling just works.”
A Decade of Actual AI
My guess is that this heralds the beginning of “actual AI,” meaning AI reaching out into the world of atoms and not merely the world of bits. I don’t mean to disparage progress in ML; if you’re anything like me, the visceral impressiveness of GPT-2 and -3 is a big part of why you’re throwing yourself into trying to help with alignment! But I was promised a flying car in my childhood, dammit. I remember reading a kid’s science magazine that promised me I’d be commuting via space elevator by now! Will we get our household robots anytime soon?
If a slapped-together, relatively small transformer like Gato can, with a minimum of fine-tuning (10 episodes), generalize well to previously unseen robotics tasks, then Gato’s descendants, heavily optimized for success, can plausibly do much better. For Bayesian reasons, the most important part of a secret is that the secret exists, and the genie is now out of the bottle regarding naïve transformers and their potential varied applications.
From an alignment perspective, this is horrifying. The worlds in which we can marshal sufficient Coordination between AI labs to prevent our doom … are worlds in which there aren’t a hundred disparate actors all rushing to AGI because there’s gold in them thar hills. But creating common knowledge of that seems to be what just happened, emboldening efforts to pursue and market, e.g., AI personal assistants.
Extrapolating from the past decade of successes in deep learning (recalling that transformers only date back to 2017), we should naively expect the equivalent-in-impressiveness of the GPT series in other domains, including real-world robotics. Some spitballing implications:
We should expect fully competent self-driving to be solved.
We should expect customer-service chatbots to be solved: AI won’t pass the adversarial Turing test by 2032, but it will pass the average-case Turing test, and so be ready for deployment in relatively low-stakes conversational roles.
Factory robots get much better, sufficient to work in complicated domains like households and fast-food restaurants.
The internet learns semantics, and so websites take on forms much more interesting than static text, image, and video elements.
We begin seriously using computers via input channels other than keyboard-and-mouse, as those are currently blocked on ML interpretation of messy human input.
Shallow Pattern Matching in the World of Atoms
While Gato constitutes a kind of subhuman AGI, it is nothing like human-grade AGI. Fundamentally, since its architecture has not substantially changed from, e.g., GPT-3’s, Gato inherits GPT-3’s intrinsic limits.
Nostalgebraist on the GPT series’ capabilities:
I don’t even know how many tens of thousands of LM samples I’ve read by now. (Just my bot alone has written 80,138 posts—and counting—and while I no longer read every new one these days, I did for a very long time.)
Read enough, and you will witness the LM both failing and succeeding at anything your mind might want to carve out as a “capability.” You see the semblance of abstract reasoning shimmer across a string of tokens, only to yield suddenly to absurd, direct self-contradiction. You see the model getting each fact right, then wrong, then right. I see no single, stable trove of skills being leveraged here and there as needed. I just see stretches of success and failure at imitating ten thousand different kinds of people, all nearly independent of one another, the products of barely-coupled subsystems.
This is hard to refute, but I think this is something you only grok when you read enough LM samples—where “enough” is a pretty big number.
GPT makes many mistakes, but many of these mistakes are of types which it only makes rarely. A mistake the model makes only every 200 samples, say, is invisible upon one’s first encounter with GPT. You don’t even notice that the model is “getting it right,” any more than you would notice a fellow human “failing to forget” that water flows downhill. It’s just part of the floor you think you’re standing on.
The first time you see it, it surprises you, a crack in the floor. By the fourth time, it doesn’t surprise you as much. The fortieth time you see the mistake, you don’t even notice it, because “the model occasionally gets this wrong” has become part of the floor.
Eventually, you no longer picture a floor with cracks in it. You picture a roiling chaos which randomly, but regularly, coalesces into ephemeral structures possessing randomly selected subsets of the properties of floors.
Large language models today sit in a weird uncanny valley of ability, where they are both shockingly good at writing (GPT-2 at its best writes a B-grade high-school history essay) and liable to pick up the idiot ball at moments no human would (see nostalgebraist on GPT-3’s inhuman metafictional tendencies). Gato exports this uncanny valley of competence, which no human occupies, into the world of atoms.
In the same way that PaLM cannot pass an adversarial Turing test, a correspondingly scaled-up Gato won’t successfully control a humanoid robot through arbitrary unfamiliar physical tasks in unfamiliar domains. But Gato still exports a whole lot of competence into the physical world (and into a whole host of varied tasks in virtual environments too). Even if large language models and scaled-up Gato peter out at some point, the lack of intense optimization work put into them so far suggests that we haven’t come close to mining out this capabilities vein, and we should expect nostalgebraist’s “cracks in the floor” to narrow, and often close up, over the coming decade of AI capabilities progress.
A Milestone and a Plea
I may be totally off-base here. This summary and projection is built on my very limited model of AI capabilities. I hope to God I’m just confused, and am eager to update my model.
But if subhuman AGI is here, and if we’re kicking off the final race to human-grade AGI now, even just the very beginning of it … then timelines are extraordinarily short. Others mentioned in earlier Gato posts that their models predicted something like this; reading the Gato paper, I realize that my model (insofar as I had one) was surprised by this. I am scared and have shrunk my timelines.
Please, let’s do something about AGI alignment.
Admittedly a bit of abuse of terminology, if by “AGI” we usually mean human-equivalent AI across a range of diverse task domains. By “subhuman AGI” here, I mean an AI that performs a wide range of disparate tasks at levels only modestly below typical human performance.
Assuming meaningful ML applications to the world of atoms aren’t completely forbidden by regulation in advance of their deployment. I frame everything below with this caveat in mind.