My next guess would be that they ran the PPO agent for many more episodes than the 31 shown, and trained the GLA on all of that data.
This was my read too. Unfortunately we don’t have access to the source code, but this is the assumption I made after seeing the graph on the left in Figure 3. Around 40 episodes in, their PPO agent is still struggling, but their Gap 8 GLA is near optimal. That Gap 8 GLA was necessarily trained on data from a PPO agent that ran for 8 times longer.
You’re right, I misread the graph.
I also concede that this claim is probably right for Figure 3.
I still don’t think this is true for Figure 5, but I’m less confident now, having realised how much my reading of the underspecified parts of this paper rested on assumptions carried over from their GPICL paper.