I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
“Act-based approval-directed agents”, for IDA skeptics
Is the reason that you think it could work for a minute but not 100yr a practical matter of efficiency or one that has a more fundamental limitation that you couldn’t get around with infinite context window/training data/etc?
The “one minute” thing is less about what LLMs CAN do in one minute, and more about what humans CAN’T do in one minute. My claim would be that humans have a superpower of “real” continual learning, which nobody knows how to do with LLMs. But if you give a human just 60 seconds, then they can’t really use that superpower very much, or at least, they can’t get very far with it. It usually takes much more than one minute for people to build and internalize new concepts and understanding to any noticeable degree.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
Yes, even with a context window that contains all 10M moves. Making that argument was the whole point of the second half of the OP. If you don’t find that convincing, I’m not sure what else to add. ¯\_(ツ)_/¯
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, (2) human learning and thinking is part of that family (although we don’t know a priori which one), and (3) you can take some adult human “Joe”, and search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Grog understands some things (e.g. intuitive physics) but not others (e.g. pulsed lasers). Likewise, LLMs understand some things (e.g. pulsed lasers) but not others (e.g. some new field of science that hasn’t been invented yet). Right? We’re not at the end of history, where everything that can possibly be understood is already understood, and there’s nothing left.
If I hibernated you until the year 2100, and then woke you up and gave you a database with “actionable knowledge” from 1000 textbooks of [yet-to-be-invented fields of science], and asked you to engineer a state-of-the-art [device that no one today has even conceived of], then you would be just as helpless as Grog. You would have to learn the new fields until you understood them, which might take years, before you could even start on the task. This process involves changing the “weights” in your brain. I.e., you would need “real” learning. The database is not a replacement for that.
So think of it this way: there’s some set of things that are understood (by anyone), and that set of things is not increased via a system for pulling up facts from a database. Otherwise Grog would be able to immediately design LIDAR. And yet, humans are able to increase the set of things that are understood, over time. After all, “the set of things that are understood” sure is bigger today than it was 1000 years ago, and will be bigger still in 2100. So evidently humans are doing something very important that is entirely different from what can be done with database systems. And that thing is what I’m calling “real” continual learning.
The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used.
(still talking about this paper) Are you saying that the GLA was trained ONLY on imitation learning during the 31 episodes shown, in which the PPO “teacher” performed no better than a random policy, and then the GLA got way higher scores?
If so … no way, that’s patently absurd. Even if I grant the premise of the paper for the sake of argument, the GLA can’t learn to improve itself via imitating a PPO teacher that is not actually improving itself!
So, if the right-side-of-figure-3 data is not totally fabricated or mis-described, then my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that, and that by the end of the training data, the PPO “teacher” was performing much better than shown in the figure, and at least as well as the top of the GLA curve.
I’m pretty confused. This comment is just trying to get on the same page before I start arguing :-)
I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual learning objective (e.g. RL, self-distillation, whatever), or are the weights being updated by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn.
My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
Does that match what you’re trying to say here?
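To make that interpretation concrete, here’s a toy numerical sketch of the “imitation learning homes in on which family member matches the teacher” idea. The one-parameter family (an exponential-moving-average tracker parametrized by its learning rate) and all the numbers are purely illustrative, not anything from the papers under discussion:

```python
import numpy as np

# Hypothetical toy "parametrized family of continual learning algorithms":
# exponential-moving-average trackers, parametrized by a learning rate lr.
def run_learner(lr, stream):
    est, preds = 0.0, []
    for x in stream:
        preds.append(est)      # predict before seeing the new observation
        est += lr * (x - est)  # continual-learning update
    return np.array(preds)

rng = np.random.default_rng(0)
stream = rng.normal(1.0, 0.5, size=500)

teacher_lr = 0.13  # unknown to us; we only observe the teacher's behavior
teacher_preds = run_learner(teacher_lr, stream)

# "Imitation learning" here = search the family for the member whose
# behavior best matches the teacher's observed predictions.
grid = np.linspace(0.01, 0.5, 200)
losses = [np.mean((run_learner(lr, stream) - teacher_preds) ** 2) for lr in grid]
best_lr = grid[int(np.argmin(losses))]
print(best_lr)  # recovers something close to the teacher's 0.13
```

The point being illustrated: imitation can recover *which* continual learner the teacher is, precisely because the true answer was already inside the hypothesized family.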
I mentioned in a footnote that the “algorithmic distillation” paper (Laskin et al. 2022) was misleading, as discussed here. Your links are in the same genre, and I’m pretty skeptical of them too. Also confused.
I mostly tried to read your first suggestion, Towards General-Purpose In-Context Learning Agents.
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
The transfer-learning thing (fig 5) is hard to interpret. What does “not randomized” mean? Why does PPO start at zero and then immediately get worse in the bottom-left one? What would be the “test return” for a random policy, or the no-op policy, or any other relevant baseline, for all four of these? Why is their PPO so bad? Were they using crappy PPO hyperparameters to make GLA look better by comparison? How many other environments did they try but bury in their file drawer? Why is their source code not online? The curves just generally look really unconvincing to me, and my gut reaction is that they were just flailing around for something to publish, because their exciting claim (meta-learning) doesn’t really work.
I could be wrong, perhaps you’re more familiar with this literature than I am.
Thanks!
I must imagine that there’s some neuroscience literature on sexual attraction, where brain region activations are cross-referenced with self-reported feelings of attraction, and referencing this would help support the point.
Alas it’s much less useful than you’d think, at least for the kinds of questions that I’m interested in. My view is it’s extremely difficult to learn anything useful from fMRI studies, at least for this kind of question. I think the important nuts-and-bolts questions would be answerable by measuring the activity and interconnectivity of tiny cell groups in the hypothalamus, but that’s not experimentally possible as of today.
(fMRI is not helpful for that: the relevant cell groups are all too small and physically proximate (sometimes even intermingled) to tell them apart by location as opposed to by receptor expression etc., and it’s moot anyway because fMRI just can’t measure the human hypothalamus at all, it’s too close to a major artery or something, I forget.)
The claim “modern LLMs can pursue goals and act like agents” does not contradict the claim “modern LLMs get their capabilities primarily from imitative learning”, right? Because there are examples in the pretraining data (including in the specially-commissioned proprietary expert-created data) where human-created text enacts the pursuit of goals. Right? See also here.
And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows.
I do think LLMs struggle when they depart from what’s in the pretraining data, but the meaning of that is a bit tricky to pin down. Like, if I ask you to “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”, you can do that in a fraction of a second, and correctly answer follow-up questions about that scenario, even though this specific mental image has never happened before in the history of the world. And LLMs can do that kind of thing too. Is that “departing from what’s in the pretraining data”? My answer is: No, not in the sense that matters. When you say your LLM “make[s] conclusions that no human knows”, I suspect it’s a similar kind of thing: it’s not “departing from the pretraining data” in the sense that matters. Indeed, anything that a third party can simply read and immediately understand is not “departing from the pretraining data” in the sense that matters, even if the person didn’t already know it.
By contrast, if you don’t know linear algebra, you can’t simply read a linear algebra textbook and immediately understand it. You need to spend many days and weeks internalizing these new ideas.
Anyway, in the post, I tried to be maximally clear-cut, by using the example of billions of humans over thousands of years inventing language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch, without angels dropping new training data from the heavens. I very strongly don’t believe that billions of LLMs over thousands of years, in a sealed datacenter without any human intervention or human data, could do that. That would be real departure from the pretraining data. See also here and the first section here.
FYI I just wrote a post You can’t imitation-learn how to continual-learn which is related to this thread.
As an analogy, take an adult from 30000 BC, call him Grog, and give him access to a database of “actionable knowledge of 1000 textbooks”, and then tell him to go invent a less expensive solid-state LIDAR system. Will he immediately start making progress? I say “obviously not”.
What would the “actionable knowledge” look like? Maybe one piece of “actionable knowledge” is some fact from the ANSI Z136.1 laser eye safety manual (“For pulsed lasers of 1ns–50μs pulse duration and beam diameter 1 cm, at viewing distance 20 cm, the diffusely reflected beam energy cannot safely exceed 0.022 × CA joules, where CA is the correction factor for IR-A light based on reduced absorption properties of melanin”.) OK, Grog looks at that and immediately has some questions. What does “laser” mean? What is a “pulsed laser”? What does “ns” mean? What does “beam diameter” mean? What does “diffusely reflected” mean? Etc. etc.
This “knowledge” is not in fact “actionable” because Grog can’t make heads or tails of it.
And ditto for pretty much every other item in the database. Right?
What Grog would need to do is spend years developing a deep understanding of optics and lasers and so on before he could even start inventing a new LIDAR system. Of course, that’s what modern LIDAR inventors do: spend years developing understanding. Once Grog has that understanding, then yeah sure, convenient database access to relevant facts would be helpful, just as modern LIDAR inventors do in fact keep the ANSI Z136.1 manual in arm’s reach.
Thus, there’s more to knowledge than lists of facts: it’s also the ways that the facts all connect to each other in an interconnected web, the ways to think about things, etc.
I claim that this all transfers quite well to LLMs. It’s just that LLMs already have decent “understanding” of everything that humans have ever written down anywhere on the internet or in any book, thanks to pretraining. So in our everyday interactions with LLMs, we don’t as often come across situations where the LLM is flailing around like poor Grog. But see 1, 2.
Those are cool ideas, but I don’t think they qualify as (what I’m calling) “real” continual learning, as defined in the section “Some intuitions on how to think about ‘real’ continual learning”.
See everything I wrote in the section “Some intuitions on how to think about ‘real’ continual learning”. The thing you’re describing is definitely not (what I’m calling) “real” continual learning.
Should the thing you’re describing be called “continual learning” at all? No opinion. Call it whatever you want.
No opinion about (a) and (b), but Bayesian inference can only do as well as its hypothesis space allows, and I think the true hypothesis here is WAY outside the hypothesis space, regardless of context size. That’s what I was trying to get across with that table I put into the OP.
So maybe that’s (c), but I don’t really know what you mean by “capacity” in this context.
Your last paragraph sounds to me like brainstorming how to build a continual learning setup for LLMs. As I mentioned at the bottom, such a system might or might not exist, but that would be out of scope for this post. If something in that genre worked, the “continual learning” in question would be coming from PyTorch code that assembles data and runs SGD in a loop, not from imitation learning, if I’m understanding your text correctly.
You can’t imitation-learn how to continual-learn
(Thinking out loud.) I have really been unimpressed with LLM-assisted writing I’ve seen to date (and yes that includes “cyborg” writing from established users), and would be happy to see it banned entirely (maybe with exceptions for straightforward audio transcription and machine translation). Especially given the “second-order effects on culture” that Raemon mentioned here. Like, LLMs help people write, but removing friction sometimes makes things worse, not better.
Then I was thinking: Is there any situation where I would use an LLM block myself? Hmm, maybe for “boilerplate” explanations of well-known background information—the same kinds of situations where I might otherwise block-quote from a textbook.
Well anyway, the current system seems OK. I guess the idea is that the blocks are subtle enough that people will feel little hesitation in using them, which is good, because then I’ll know who to ignore :-P
Thanks!
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any.
For one thing, my headline claim is “ruthless sociopath”, not “maximizing”. “Ruthless sociopath” is pointing to something that’s missing (intrinsic concern for the welfare of other people), not something that’s present (behaviors that maximize something in the world).
For another thing, strictly speaking, perfect maximization is impossible without omniscience.
For another thing, if a powerful ASI cares about increasing staples, and also paperclips, and also any number of other office supplies, that doesn’t help us; it will still wipe out humanity and create a future devoid of value. Indeed, even maximizers can “care” about multiple things. E.g., if a utility-maximizer has utility function U = log(log(staples)) + log(log(paperclips)), then it will stably split its time between staple and paperclip production forever. [I put in the “log log” to ensure strongly diminishing returns, enough to overcome any economies of scale.]
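A quick numerical sketch of that “log log” point (all names and numbers here are illustrative): the marginal utility of log(log(x)) is 1/(x·log x), which is strictly decreasing, so a greedy maximizer keeps its two tallies essentially tied forever rather than collapsing onto one good:

```python
import math

# Illustrative sketch: with strongly diminishing returns, a maximizer
# splits its effort between goods indefinitely.
def marginal(x):
    # d/dx log(log(x)) = 1 / (x * log(x)), valid for x > 1
    return 1.0 / (x * math.log(x))

staples, paperclips = 2.0, 2.0
for _ in range(100_000):
    # Spend each unit of effort where marginal utility is currently higher
    if marginal(staples) >= marginal(paperclips):
        staples += 1
    else:
        paperclips += 1

print(staples, paperclips)  # the two counts stay within 1 of each other
```

Of course, a stable split between staples and paperclips is no comfort to the humans in the way of either.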
Given it will have the whole of human history as training data, one of the lessons it will have absorbed is that ruthless prioritisation of a single goal tends to provoke counter-coalitions. The smart thing to do is manage within an ecosystem of other AIs and humans, not maximise against them (which is a fraught and unstable pattern).
I agree that a ruthless sociopath agent, one which has callous indifference to whether you or anyone else lives or dies, will nevertheless act kind to you, when acting kind to you is in its self-interest. And then if the situation changes, such that acting kind to you stops being in its self-interest, then it will not hesitate to stab you in the back (betray you, murder you, blackmail you, whatever). And even before that, it will be constantly entertaining the idea of stabbing you in the back, and then deciding that this idea is (currently) inadvisable, and thus continuing to act kindly towards you.
Hopefully we can agree that this is not a description of normal human relations.
…But even if this is not normal human relations, one could argue that it’s fine, because we can still build a good healthy civilization out of AIs that all have this kind of disposition. And indeed, there are people who make that argument. But I strongly disagree. I was writing about this topic recently, see §5 of my post “6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa”: “The human intuition that societal norms and institutions are mostly stably self-enforcing”.
in the limit of generating very many reasoning traces from a model, fine-tuning the model on this data with full-batch ideal gradient descent does not change the model at all, because it is already the globally (and so also locally) optimal log loss predictor for its own sampling distribution
Hmm, thinking about it more, I think you’re right (no change) if you draw the samples with temperature T=1; and my earlier comment was right (mode collapse, i.e. ever-increasing confidence in the modal next token, approaching 100%) if you draw the samples with temperature 0≤T<1, and repeat enough times. And if you use temperature T>1 then you get, umm, the opposite of mode collapse, where it approaches a uniform probability distribution when you repeat enough times. Right? (I’m not totally sure.) (I agree that this is a fun but irrelevant side-track.)
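In the infinite-sample limit, sampling at temperature T and refitting maps the next-token distribution p to normalize(p^(1/T)), so the three regimes above can be checked directly with a few (illustrative) lines of numpy:

```python
import numpy as np

# Infinite-sample sketch of the temperature argument: sampling a categorical
# distribution p at temperature T, then refitting the model on those samples,
# maps p -> p**(1/T) / sum(p**(1/T)). Iterate and watch the three regimes.
def refit_step(p, T):
    q = p ** (1.0 / T)
    return q / q.sum()

p0 = np.array([0.5, 0.3, 0.2])
for T in (0.8, 1.0, 1.25):
    p = p0.copy()
    for _ in range(200):
        p = refit_step(p, T)
    print(T, np.round(p, 3))
# T < 1: collapses onto the modal token; T = 1: fixed point; T > 1: approaches uniform
```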
You tell your LLM story with Opus 9 selecting data for “pretraining Opus 10” but you could equally well change the story to have Opus 9 selecting data for “further fine-tuning itself”, and now it’s a story of how to get continual learning to work in LLMs. Doesn’t really matter, it amounts to the same thing. I’ll use the fine-tuning / continual learning description here because I think it makes things a bit easier to talk about, but it doesn’t matter, you can translate my next paragraph back into the other frame if you prefer.
Anyway, I don’t think it would work (although obviously we’ll find out one way or the other soon enough). If you want to make progress, it’s not enough for an LLM to “accumulate experience” and then fine-tune on that experience. For example, if an LLM outputs a bunch of tokens, then you fine-tune that very same LLM on those very same tokens, then it won’t make the LLM smarter, it will only cause mode collapse. Instead you would need to do something like: have the LLM try to figure out what’s true by thinking for a while, produce a final artifact, and train only on that artifact but not the thinking trace. That’s not an obviously crazy idea, and maybe it would work a little bit, but I think what would happen eventually is that the LLM would make mistakes, then it would lock in those mistakes by fine-tuning, and then it would have confident wrong ideas that lead it to make more mistakes, etc., and in the long term it would get dumber and dumber, not smarter and smarter. I don’t think this kind of approach can lead to human-like open-ended creation of knowledge, akin to the way that human mathematicians invented math from scratch without proof assistants (see §1.1 of my “Sharp Left Turn” post).
I wrote:
Then you replied:
But now I think you’re conceding that you were wrong about that after all, and in fact this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain, because the GLA agent probably got to see the PPO agent continue to improve beyond the 31 episodes that we see before the figure cuts off.
Right?
Or if not, then you’re definitely misunderstanding my complaint. The fact that the GLA curve rises faster than the PPO curve in the right side of figure 3 is irrelevant. It proves nothing. It’s like … Suppose I watch my friend play a video game and it takes them an hour to beat the boss after 20 tries, most of which is just figuring out what their weak point is. And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
(That would be a natural mistake to make because the paper is trying to trick us into making it, to cover up the fact that their big idea just doesn’t work.)