I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
I don’t think this is related to the points I was making in the post … But happy to chat about that anyway.
Yeah sure, common sense says that smart people will tend to enjoy being in more meritocratic intellectual fields, rather than less meritocratic ones, and also that fields in general tend to be more meritocratic when quality is easy to judge (although other things matter too, e.g. glamorous fields have it tougher because they attract grifters).
See e.g. what I wrote here about experimental science.
The mathematics community has successfully kept the cranks out, as far as I know, but two grimly amusing failures (in my controversial opinion) are: (1) in the 2000s, the (correct) theoretical physics consensus that we should be focusing on string theory was somewhat broken by an invasion of people unable to tell good physics theory from bad (e.g. “loop quantum gravity”), and there were enough of such people (including department chairs etc) that they broke the blockade and wound up with positions and credentials; (2) this funny anecdote in Dan Dennett’s memoir:
The hegemony of the analytic philosophers evaporated in 1979, at the Eastern Division meeting of the APA [American Philosophical Association] in Boston, when a coup d’état was staged by a group of mostly American but Continental philosophers who called themselves pluralists (let a thousand flowers bloom). I wonder how many of today’s young philosophers and graduate students have ever heard about this. It was an academic earthquake at the time. Frustrated by the short shrift given them by members of the “analytic monolith,” these philosophers studied the bylaws of the APA and discovered that although for decades the nominating committee had put forward a single candidate for vice president who was then elected by acclaim and would succeed as president the following year, the rules allowed nominations from the floor and actual elections! In secret, the pluralists put together their slate, prepared their challenges to the parliamentarian and other officers, and made sure their members were all set to descend en masse on the lightly attended business meeting and take over the APA Eastern Division. About half an hour before the meeting, their security broke down: a coup was rumored to be in the offing, and we monolith members were rounded up in the bar and hustled to the meeting to try to fend off the usurpation. Dick Rorty was president that year, and it was an irony (one of his favorite topics) that he—the most ecumenical and open-minded of the “analytic monolith” leaders—presided over the meeting, while Tom Nagel executed his duties as parliamentarian with aplomb. There were nominating speeches and rebuttals, the most memorable of which was by Ruth Marcus, whose Yale colleague John Smith, a philosopher of religion and a theologian, was the pluralists’ candidate. She explicitly trashed his whole career, his character, his books. I had never heard a philosopher speak so ill of a colleague in public, and seldom in private.
We lost. The establishment had nominated Adolf Grünbaum, a Pittsburgh philosopher of science, to be the new vice president. Not wanting to offend innocent Adolf, the victorious pluralists nominated and elected him vice president the following year, so that in 1982 he finally got to deliver the presidential address he had expected to give earlier. He did not accept the olive branch with equanimity. Adolf was famous for his tirades against Freud as an unscientific poseur, and his address was vintage Grünbaum. I happened to follow a cluster of pluralists out of the hall at the close of his address and overheard the reply when a pluralist who had stayed away asked how Grünbaum’s address had gone: “It was nasty, brutish and long.”
Thereafter, the APA’s programs were filled with papers on topics, and by philosophers, that would never have made the cut before the pluralist coup. Was this a good thing? Yes, said some monolith members, since it meant there was more guilt-free time to spend in the bar at conventions. Yes, said others, since the pluralists had justice on their side. My verdict is mixed. Still, the published programs of the APA meetings list dozens of talks whose titles are so ripe for parody that when I recently perused a few looking for likely examples to anonymize, I had difficulty “improving” on the actual candidates, but ask yourself whether you are aching to go to the sessions where the following talks will be given:
“The Ineffability of History and the Problem of the Unitary Self”
“Dialectical Encroachment: Humiliation and Integrity”
“Can Relationalistic Ontology Avoid Incoherence through a Recursive Metatheory?”
“Art as War: The Resilience of Autonomy”
Having said all that…
If your proposal is:
von Neumann and Tao did math-y stuff rather than other stuff because they got adulation when they did math-y stuff and they got heckled by idiots when they did other stuff.
…then I think that’s part of it but not all of it. I would note that they presumably got good at math by thinking about math all the time, and if they were thinking about math all the time, it’s probably because they found it very satisfying and enjoyable to think about math. I have a kid like that—when he was like 8 years old, I might be talking about politics at dinner or whatever, and he would interrupt me to share something he just thought of about perfect squares that he found very exciting. I.e., some people, when their mind is wandering, think about other people, and some people think about sports, and he was evidently thinking about perfect squares. Anyway, if a person intrinsically enjoys thinking about numbers and symbols, then it stands to reason that they would probably choose a career where they get to think about numbers and symbols all day.
I sometimes wonder why physicists were so overrepresented in the AI x-risk community in the early-ish days (e.g. Hawking, Tegmark, Wilczek, Musk, Tallinn, Rees, Omohundro, Aguirre…). The best I can come up with is that people who self-select into physics are unusually likely to have the combination of (1) smart & quantitative, and (2) really, deeply, profoundly bothered by not understanding important things about the world.
“Act-based approval-directed agents”, for IDA skeptics
I wrote:
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
Then you replied:
Yeah. The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used…
But now I think you’re conceding that you were wrong about that after all, and in fact this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain, because the GLA agent probably got to see the PPO agent continue to improve beyond the 31 episodes that we see before the figure cuts off.
Right?
Or if not, then you’re definitely misunderstanding my complaint. The fact that the GLA curve rises faster than the PPO curve in the right side of figure 3 is irrelevant. It proves nothing. It’s like … Suppose I watch my friend play a video game and it takes them an hour to beat the boss after 20 tries, most of which is just figuring out what their weak point is. And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
(That would be a natural mistake to make because the paper is trying to trick us into making it, to cover up the fact that their big idea just doesn’t work.)
Is the reason that you think it could work for a minute but not 100 years a practical matter of efficiency, or is there a more fundamental limitation that you couldn’t get around with an infinite context window/training data/etc.?
The “one minute” thing is less about what LLMs CAN do in one minute, and more about what humans CAN’T do in one minute. My claim would be that humans have a superpower of “real” continual learning, which nobody knows how to do with LLMs. But if you give a human just 60 seconds, then they can’t really use that superpower very much, or at least, they can’t get very far with it. It usually takes much more than one minute for people to build and internalize new concepts and understanding to any noticeable degree.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
Yes, even with a context window that contains all 10M moves. Making that argument was the whole point of the second half of the OP. If you don’t find that convincing, I’m not sure what else to add. ¯\_(ツ)_/¯
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, and that (2) human learning and thinking is part of that family (although we don’t know a priori which one), and that (3) you can take some adult human “Joe”, and search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Grog understands some things (e.g. intuitive physics) but not others (e.g. pulsed lasers). Likewise, LLMs understand some things (e.g. pulsed lasers) but not others (e.g. some new field of science that hasn’t been invented yet). Right? We’re not at the end of history, where everything that can possibly be understood is already understood, and there’s nothing left.
If I hibernated you until the year 2100, and then woke you up and gave you a database with “actionable knowledge” from 1000 textbooks of [yet-to-be-invented fields of science], and asked you to engineer a state-of-the-art [device that no one today has even conceived of], then you would be just as helpless as Grog. You would have to learn the new fields until you understood them, which might take years, before you could even start on the task. This process involves changing the “weights” in your brain. I.e., you would need “real” learning. The database is not a replacement for that.
So think of it this way: there’s some set of things that are understood (by anyone), and that set of things is not increased via a system for pulling up facts from a database. Otherwise Grog would be able to immediately design LIDAR. And yet, humans are able to increase the set of things that are understood, over time. After all, “the set of things that are understood” sure is bigger today than it was 1000 years ago, and will be bigger still in 2100. So evidently humans are doing something very important that is entirely different from what can be done with database systems. And that thing is what I’m calling “real” continual learning.
The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used.
(still talking about this paper) Are you saying that the GLA was trained ONLY on imitation learning during the 31 episodes shown, in which the PPO “teacher” performed no better than a random policy, and then the GLA got way higher scores?
If so … no way, that’s patently absurd. Even if I grant the premise of the paper for the sake of argument, the GLA can’t learn to improve itself via imitating a PPO teacher that is not actually improving itself!
So, if the right-side-of-figure-3 data is not totally fabricated or mis-described, then my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that, and that by the end of the training data, the PPO “teacher” was performing much better than shown in the figure, and at least as well as the top of the GLA curve.
I’m pretty confused. This comment is just trying to get on the same page before I start arguing :-)
I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual-learning objective (e.g. RL, self-distillation, whatever), or by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn.
My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
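As a toy version of that picture (entirely my own construction, not anything from the thread): let the “family” be gradient descent parametrized by learning rate, let the “teacher” be a black box known to be one member of the family, and let imitation learning be a search for the member whose behavior matches the teacher’s.

```python
# Toy illustration (hypothetical; names are mine, not from the discussion):
# a parametrized family of learning algorithms, and a black-box teacher
# known to lie somewhere in that family.

def run_learner(lr, steps=20, w0=10.0):
    """One member of the family: gradient descent on f(w) = w^2."""
    w, traj = w0, []
    for _ in range(steps):
        w = w - lr * 2 * w  # gradient of w^2 is 2w
        traj.append(w)
    return traj

# Black box: we only get to observe the teacher's behavior (its trajectory).
teacher_traj = run_learner(0.13)

# "Imitation learning" here = searching the family for the member whose
# behavior matches the teacher's observed behavior.
candidates = [i / 100 for i in range(1, 50)]
mismatch = lambda lr: sum((a - b) ** 2 for a, b in zip(run_learner(lr), teacher_traj))
best = min(candidates, key=mismatch)
print(best)  # 0.13 -- we recover which learning algorithm the teacher was
```

The point being: imitation can identify which member of a known, pre-specified family the teacher is, but only because the family was written down in advance.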
Does that match what you’re trying to say here?
I mentioned in a footnote that the “algorithmic distillation” paper (Laskin et al. 2022) was misleading, as discussed here. Your links are in the same genre, and I’m pretty skeptical of them too. Also confused.
I mostly tried to read your first suggestion, Towards General-Purpose In-Context Learning Agents.
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
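For concreteness, the logic of that proposed experiment could be sketched like this (my own toy Python with made-up names such as `truncation_experiment`; nothing here is from the paper’s code, and the dummy “students” just stand in for a trained GLA):

```python
# Toy harness for the proposed truncation experiment (all names hypothetical).

def exceeds_training_data(student_returns, teacher_returns_seen):
    """The key check: did the student ever beat the best return it observed?"""
    return max(student_returns) > max(teacher_returns_seen)

def truncation_experiment(teacher_returns, frac=0.5, train_student=None):
    # Cut off the teacher's data well before convergence...
    cutoff = int(len(teacher_returns) * frac)
    seen = teacher_returns[:cutoff]
    # ...train the in-context learner only on the truncated data...
    student_returns = train_student(seen)
    # ...and ask whether it keeps improving past anything it ever observed,
    # as the (unseen) PPO teacher would if it kept running.
    return exceeds_training_data(student_returns, seen)

# Dummy students to exercise the harness:
copier = lambda seen: list(seen)           # only reproduces observed returns
improver = lambda seen: [max(seen) + 1.0]  # genuinely surpasses them

teacher = [0.1 * t for t in range(100)]    # stand-in improving return curve
print(truncation_experiment(teacher, train_student=copier))    # False
print(truncation_experiment(teacher, train_student=improver))  # True
```

Only the `improver`-style outcome would be evidence of “learning how to learn” rather than learning-by-copying.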
The transfer-learning thing (fig 5) is hard to interpret. What does “not randomized” mean? Why does PPO start at zero and then immediately get worse in the bottom-left one? What would be the “test return” for a random policy, or the no-op policy, or any other relevant baseline, for all four of these? Why is their PPO so bad? Were they using crappy PPO hyperparameters to make GLA look better by comparison? How many other environments did they try but bury in their file drawer? Why is their source code not online? The curves just generally look really unconvincing to me, and my gut reaction is that they were just flailing around for something to publish, because their exciting claim (meta-learning) doesn’t really work.
I could be wrong, perhaps you’re more familiar with this literature than I am.
Thanks!
I must imagine that there’s some neuroscience literature on sexual attraction, where brain region activations are cross-referenced with self-reported feelings of attraction, and referencing this would help support the point.
Alas, it’s much less useful than you’d think, at least for the kinds of questions I’m interested in. My view is that it’s extremely difficult to learn anything useful from fMRI studies here. I think the important nuts-and-bolts questions would be answerable by measuring the activity and interconnectivity of tiny cell groups in the hypothalamus, but that’s not experimentally possible as of today.
(fMRI is not helpful for that: the relevant cell groups are all too small and physically proximate (sometimes even intermingled) to tell them apart by location as opposed to by receptor expression etc., and it’s moot anyway because fMRI just can’t measure the human hypothalamus at all, it’s too close to a major artery or something, I forget.)
The claim “modern LLMs can pursue goals and act like agents” does not contradict the claim “modern LLMs get their capabilities primarily from imitative learning”, right? Because there are examples in the pretraining data (including in the specially-commissioned proprietary expert-created data) where human-created text enacts the pursuit of goals. Right? See also here.
And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through errors/issues, and make conclusions that no human knows.
I do think LLMs struggle when they depart from what’s in the pretraining data, but the meaning of that is a bit tricky to pin down. Like, if I ask you to “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”, you can do that in a fraction of a second, and correctly answer follow-up questions about that scenario, even though this specific mental image has never happened before in the history of the world. And LLMs can do that kind of thing too. Is that “departing from what’s in the pretraining data”? My answer is: No, not in the sense that matters. When you say your LLM “make[s] conclusions that no human knows”, I suspect it’s a similar kind of thing: it’s not “departing from the pretraining data” in the sense that matters. Indeed, anything that a third party can simply read and immediately understand is not “departing from the pretraining data” in the sense that matters, even if the person didn’t already know it.
By contrast, if you don’t know linear algebra, you can’t simply read a linear algebra textbook and immediately understand it. You need to spend many days and weeks internalizing these new ideas.
Anyway, in the post, I tried to be maximally clear-cut, by using the example of how billions of humans over thousands of years invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch, without angels dropping new training data from the heavens. I very strongly don’t believe that billions of LLMs over thousands of years, in a sealed datacenter without any human intervention or human data, could do that. That would be real departure from the pretraining data. See also here and the first section here.
FYI I just wrote a post You can’t imitation-learn how to continual-learn which is related to this thread.
As an analogy, take an adult from 30000 BC, call him Grog, and give him access to a database of “actionable knowledge of 1000 textbooks”, and then tell him to go invent a less expensive solid-state LIDAR system. Will he immediately start making progress? I say “obviously not”.
What would the “actionable knowledge” look like? Maybe one piece of “actionable knowledge” is some fact from the ANSI Z136.1 laser eye safety manual (“For pulsed lasers of 1ns–50μs pulse duration and beam diameter 1 cm, at viewing distance 20 cm, the diffusely reflected beam energy cannot safely exceed 0.022 × CA joules, where CA is the correction factor for IR-A light based on reduced absorption properties of melanin”.) OK, Grog looks at that and immediately has some questions. What does “laser” mean? What is a “pulsed laser”? What does “ns” mean? What does “beam diameter” mean? What does “diffusely reflected” mean? Etc. etc.
This “knowledge” is not in fact “actionable” because Grog can’t make heads or tails of it.
And ditto for pretty much every other item in the database. Right?
What Grog would need to do is spend years developing a deep understanding of optics and lasers and so on before he could even start inventing a new LIDAR system. Of course, that’s what modern LIDAR inventors do: spend years developing understanding. Once Grog has that understanding, then yeah sure, convenient database access to relevant facts would be helpful, just as modern LIDAR inventors do in fact keep the ANSI Z136.1 manual in arm’s reach.
Thus, there’s more to knowledge than lists of facts. It’s ways that the facts all connect to each other in an interconnected web, and it’s ways to think about things, etc.
I claim that this all transfers quite well to LLMs. It’s just that LLMs already have decent “understanding” of everything that humans have ever written down anywhere on the internet or in any book, thanks to pretraining. So in our everyday interactions with LLMs, we don’t as often come across situations where the LLM is flailing around like poor Grog. But see 1, 2.
Those are cool ideas, but I don’t think they qualify as (what I’m calling) “real” continual learning, as defined in the section “Some intuitions on how to think about ‘real’ continual learning”.
See everything I wrote in the section “Some intuitions on how to think about ‘real’ continual learning”. The thing you’re describing is definitely not (what I’m calling) “real” continual learning.
Should the thing you’re describing be called “continual learning” at all? No opinion. Call it whatever you want.
No opinion about (a) and (b), but Bayesian inference can only do as well as its hypothesis space allows, and I think the true hypothesis here is WAY outside the hypothesis space, regardless of context size. That’s what I was trying to get across with that table I put into the OP.
So maybe that’s (c), but I don’t really know what you mean by “capacity” in this context.
Your last paragraph sounds to me like brainstorming how to build a continual learning setup for LLMs. As I mentioned at the bottom, such a system might or might not exist, but that would be out of scope for this post. If something in that genre worked, the “continual learning” in question would be coming from PyTorch code that assembles data and runs SGD in a loop, not from imitation learning, if I’m understanding your text correctly.
You can’t imitation-learn how to continual-learn
(Thinking out loud.) I have really been unimpressed with LLM-assisted writing I’ve seen to date (and yes that includes “cyborg” writing from established users), and would be happy to see it banned entirely (maybe with exceptions for straightforward audio transcription and machine translation). Especially given the “second-order effects on culture” that Raemon mentioned here. Like, LLMs help people write, but removing friction sometimes makes things worse not better.
Then I was thinking: Is there any situation where I would use an LLM block myself? Hmm, maybe for “boilerplate” explanations of well-known background information—the same kinds of situations where I might otherwise block-quote from a textbook.
Well anyway, the current system seems OK. I guess the idea is that the blocks are subtle enough that people will feel little hesitation in using them, which is good, because then I’ll know who to ignore :-P
Thanks!
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any.
For one thing, my headline claim is “ruthless sociopath”, not “maximizing”. “Ruthless sociopath” is pointing to something that’s missing (intrinsic concern for the welfare of other people), not something that’s present (behaviors that maximize something in the world).
For another thing, strictly speaking, perfect maximization is impossible without omniscience.
For another thing, if a powerful ASI cares about increasing staples, and also paperclips, and also any number of other office supplies, that doesn’t help us: it will still wipe out humanity and create a future devoid of value. Indeed, even maximizers can “care” about multiple things. E.g. if a utility-maximizer has utility function U = log(log(staples)) + log(log(paperclips)), then it will stably split its time between staple and paperclip production forever. [I put in the “log log” to ensure strongly diminishing returns, enough to overcome any economies of scale.]
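To make that concrete, here’s a toy check (my own illustration, not anything from the thread): a greedy maximizer of U = log(log(staples)) + log(log(paperclips)) keeps splitting its effort between the two goods rather than locking onto one, because the marginal utility of whichever good it has more of is always lower.

```python
import math

# Toy check: a maximizer of U = log(log(s)) + log(log(p)) splits its
# production between the two goods forever (diminishing returns dominate).
U = lambda s, p: math.log(math.log(s)) + math.log(math.log(p))

s, p = 2.0, 2.0  # start above 1 so log(log(x)) is defined
for _ in range(10_000):
    # Greedily allocate each unit of production to whichever good
    # raises U more, i.e. follow marginal utility.
    if U(s + 1, p) >= U(s, p + 1):
        s += 1
    else:
        p += 1
print(s, p)  # the split stays even: the allocation alternates forever
```

The allocation never runs away toward one good, which is exactly the “stably split its time” behavior claimed above. (Of course that stability is no comfort to us; it just shows that caring about multiple things is compatible with maximizing.)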
Given that it will have the whole of human history as training data, one of the lessons it will have absorbed is that ruthless prioritisation of a single goal tends to provoke counter-coalitions. The smart thing to do is manage within an ecosystem of other AIs and humans, not maximise against them (which is a fraught and unstable pattern).
I agree that a ruthless sociopath agent, one which has callous indifference to whether you or anyone else lives or dies, will nevertheless act kind to you, when acting kind to you is in its self-interest. And then if the situation changes, such that acting kind to you stops being in its self-interest, then it will not hesitate to stab you in the back (betray you, murder you, blackmail you, whatever). And even before that, it will be constantly entertaining the idea of stabbing you in the back, and then deciding that this idea is (currently) inadvisable, and thus continuing to act kindly towards you.
Hopefully we can agree that this is not a description of normal human relations.
…But even if this is not normal human relations, one could argue that it’s fine, because we can still build a good healthy civilization out of AIs that all have this kind of disposition. And indeed, there are people who make that argument. But I strongly disagree. I was writing about this topic recently, see §5 of my post “6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa”: “The human intuition that societal norms and institutions are mostly stably self-enforcing”.
See also a thread here where I was also complaining about this.