I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
I think Yann LeCun thinks “AGI in 2040 is perfectly plausible”, AND he believes “AGI is so far away it’s not worth worrying about all that much”. It’s a really insane perspective IMO. As recently as like 2020, “AGI within 20 years” was universally (correctly) considered to be a super-soon forecast calling for urgent action, as contrasted with the people who say “centuries”.
It’s a funny comment because legally Conjecture is for-profit and OpenAI is not. It just goes to show that the various internal and external pressures and incentives on an organization and its staff are not encapsulated by glancing at their legal status—see also my comment here.
Anyway, I don’t think Connor is being disingenuous in this particular comment, because he has always been an outspoken advocate for government regulation of all AGI-related companies including his own.
I don’t think it’s crazy or disingenuous in general to say “This is a terrible system, so we’re gonna loudly criticize it and advocate to change it. But meanwhile / in parallel, we’re gonna work within the system we got, and do the best we can.” And “the system we got” is that private organizations are racing to develop AGI.
I’m not so sure it’s the same—my interpretation was something like:
Yudkowsky plan: Make an AI that designs a certain kind of nanobot
Seth plan: Make an AI that does what I tell it to do, and then I will tell it to design a certain kind of nanobot
For example, in this comment, @Rob Bensinger was brainstorming nanobot-specific things that one might put into the source code. (Warning that Rob is not Eliezer.) (Related.)
@nostalgebraist bites that bullet here:
“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.
What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. If you try to formalize a concept like “doing a task well,” or even “being an entity that acts freely and wants things,” in the most generic terms with no constraints whatsoever, you end up writing down “reinforcement learning.”
and so does Russell & Norvig 3rd edition:
Reinforcement learning might be considered to encompass all of AI.
…That’s not my perspective though.
For my part, there’s a stereotypical core of things-I-call-RL which entails:
(A) There’s a notion of the AI’s outputs being better or worse
(B) But we don’t have ground truth (even after-the-fact) about what any particular output should ideally have been
(C) Therefore the system needs to do some kind of explore-exploit.
By this (loose) definition, both model-based and model-free RL are central examples of “reinforcement learning”, whereas LLM self-supervised base models are not reinforcement learning (cf. (B), (C)), nor are ConvNet classifiers trained on ImageNet (ditto), nor are clustering algorithms (cf. (A), (C)), nor is A* or any other exhaustive search within a simple deterministic domain (cf. (C)), nor are VAEs (cf. (C)), etc.
(A,B,C) is generally the situation you face if you want an AI to win at videogames or board games, control bodies while adapting to unpredictable injuries or terrain, write down math proofs, design chips, found companies, and so on.
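To make (A), (B), (C) concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit, which is about the simplest thing I would put inside that stereotypical core. Everything in it (the function name `run_bandit`, the arm means, the `epsilon` value) is a hypothetical illustration, not taken from any particular library or paper.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit (hypothetical toy setup).

    (A) Each pull returns a scalar reward, so outputs are better or worse.
    (B) The learner is never told which arm it *should* have pulled.
    (C) So it has to mix exploring with exploiting its current estimates.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # how often each arm was pulled
    estimates = [0.0] * n_arms     # running mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                     # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)            # noisy reward only
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates

# The true means are hidden from the learner; they only generate rewards.
counts, estimates = run_bandit([0.0, 0.5, 2.0])
print(counts, [round(e, 2) for e in estimates])
```

The contrast with a supervised classifier is that there is no per-example label here: the learner only ever sees the reward for the arm it actually pulled, which is why the explore step is needed at all.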
This (loose) definition of RL connects to AGI safety because (B-C) makes it harder to predict the outputs of an RL system. E.g. we can plausibly guess that an LLM base model, given internet-text-like prompts, will continue in an internet-text-typical way. Granted, given OOD prompts, it’s harder to say things a priori about the output. But that’s nothing compared to e.g. AlphaZero or AlphaStar, where we’re almost completely in the dark about what the trained model will do in any nontrivial game-state whatsoever. (…Then extrapolate the latter to human-level AGIs acting in the real world!)
(That’s not an argument that “we’re doomed if AGI is based on RL”, but I do think that a very RL-centric AGI would need tailored approaches to thinking about safety and alignment that wouldn’t apply to LLMs; and I likewise think that a massive increase in the scope and centrality of LLM-related RL (beyond the RLHF status quo) would raise new (and concerning) alignment issues, different from the ones we’re used to with LLMs today.)
Yeah most of the time I’ll open my to-do list and just look at one of the couple very leftmost columns, and the column has maybe 3 items, and then I’ll pick one and do it (or pick a few and schedule them for that same day).
Occasionally I’ll look at a column farther to the right, and see if any ought to be moved left or right. The further right, the less often I’m checking.
short lists fit much better into working memory
IMO the main point of a to-do list is to not have the to-do list in working memory. The only thing that should be in working memory is the one thing you’re actually supposed to be focusing on and doing, right now. Right?
Or if you’re instead in the mode of deciding what to do next, or making a schedule for your day, etc., then that’s different, but working memory is still kinda irrelevant because presumably you have your to-do list open on your computer, right in front of your eyes, while you do that, right?
easier and more comprehensive approach would be to use a text editor with proper version control and diff features, and then to name particular versions before making major changes.
Is that what you do? It’s not a good fit to my typical workflow. But I’m definitely in favor of trying different things and seeing what works best for you. :)
It’s not just the title but also (what I took to be) the thesis statement “Comparisons are often useful, but in this case, I think the disanalogies are much more compelling than the analogies”, and also “While I think the disanalogies are compelling…” and such.
For comparison:
Question: Are the analogies between me (Steve) and you (David) more compelling, or less compelling, than the disanalogies between me and you?
I think the correct answer to that question is “Huh? What are you talking about?” For example:
If this is a conversation about gross anatomy or cardiovascular physiology across the animal kingdom, then you and I are generally extremely similar.
If this is a conversation about movie preferences or exercise routines, then I bet you and I are pretty different.
So at the end of the day, that question above is just meaningless, right? We shouldn’t be answering it at all.
I don’t think disanalogies between A and B can be “compelling” or “uncompelling”, any more than mathematical operations can be moist or dry. I think a disanalogy between A and B may invalidate a certain point that someone is trying to make by analogizing A to B, or it may not invalidate it, but that depends on what the point is.
FWIW I do think “Biorisk is Often an Unhelpful Analogy for AI Risk,” or “Biorisk is Misleading as a General Analogy for AI Risk” are both improvements :)
I agree that in the limit of an extremely structured optimizer, it will work in practice, and it will wind up following strategies that you can guess to some extent a priori.
I also agree that in the limit of an extremely unstructured optimizer, it will not work in practice, but if it did, it would find out-of-the-box strategies that are difficult to guess a priori.
But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
On the contrary, I think it’s possible to design an optimizer which is structured enough to work well in practice, while simultaneously being unstructured enough that it will find out-of-the-box solutions very different from anything the programmers were imagining.
Examples include:
MuZero: you can’t predict a priori what chess strategies a trained MuZero will wind up using by looking at the source code. The best you can do is say “MuZero is likely to use strategies that lead to its winning the game”.
“A civilization of humans” is another good example: I don’t think you can look at the human brain neural architecture and loss functions etc., and figure out a priori that a civilization of humans will wind up inventing nuclear weapons. Right?
This book argues (convincingly IMO) that it’s impossible to communicate, or even think, anything whatsoever, without the use of analogies.
If you say “AI runs on computer chips”, then the listener will parse those words by conjuring up their previous distilled experience of things-that-run-on-computer-chips, and that previous experience will be helpful in some ways and misleading in other ways.
If you say “AI is a system that…” then the listener will parse those words by conjuring up their previous distilled experience of so-called “systems”, and that previous experience will be helpful in some ways and misleading in other ways.
Etc. Right?
If you show me an introduction to AI risk for amateurs that you endorse, then I will point out the “rhetorical shortcuts that imply wrong and misleading things” that it contains—in the sense that it will have analogies between powerful AI and things-that-are-not-powerful-AI, and those analogies will be misleading in some ways (when stripped from their context and taken too far). This is impossible to avoid.
Anyway, if someone says:
When it comes to governing technology, there are some areas, like inventing new programming languages, where it’s awesome for millions of hobbyists to be freely messing around; and there are other areas, like inventing new viruses, or inventing new uranium enrichment techniques, where we definitely don’t want millions of hobbyists to be freely messing around, but instead we want to be thinking hard about regulation and secrecy. Let me explain why AI belongs in the latter category…
…then I think that’s a fine thing to say. It’s not a rhetorical shortcut, rather it’s a way to explain what you’re saying, pedagogically, by connecting it to the listener’s existing knowledge and mental models.
This is an 800-word blog post, not 5 words. There’s plenty of room for nuance.
The way it stands right now, the post is supporting conversations like:
Person A: It’s not inconceivable that the world might wildly under-invest in societal resilience against catastrophic risks even after a “warning shot” for AI. Like for example, look at the case of bio-risks—COVID just happened, so the costs of novel pandemics are right now extremely salient to everyone on Earth, and yet, (…etc.).
Person B: You idiot, bio-risks are not at all analogous to AI. Look at this blog post by David Manheim explaining why.
Or:
Person B: All technology is always good, and its consequences are always good, and spreading knowledge is always good. So let’s make open-source ASI asap.
Person A: If I hypothetically found a recipe that allowed anyone to make a novel pandemic using widely-available equipment, and then I posted it on my blog along with clearly-illustrated step-by-step instructions, and took out a billboard in Times Square directing people to the blog post, would you view my actions as praiseworthy? What would you expect to happen in the months after I did that?
Person B: You idiot, bio-risks are not at all analogous to AI. Look at this blog post by David Manheim explaining why.
Is this what you want? I.e., are you on the side of Person B in both these cases?
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right?
I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.
I have to admit that I’m struggling to find these arguments at the moment
I sometimes say things kinda like that, e.g. here.
All the examples of “RL” doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.
I guess when you say “powerful world models”, you’re suggesting that model-based RL (e.g. MuZero) is not RL but rather “RL”-in-scare-quotes. Was that your intention?
I’ve always thought of model-based RL as a central subcategory within RL, as opposed to an edge-case.
I wish you had entitled / framed this as “here are some disanalogies between biorisk and AI risk”, rather than suggesting in the title and intro that we should add up the analogies and subtract the disanalogies to get a General Factor of Analogizability between biorisk and AI risk.
We can say that they’re similar in some respects and different in other respects, and if a particular analogy-usage (in context) is leaning on an aspect in which they’re similar, that’s good, and if a particular analogy-usage (in context) is leaning on an aspect in which they’re different, that’s bad. For details and examples of what I mean, see my comments on a different post: here & here.
Let D be the distribution of (reward, trajectory) pairs for every possible trajectory.
Split D into two subsets: D1 where reward > 7 and D2 where reward ≤ 7.
Suppose that, in D, 1-in-a-googol samples is in the subset D1, and all the rest are in D2.
(For example, if a videogame involves pressing buttons for 20 minutes, you can easily have less than 1-in-a-googol chance of beating even the first mini-boss if you press the buttons randomly.)
Now we randomly pick a million samples from D in an attempt to learn the distribution D. But (as expected with overwhelming probability), it turns out that every single one of those million samples is more specifically in the subset D2.
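(As a rough sanity check on “overwhelming probability”: assuming the million samples are independent draws, and taking the 1-in-a-googol figure at face value, Bernoulli’s inequality gives the bound below.)

```latex
\Pr\left[\text{all } 10^{6} \text{ samples lie in } D_2\right]
  = \left(1 - 10^{-100}\right)^{10^{6}}
  \;\ge\; 1 - 10^{6} \cdot 10^{-100}
  = 1 - 10^{-94}
```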
Now consider a point X in D1 (its reward is 30). Question: Is X “out of distribution”?
Arguably no, because we set up a procedure to learn the distribution D, and D contains X.
Arguably yes, because when we ran the procedure, all the points actually used to learn the distribution were in D2, so we were kinda really only learning D2, and D2 doesn’t contain X.
(In the videogame example, if in training you never see a run that gets past the first mini-boss, then certainly intuitively we’d say that runs that do get past it, and that thus get to the next stage of the game, are OOD.)
Anyway, I was gonna say the opposite of what you said—sufficiently hard optimization via conditional sampling works in theory (i.e., if you could somehow learn D and conditionally sample on reward>7, it would work), but not in practice (because reward>7 is so hard to come by that you will never actually learn that part of D by random sampling).
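Here is a toy sketch of that “works in theory, not in practice” point. It is a made-up stand-in for the videogame example, not a real game or benchmark: the button set, the `HORIZON` of 60 presses, the `WINNING` sequence, and the reward scale are all assumptions, chosen so that D1 is astronomically rare (about 1 in 4^60 here, not a googol). “Conditional sampling” in this sketch is just filtering the empirical samples on reward > 7.

```python
import random

BUTTONS = "ABXY"      # hypothetical controller
HORIZON = 60          # button presses per trajectory (toy value)
# Hypothetical: the one exact sequence that beats the first mini-boss.
WINNING = "".join(random.Random(0).choices(BUTTONS, k=HORIZON))

def reward(trajectory: str) -> int:
    """30 for beating the mini-boss, otherwise partial credit of at most 6."""
    if trajectory == WINNING:
        return 30
    progress = 0
    for pressed, needed in zip(trajectory, WINNING):
        if pressed != needed:
            break
        progress += 1
    return 7 * progress // HORIZON

def sample_D(rng: random.Random, n: int):
    """Draw n (reward, trajectory) pairs by pressing buttons at random."""
    samples = []
    for _ in range(n):
        trajectory = "".join(rng.choices(BUTTONS, k=HORIZON))
        samples.append((reward(trajectory), trajectory))
    return samples

def conditional_sample(samples, threshold=7):
    """Keep only the samples in D1, i.e. those with reward > threshold."""
    return [(r, t) for r, t in samples if r > threshold]

data = sample_D(random.Random(1), 20_000)
high = conditional_sample(data)
p_d1 = len(BUTTONS) ** -HORIZON   # chance that one random sample lands in D1

print(f"P(sample in D1) ~ {p_d1:.1e}")
print(f"samples drawn: {len(data)}, samples with reward > 7: {len(high)}")
```

With these settings I would expect `high` to come back empty: the sampler was nominally drawing from all of D, but everything it actually saw was in D2, so there is nothing in the data to condition on.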
A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.
Yeah, you can have something which is “a brilliant out-of-the-box solution to a tricky problem” from the AI’s perspective, but is “reward-hacking / Goodharting the value function” from the programmer’s perspective. You say tomato, I say to-mah-to.
It’s tricky because there’s economic pressure to make AIs that will find and execute brilliant out-of-the-box solutions. But we want our AIs to think outside of some of the boxes (e.g. yes you can repurpose a spare server rack frame for makeshift cable guides), while definitely staying inside other boxes (e.g. no you can’t take over the world). Unfortunately, the whole idea of “think outside the box” is that we’re not aware of all the boxes that we’re thinking inside of.
The particular failure mode of “leaving one thing out” is starting to seem less likely on the current paradigm. Katja Grace notes that image synthesis methods have no trouble generating photorealistic human faces. Diffusion models don’t “accidentally forget” that faces have nostrils, even if a human programmer trying to manually write a face image generation routine might. Similarly, large language models obey the quantity-opinion-size-age-shape-color-origin-purpose adjective order convention in English without the system designers needing to explicitly program that in or even be aware of it, despite the intuitive appeal of philosophical arguments one could make to the effect that “English is fragile.”
All three of those examples are of the form “hey here’s a lot of samples from a distribution, please output another sample from the same distribution”, which is not the kind of problem where anyone would ever expect adversarial dynamics / weird edge-cases, right?
(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))
Or maybe you were making a different point in this particular paragraph?
I appreciate the brainstorming prompt but I can’t come up with anything useful here. The things you mention are related to cortex lesions, which would presumably leave the brainstem spatial attention system intact. (Brainstem damage is more rare and often lethal.) The stuff you say about neglect is fun to think about but I can’t see situations where there would be specifically-social consequences, in a way that sheds light on what’s happening.
There might be something to the fact that the temporoparietal junction (TPJ) seems to include areas related to spatial attention, and is also somehow involved in theory-of-mind tasks. I’ve been looking into that recently—in fact, that’s part of the story of how I came to write this post. I still don’t fully understand the TPJ though.
Hmm, there do exist lesion studies related to theory-of-mind, e.g. this one—I guess I should read them.
I think I would feel characteristic innate-fear-of-heights sensations (fear + tingly sensation for me, YMMV) if I were standing on an opaque bridge over a chasm, especially if the wood is cracking and about to break. Or if I were near the edge of a roof with no railings, but couldn’t actually see down.
Neither of these claims is straightforward rock-solid proof that the thing you said is wrong, because there’s a possible elaboration of what you said that starts with “looking down” as ground truth and then generalizes that ground truth via pattern-matching / learning algorithm—but I still think that elaborated story doesn’t hang together when you work through it in detail, and that my “innate ‘center of spatial attention’ constantly darting around local 3D space” story is much better.
If I’m looking up at the clouds, or at a distant mountain range, then everything is far away (the ground could be cut off from my field-of-view)—but it doesn’t trigger the sensations of fear-of-heights, right? Also, I think blind people can be scared of heights?
Another possible fear-of-heights story just occurred to me—I added to the post in a footnote, along with why I don’t believe it.
Just wanted to flag that this post was the most helpful single thing I read about social status in the course of writing my own recent posts on that topic (part 1, part 2). Thanks!!