I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
I guess you would know better than me, so I changed it from “probably” to “often” and from “race dynamics” to “being in too much of a rush” (the latter is what I meant all along but hopefully clearer). But I’m reluctant to go further than that. I do really have an impression that most people in LLM world think that the things people have been doing (constitutions, deliberative alignment, inoculation prompting, using interp to test for eval-awareness, etc.) are what progress on alignment looks like, and that if egregious misalignment and scheming happen in the future it would be either because the good guys were in too much of a hurry to iterate and develop more and better techniques in that same general genre, or because the good guys were not the ones building ASI.
RE 1, sure, “LLM will invent non-LLM ASI” is possible in principle, and would be a special case of “LLMs do not scale to ASI”. I do mention that (in the “Yudkowsky & Soares’s position [caricatured]” section).
RE 2, he wrote that “current AIs seem pretty misaligned”, not that current AIs are egregiously misaligned, scheming, and ruthless. I obviously do not think we should extrapolate from empirical observation of today’s LLMs to future ASI, but if I DID so extrapolate, I think my attitude would be vaguely like “eh, maybe future ASI will be egregiously misaligned and scheming, even if people really try hard using known techniques, but probably not? And even if it happens to some degree, the AIs would still probably be at least slightly nice, and maybe that’s good enough?” That would be the kind of thing LLM people might say. By contrast, Yudkowsky & Soares (and me) are very very much more pessimistic than that.
Eh, I see that as a separate debate. (I.e., “Suppose Yudkowsky & Soares are right that ASI will definitely be egregiously-misaligned & scheming in the absence of yet-to-be-invented breakthrough technical alignment ideas. Is it plausible that weaker AIs could find those breakthrough technical alignment ideas? Or not?” That’s a live debate, but it’s a different debate than I’m discussing in this post. Lots of people would not grant the premise.)
I feel like I learn a lot of things by doing a project while feeling enough slack to explore a possible better way to do it (but with more of a learning curve), and wind up learning something new and important that I can use going forward. Or just go down a rabbit hole trying to understand something related to what I’m doing. If I was always doing things in a rush for my whole life, I feel like I would know many fewer things and have many fewer skills, even if I would have done more projects.
I understand that different roles have different tradeoffs (and different people are different) but I’m wondering if you’ve noticed anything like that.
(On reflection, this is more a semi-redundant riff on what you already wrote, and less a good responsive comment, but oh well, I already wrote it so I guess I’ll hit publish.)
When I think about the challenges with applying Solomonoff induction in practice, which the scientific method was designed around, I see two main things.
The first, as you point out, is that hypotheses are sufficiently modular that we CAN develop them piecemeal, and sufficiently complicated that we MUST develop them piecemeal. Thus, scientific hypotheses (being just one modular piece of a “real” hypothesis) can “remain agnostic” about certain observations. As I joked here, “if you treat The Law Of Conservation Of Energy as a “hypothesis”, and you “ask” Conservation Of Energy what the half-life of tritium is, then Conservation Of Energy will tell you “Huh? How should I know? Why are you asking me?”” This property (that hypotheses can be agnostic about things) is also characteristic of logical induction / prediction markets (as you point out), and infra-Bayesianism has that property too.
The second is that parsimony / Occam’s razor / Solomonoff prior is central to finding the truth, but scientists range from being imperfect at assessing the complexity-vs-parsimony of a theory, to being atrociously bad at it. So the scientific enterprise is set up to rely as little as possible on complexity-assessments. Thus, as you point out, if we could perfectly assess the complexity-vs-parsimony of a hypothesis, then there would be no need to treat prediction and retrodiction differently. The retrodiction problem is about putting too many bits into the hypothesis, and it’s only a problem because people are lousy at taking a theory and reading out the number of bits in it (i.e. they don’t notice epicycles and special pleading). So again, the scientific institution is set up to minimize reliance on complexity-assessments. But that minimum reliance is still higher than zero. You just can’t get away from it entirely. Even “data” is not theory-free, because you need theory to get from “raw” data to so-called observations.
The results are different in different fields (and sometimes pathological), as a field might or might not equilibrate to a state where practitioners with the sharpest discernment of complexity-vs-parsimony command the most respect and power and sway. See my comments here & here for hot-take examples from real-world academic fields.
So anyway, if I were working on this project, the first thing I would try is to say that the “ideal” is Solomonoff induction searching for a true hypothesis which happens to be modular (i.e. it has different pieces covering different, cleanly-separable domains), and then introduce a constraint that you can only measure bits-of-complexity with an extremely noisy ruler, and try to judge truth-seeking setups like LI etc. by how well they approximate the Solomonoff ideal under those assumptions / constraints. (...But I dunno, I didn’t think about it very hard.)
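To make the “extremely noisy ruler” idea slightly more concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the hypothesis names, the bit counts, the likelihoods, and the choice of Gaussian noise on the complexity reading are all made up. It scores each hypothesis by a Solomonoff-style prior (2 to the minus complexity-in-bits) times likelihood, then checks how often a selector with a noisy complexity reading picks the same winner as the noise-free ideal.

```python
import random

def ideal_score(complexity_bits, log2_likelihood):
    """Log2 of the unnormalized posterior: prior 2^-K times likelihood."""
    return -complexity_bits + log2_likelihood

def pick_best(hypotheses, noise_sd=0.0, rng=None):
    """Return the name of the top-scoring hypothesis.

    hypotheses: list of (name, complexity_bits, log2_likelihood)
    noise_sd:   std dev, in bits, of the error on each complexity reading
                (Gaussian noise is an assumption of this sketch)
    """
    if rng is None:
        rng = random.Random()
    best_name, best_score = None, float("-inf")
    for name, k_bits, log2_lik in hypotheses:
        measured_k = k_bits + rng.gauss(0.0, noise_sd)  # the noisy ruler
        score = ideal_score(measured_k, log2_lik)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def agreement_rate(hypotheses, noise_sd, trials=2000, seed=0):
    """Fraction of noisy selections that match the noise-free choice."""
    rng = random.Random(seed)
    ideal = pick_best(hypotheses, noise_sd=0.0, rng=rng)
    hits = sum(pick_best(hypotheses, noise_sd, rng) == ideal
               for _ in range(trials))
    return hits / trials

# Made-up example: a compact theory versus one that fits the data
# better only by spending many more bits (epicycles).
hypotheses = [
    ("simple", 20.0, -10.0),
    ("epicycles", 60.0, -2.0),
]
print(agreement_rate(hypotheses, noise_sd=0.0))
print(agreement_rate(hypotheses, noise_sd=40.0))
```

With a perfect ruler the compact theory always wins here; as `noise_sd` grows, the epicycle-laden theory wins some fraction of the time. A truth-seeking setup could then be judged, under these assumptions, by how much of that lost agreement it recovers.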
(Data point: I was complaining a bit about the effects of Inkhaven 1 on lesswrong, but Inkhaven 2 seemed fine.)
No I don’t find that plausible, sorry I don’t have time to explain why but this post section is related to where I’m coming from.
The OP is about the “deep learning sample efficiency gap”. But that’s not a deep learning paper. So I don’t think it provides any evidence here.
I agree that the social world is usually very very important for (1) making options salient, and (2) making options seem appealing, and (3) providing evidence about the consequences of different options. I think that’s the kernel of truth that this post is gesturing at.
But I think you’re taking that observation WAY too far.
In particular, the social world is not REQUIRED for any of those three things.
For one thing, if people learn planning from other people, where did it come from in the first place? Somebody had to have been the first, right?
For another thing, sometimes people do quite unusual things in the effective pursuit of goals. E.g. Jeff Bezos founded Amazon in order to get enough money to pursue his real dream of running a space exploration company. Who would he have learned that from?
(I think some people are more motivated by following norms than others. Sociopaths, autistics, and “high-agency people” would typically be on the lower end of norm-following motivation, so I would look there first to find especially clear-cut evidence of non-social agency.)
For another thing, if you take someone’s general advice (say, they counsel “it’s better to ask for forgiveness than permission!”), and then next week you end up humiliated and with a painful broken arm and giant hospital bill, aren’t you marginally less likely to follow that same heuristic in the future? Conversely, if you adopt their general advice and then next week you end up with a proud new accomplishment under your belt, aren’t you marginally more likely to follow that same heuristic in the future? Obviously yes, right? So doesn’t this constitute “[learning] through feedback about how well they fulfill your goals”?
“Low-level reflexes generalize to heuristics, which in turn generalize to a general planning algorithm”… I would also guess Steven Byrnes believes this (see below).
No, I think I’d mostly disagree with that statement. I think planning is basically innate, although it’s augmented by a lifetime of learning how to plan better (e.g. you can learn metacognitive heuristics from experience, or from reading a book etc.).
It’s not clear to me how “order food → food will come” is even supposed to be learned by the brain’s self-supervised learning/predictive processing or RL. The prediction error/reward comes in _a week_ after the prediction. And if it’s somehow deduced from higher-level knowledge about the world—how did that get learned?
Obviously it’s a hard problem that AI researchers have not solved yet, but it’s equally obvious to me that a solution exists in the brain. It seems crazy to me to deny that. We make a zillion accurate long-term predictions about the world every day (e.g. “if I put on ripped pants right now at 8am, then my knees might get cold when I’m outside at 10pm tonight, and I know this because it happened to me yesterday”). We make way too many long-term predictions in way too many circumstances to have learned all of them from observing or listening to other people. And even the things that we did learn from someone else telling us, that person in turn had to have learned it somehow, and if we trace that chain back then it eventually has to end in somebody actually figuring something out by observing the world. Right?
Have you really never in your life figured out something like “X today implies Y tomorrow” all by yourself that you didn’t learn from someone else??
I feel like I’m probably misunderstanding your position here, because it really seems crazy to me.
I’ve gotten into the habit of trying to model what’s going on when I experience an impulse for an action that could be interpreted as “long-term planning”, and it seems to me that it’s all actually just a bunch of superficial, distinct, socially learned behavioral patterns, rather than any planning through a world model or any general/sophisticated heuristics for accomplishing long-term goals
Maybe you should read Cate Hall’s book when it comes out? :-P
OK here’s an example that I challenge you to explain: if I’m hungry, I might take a bus to the restaurant to get a slice of pizza, but if I’m not hungry, then I won’t.
The obvious explanation that I endorse is: when I’m hungry, eating pizza seems good and motivating, so I make a plan to eat pizza, and execute the plan. When I’m not hungry, eating pizza seems pointless or aversive, so I don’t.
By contrast, this seems impossible to explain in your framework. If I’m just copying people, how can that get linked to my own interoceptive sensation of hunger? That sensation is private to me, and other people’s sensations of hunger are private to them. There’s no SOCIAL logic behind connecting my own internal sensation of hunger to a plan-to-eat. Right?
Moreover, the plan to eat pizza is clearly “planning through a world-model”. For example, if it’s 4am and the buses aren’t running and the pizza place is closed, then I won’t try to take a bus to the pizza place. If there’s a wildfire blazing between me and the restaurant, then I also won’t try to go there. I will set out to the restaurant only if it seems like eating pizza is the plausible result of doing so. Because I want to eat pizza.
Of course, I’m not omniscient, and even beyond that, sometimes I “know” something but temporarily forgot about it. Like, maybe I forgot that the restaurant owner was on vacation. Oops. But that doesn’t undermine the idea that I am hungry, and trying to get pizza so I can eat it. The goal (eating pizza) is in my mind, and I am brainstorming how to make that goal happen. Right??
Anyway, I reiterate the first paragraph of my comment, that there’s a kernel of truth here, and that it’s very important, even if I think you’re taking it way too far.
That all sounds fine, if we’re engaged in a pragmatic project for deciding what to do, and want to propose an answer that you and I can get behind, and that lots of people around the world can also get behind.
I think Arjun is (rightly) complaining about something different, namely that Eliezer and you and others frequently slip into treating this answer as being fundamentally privileged / “Right”, as opposed to merely a pragmatic option that you and I and lots of people can get behind.
E.g. here’s Nate referring to “the future’s potential value”, as if there’s a metric for that which is canonical and characteristic of humanity-as-a-whole. I think that’s moral-realist (or “crypto”-moral-realist) thinking, sneaking in.
(Interesting post, thanks for writing it!)
I do believe the brain has much higher sample efficiency than existing DNN algorithms, in the sense that matters for guessing future ASI compute requirements. But I agree that pinning down the comparison is a bit subtle.
(Also, sample-efficiency is not the main reason why I think that FLOP-required-for-ASI is low; the main reason is rather my attempt to guess how much compute the brain is doing. But sure, sample-efficiency is not totally irrelevant to how I think about these things, I suppose.)
The sensory data going to the brain is (I think) >99% visual, and >99.5% visual + audio. (IIRC … I didn’t double-check, and it’s kinda controversial how to calculate it anyway…)
So it’s interesting that congenitally blind people, and deafblind people, are basically just as smart and competent as sighted & hearing people, except obviously in contexts where the missing sensory data is directly relevant. I think this observation generally pushes against a perspective that centers the story of human intelligence around our abundant sensory data.
And more specifically, RE your Appendix, if we’re going to compare frontier LLM training data with human sensory data, we should also be putting blind and deafblind people onto those same plots / tables. And also, if we’re comparing sighted people to frontier models, we need to include the frontier models’ visual training data, not just text token training data … I don’t know how many extra bytes that would be, but I’d guess a lot.
I’m not exactly sure what point you’re trying to make with the discussion of Dreamer, EfficientZero, and related, but (copying from an argument I had on this topic in 2021):
I think that if somebody wants to understand AlphaZero, the fact that it trained on 40,000,000 games of self-play is a highly relevant and interesting datapoint. Suppose you were to then say “…but of those 40,000,000 games, fundamentally it really only needed 100 games with the external simulator to learn the rules. The other 39,999,900 games might as well have been ‘in its head’. This was proven in follow-up work.”. I would reply: “Oh. OK. That’s interesting too. But I still care about the 40,000,000 number. I still see that number as a very important part of understanding the nature of AlphaZero and similar systems.”
Anyway, if a human is playing chess in his head, or replaying a memory of ~~that embarrassing thing that I did one time in middle school~~ what they did yesterday, then he is not paying attention to sensory input. He’s probably mostly zoning out. So in a certain sense, the replay is replacing sensory data, as opposed to increasing the effective total amount of data, in humans. So, like, the thing in §3 where you note that LLMs can be more “sample-efficient” by doing 4 epochs of the same data, or the thing that EfficientZero etc. does, well, if you’re talking about sample-efficiency for the pragmatic reason of trying to solve an AI problem where you have lots of compute but strictly limited data, then cool, that kind of thing is helpful and important. But if you’re talking about sample-efficiency in the context of trying to compare and contrast humans versus current AIs, then I think those tricks are somewhat off-topic.
I concede that “brains are kinda like insanely huge 100-trillion-parameter LLMs, and that’s BOTH why we don’t have AGI yet AND why brains are (in certain senses) more sample-efficient” is a story that hangs together. And it’s a pretty popular story in LLM circles because it also fits in with scale-is-all-you-need. I really don’t think that story is right, for lots of reasons, including neuroscience stuff that I don’t want to get into, but also just, like, noticing all the ways that brains are quite different from insanely huge LLMs. There’s the continual learning stuff, the model-based RL stuff, the brain’s complete absence of “true” imitative learning, the way that cortical microcircuits simply do not look anything like transformer layers, etc.
A teenager can learn to drive in a few dozen hours; self-driving systems are trained for years on billions of miles of data. …
Steven Byrnes appears to read the gap as evidence that current algorithms are far from what the brain is doing, such that much better algorithms must be waiting to be found.
I think you’re attributing an argument to me which I wasn’t making (in the context of that post that you copied the diagram from). I agree that comparing 30 hours of teen driving practice to umpteen gazillion hours of Waymo training data is apples-and-oranges because the teen also has life experience.
But I was making a different point, which (in my own words) was: “…we don’t have AGI (artificial general intelligence) yet—not as I use the term…”. (I’m not even sure you disagree with that??)
Anyway, it is NOT the case that it’s possible to make self-driving cars by taking some generic learning algorithm that we already know about, and letting it spend the equivalent of 18 years roaming around and doing stuff in various virtual environments like VR & Minecraft, and watching YouTube videos, and reading books, and whatever, and THEN have it spend 30 hours with minimal instruction driving actual cars, and bam, now you have a human-level self-driving car. There is no generic learning algorithm today that can do that, right? If there were such an algorithm, then surely somebody would have done that already. That would have been way way way easier than what Waymo and Tesla etc. have been actually doing. So I think this example is fair game: the brain can do things that no existing AI algorithm can do, even in an apples-to-apples comparison that holds data availability fixed.
Maybe your response would be: “Oh yeah that’s easy, someone could totally do that, it’s just that nobody has bothered because the resulting AI would be too big to fit in a car computer”?? Or “Oh yeah, we totally know how to do that, it’s just that it would require more compute than would be affordable or practical at the present time”?? If so, I disagree with both of those possible objections, and we can get into why if it’s crux-y.
What people take this to mean varies widely. Steven Byrnes appears to read the gap as evidence that current algorithms are far from what the brain is doing, such that much better algorithms must be waiting to be found. His guess is that human-level, human-speed AGI will require not a datacenter but “one consumer gaming GPU,” even for training from scratch. Yarrow Bouchard on the EA Forum, reads the same gap as evidence that AGI isn’t close at all, precisely because nobody knows how to close it. Nearly opposite conclusions from the same starting observation.
I’m confused, these don’t sound “nearly opposite” to me, they sound very compatible. Did you misread something? Or maybe you’re noticing that Yarrow & I have opposite vibe and emphasis, even when we’re saying basically the same thing?
(I very strongly disagree with Yarrow about all kinds of things, but I don’t think this paragraph is pointing to an example. Here’s an example where I was partly agreeing and partly disagreeing with Yarrow on something in the vicinity of this topic.)
If you ask lots of people whether their moral preferences ought to be self-consistent, they’ll mostly say yes. If you ask lots of people whether their moral preferences are more valid after they think about them longer, after a good night’s sleep, they’ll also mostly say yes.
But also, if you ask lots of people whether it’s moral for their family to be tortured, they’ll mostly say no. And they probably won’t say that no-torture is less important than self-consistency.
Here are three (IMO reasonable) people arguing that moral deliberation / self-consistency does not straightforwardly and universally trump other ways to reach normative conclusions: Scott Alexander:
But I’m not sure I want to play the philosophy game. Maybe MacAskill can come up with some clever proof that the commitments I list above imply I have to have my eyes pecked out by angry seagulls or something. If that’s true, I will just not do that, and switch to some other set of axioms. If I can’t find any system of axioms that doesn’t do something terrible when extended to infinity, I will just refuse to extend things to infinity.
plus Stuart Armstrong here, and Joe Carlsmith discusses this a bunch (kinda arguing both sides) here & here & here.
Anyway, if we’re gonna treat CEV (and related things like Long Reflection) as meta-ethical ground truth (and not just as pragmatic projects to design a widely-acceptable ASI motivation system, per my other comment), then we have to grant moral deliberation and self-consistency a special status, NOT just “well yeah self-consistency is one of the things that people feel is good and right, along with all the other things that people feel are good and right”. And I think Arjun is asking: where would this special status come from?
It’s evidently not grounded in people’s moral intuitions, because people’s moral intuitions in favor of self-consistency are not systematically stronger or different-in-kind from people’s moral intuitions in favor of justice or whatever else. Alternatively, if we want to ground it in, like, “well they’d appreciate the value of self-consistency if they thought about it more”, then that’s circular question-begging, because it’s already granting a special status to deliberation.
I mostly agree with this (see here). My meta-ethical stance is kinda more nihilism-adjacent when compared to Eliezer (& Nate, Habryka, etc.) who are more moral-realism-adjacent. For example they’ll casually refer to “the future’s potential value” as if it’s a meaningful metric that is canonical and characteristic of humanity as a whole, not just value-from-a-particular-person’s-perspective, nor value-relative-to-a-certain-semi-arbitrary-operationalization-of-the-details-of-CEV, etc.
That said, we do face an issue that I happen to expect an ASI singleton in my lifetime, and its preferences will determine the future, for better or worse. Things like CEV / Long Reflection seem to have promise as political projects—like, flags that lots of people might feel motivated to rally around, because they all feel enthusiastic about the future that this would lead to, and which I personally also feel enthusiastic about (well, at least potentially, the details matter). They certainly seem less bad and unfair than lots of other options. Are the CEV / Long Reflection results well-defined and independent of arbitrary details of the deliberation process? My guess is: Probably not! But oh well, we have to do something, and there aren’t obviously better options.
Most of your comment seems specific to LLMs, and I don’t work on those, so no opinion.
Most of the humans whom I’ve seen put forward as moral and ethical exemplars (people who’ve foster-parented dozens or hundreds of children, donated organs to strangers, saved refugees from famine, war, persecution, or all three, spoken out against institutional violence at great personal risk, etc.) have based those actions on something closer to a virtue ethical or deontological framework than a consequentialist utilitarian one.
This might be tangential to your larger point, but based on your list of examples, I think you (like most people) are implicitly using virtue-ethics as a rubric to judge which humans are most praiseworthy. So it’s no surprise that the winners are generally acting out of virtue-ethics. By contrast, if you ask a utilitarian which humans are most praiseworthy, they would be less likely to mention the foster-parents etc., and much more likely to mention, like, Norman Borlaug, Bill Gates, these people, etc. And I would guess that those latter people would be somewhat more consequentialist-utilitarian than average in how they choose their actions. (That’s just a guess, I don’t know much about most of them, except that I watched a biopic of Bill Gates once and he didn’t come across as extremely stereotypically virtuous.)
(I’m making a narrow point that you used a circular argument, I am not trying to imply here that AIs should or shouldn’t be virtuous. But see this comment.)
Thanks. I just edited the OP to say that my original text might be an overstatement.
I still think the stopgap plan doesn’t help me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limited number of goals (they kinda have to be simple, concrete, “atomic”, and/or directly related to people’s feelings, and/or have a ground truth that can be calculated explicitly, more-or-less). This thing we’re talking about here (involving a distinction between the supervisor’s instrumental vs terminal goals) is pretty complex and abstract, and not something I have any good idea of how to install as a goal / motivation, alas.
LLMs are pretty different, no comment on that.
I feel like some of the stuff about “nitpicking” / “non central objections” / principle of charity / etc. is people talking past each other regarding two different things.
The FIRST THING is “non-load-bearing errors”. An unusually clear-cut example would be: Alice publishes a math proof, and the summation in equation (17) starts from 0 when it’s supposed to start from 1. It’s kinda obvious from context that it’s supposed to start from 1, and the proof as a whole would be valid once that’s corrected, but it’s still an error as written. Bob reads the manuscript and points out the mistake to Alice.
The SECOND THING is “Gricean failures”. An unusually clear-cut example would be: Alice says “I need to fill my car with gas”, and Bob says, “Well, no, you mean fill the car’s gas tank with gas. You’re not gonna be closing the doors and pouring gasoline through the windows onto the seats!!”
Hopefully we can all agree that Bob is being helpful in the first example and unhelpful (and annoying) in the second example. Outside of formal contexts like math, communication is always hard, and always involves imperfect analogies, ambiguities, etc. The speaker can and should do what they can to make the listener’s job easier, but ultimately the listener will inevitably need to apply at least some interpretive effort, using the principle of charity, to figure out what the speaker probably intended. Hence Grice’s maxims.
(I think my two chosen examples are at opposite extremes of a spectrum, with shades of gray in between, as opposed to “non-load-bearing errors” versus “Gricean failures” being two discrete categories.)
So anyway, I feel like at least some of the dispute is that some people are accusing Said of doing the annoying & unhelpful second thing (“Gricean failures”), and then the OP (and Said himself) are reacting with horror to the idea that people don’t want to be apprised of the first thing (“non-load-bearing errors”).
(I’m not very familiar with Said (I don’t recall him commenting on my posts ever?) so don’t have a very strong opinion either way, but I just read the famous 2018 comment on “Zetetic Explanation”, and I think I’d vote for this comment being an example of the bad second thing, not the good first thing.)
I can’t find the book you’re thinking of. :( [Could it be this one??]
I think the right starting point is not whether something is an LLM, or deep learning, but rather what are the inputs, outputs, loss functions, etc.? And then go from there to whether we expect slight-niceness or not.
My own opinion (stated without justification) is: you can get niceness through LLM-style “true” imitation learning (Foom & Doom §2.3.2). Alternatively, if the AI is choosing actions through RL and/or model-based search & planning, rather than through imitation learning, then I expect zero-niceness, and instead the ruthless pursuit of the objective, or of something vaguely related to the objective, with ample specification gaming and so on (e.g. “be helpful” gets ruthless-ified into “come across as helpful”).
…Except that there exist weird objective / reward / cost functions that don’t have that property, but rather support niceness. And humans wound up with such a function via evolution doing an outer-loop search over reward functions in a certain type of environment where niceness was advantageous. In principle, future AI programmers could likewise do an outer-loop search over reward functions, but they probably won’t, because any kind of outer-loop search over scaled-up learning algorithms is hella expensive. If they do it at all, it would be a situation where the programmer crafted the reward function up to a handful of adjustable parameters, and then the outer-loop search would be a kind of hyperparameter tuning. And then the alignment challenges would be (1) crafting a reward function (up to the handful of unknown adjustable parameters) that supports niceness, and (2) figuring out what the outer-loop test environment and selection criterion are, such that the selected reward function hyperparameters will lead to niceness towards humans in the real post-ASI world despite the wild distribution shift from the test environment. That’s basically what I’m working on, and I claim that not only are these open problems but that all the ideas in the literature will almost definitely fail.
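For readers who want the structure spelled out, here is a purely illustrative toy of “reward function crafted up to a handful of adjustable parameters, plus an outer-loop search over those parameters”. All the names and numbers (`ACTIONS`, `w_task`, `w_harm`, the selection criterion) are hypothetical, and the “inner loop” is a trivial stand-in for an expensive learning run. It shows only the shape of the setup, not a solution to either challenge, and it says nothing about the distribution shift from test environment to real world.

```python
from itertools import product

# Hypothetical toy actions with made-up feature values.
ACTIONS = {
    "grab_everything": {"task": 1.0, "harm": 1.0},
    "cooperate":       {"task": 0.7, "harm": 0.0},
    "do_nothing":      {"task": 0.0, "harm": 0.0},
}

def reward(action, params):
    """Programmer-crafted reward, fixed up to two adjustable parameters."""
    f = ACTIONS[action]
    return params["w_task"] * f["task"] - params["w_harm"] * f["harm"]

def inner_loop_train(params):
    """Stand-in for a full learning run: the agent ends up doing
    whatever maximizes the reward it was given."""
    return max(ACTIONS, key=lambda a: reward(a, params))

def selection_criterion(action):
    """Outer-loop score, measured only in the test environment."""
    f = ACTIONS[action]
    return f["task"] - 2.0 * f["harm"]

def outer_loop_search(grid):
    """Hyperparameter tuning: try each parameter setting, run the inner
    loop, and keep the first setting whose behavior scores best."""
    best_params, best_score = None, float("-inf")
    for w_task, w_harm in product(grid["w_task"], grid["w_harm"]):
        params = {"w_task": w_task, "w_harm": w_harm}
        score = selection_criterion(inner_loop_train(params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"w_task": [0.5, 1.0], "w_harm": [0.0, 0.5, 1.0]}
best_params, best_score = outer_loop_search(grid)
print(best_params, inner_loop_train(best_params))
```

Note that each outer-loop step here costs one call to `inner_loop_train`; in the real setting each such call would be a scaled-up training run, which is why I expect the search to be limited to a handful of parameters at most.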