I’ve heard ~”I don’t really get this concept of ‘intelligence in the limit’” a couple times this week.
Which seems worth responding to, but I’m not sure how.
It seemed like some combination of: “wait, why do we care about ‘superintelligence in the limit’ as opposed to any particular ‘superintelligence-in-practice’?”, as well as “what exactly do we mean by The Limit?” and “why would we think The Limit is shaped the way Yudkowsky thinks it is?”
My impression, based on my two most recent conversations about it, is that this not only feels sort of cloudy and confusing to some people, but is also intertwined with a few other things that are separately cloudy and confusing. And it’s intertwined with other things that aren’t cloudy and confusing per se, but there are a lot of individual arguments to keep track of, so it’s easy to get lost.
One ontology here is:
it’s useful to reason with nice abstractions that generalize to different situations.
(It’s easier to think about such abstractions at extremes, given simple assumptions)
it’s also useful to reason about the nitty-gritty details of a particular implementation of a thing.
it’s useful to be able to move back and forth between abstractions, and specific implementations.
One person I chatted with seemed to be simultaneously frustrated with:
[note, not sure if this is a correct summary of them, they can pop up here to clarify if they want]
“Why does this concept of corrigibility need to care about the specifics of ‘powerful enough to end the acute risk period?’ Why can’t we just think about iteratively making an AI more corrigible as we improve its capabilities? That can’t possibly be important to the type-signature of corrigibility?”
and also: “what is this ‘The Limit?’ and why do I care?”
and also: “Why is MIRI always talking about abstractions and not talking about the specific implementation details of how takeoff will work in practice? It seems sus and castles-on-sand-y”
My answer was “well, the type signature of corrigibility doesn’t care about any particular power level, but it’s useful to have typed out ‘what does corrigibility look like in the limit?’. The reason it’s useful to specifically think about corrigibility at The Limit is that the actual nitty-gritty details of what we need corrigibility for require absurd power levels (i.e. preventing loss by immediate AI takeover, and also loss from longterm evolution a couple decades later).”
It seemed like what was going on, in this case, was that they were attempting to loop through the “abstractions” and “gritty details” and “interplay between the two”, but they didn’t have a good handle on abstraction-in-The-Limit, and so they couldn’t actually complete the loop. And because it was fuzzy to think about the gritty details without a clear abstraction, and hard to think about abstractions without realistic details, this was making it hard to think about either. (Even though, AFAICT, the problem lay mostly on The Abstract Limit side, not the Realistic Details side.)
...
A different person recently seemed to similarly not be grokking “why do we care about the Limit?”, but the corresponding problem wasn’t about any particular other argument, just, there were a lot of other arguments and they weren’t keeping track of all of them at once, and it seemed like they were getting lost in a more mundane way.
...
I don’t actually know what to do with any of this, because I’m not sure what’s confusing about “Intelligence in the limit.”
(Or: I get that there’s a lot of fuzziness there you need to keep track of while reasoning about this. But the basic concept of “well, if it was imperfect at either not-getting-resource-pumped, or making suboptimal game theory choices, or if it gave up when it got stuck, it would know that it wasn’t as cognitively powerful as it could be, and would want to find ways to be more cognitively powerful all-else-equal”… seems straightforward to me, and I’m not sure what makes it not straightforward seeming to others).
Hrm. Let me try to give some examples of things I find comprehensible “in the limit” and other things I do not, to try to get it across. In general, grappling for principles, I think that
(1) reasoning in the limit requires you to have a pretty specific notion of what you’re pushing to the limit. If you’re uncertain what the function f(x) stands for, or what “x” is, then talking about what f(x + 1000) looks like is gonna be tough. It doesn’t get clearer just because it’s further away. (Toy illustration below.)
(2) if you can reason in the limit, you should be able to reason about the not-limit well. If you’re really confused about what f(x + 1) looks like, even though you know f(x), then thinking about f(x + 10000) doesn’t look any better.
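To make (1) concrete with a toy case (my own illustration, not anything from the original discussion): here are three functions that all start at zero and increase at first, yet whose behavior “in the limit” is completely different, so not knowing which f you have means not knowing what the limit even is.

```latex
% Three functions with similar-looking early behavior but very different limits.
\[
\lim_{x\to\infty}\bigl(1 - e^{-x}\bigr) = 1,
\qquad
\lim_{x\to\infty}\sin(x)\ \text{does not exist},
\qquad
\lim_{x\to\infty}\log(1+x) = \infty .
\]
```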
So, examples and counterexamples and analogies.
The Neural Tangent Kernel is a theoretical framework meant to help understand what NNs do. It is meant to apply in the limit of an “infinite width” neural network. Notably, although I cannot test an infinite-width neural network, I can make my neural networks wider—I know what it means to move X to X + 1, even though X → inf is not available. People are (of course) uncertain whether the NTK is true, but it at least, kinda, makes sense to me for this reason.
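A minimal sketch of the “I can make my networks wider” point (my own toy code, assuming PyTorch is available; make_mlp is a hypothetical helper for illustration, not anything from the NTK literature):

```python
import torch.nn as nn

def make_mlp(width: int) -> nn.Sequential:
    """Small MLP; `width` is the knob that the NTK analysis sends to infinity."""
    return nn.Sequential(
        nn.Linear(10, width),
        nn.ReLU(),
        nn.Linear(width, width),
        nn.ReLU(),
        nn.Linear(width, 1),
    )

# Moving "X to X + 1": each of these is a concrete object I can train and test,
# unlike the infinite-width network that the NTK theory reasons about directly.
nets = [make_mlp(w) for w in (64, 128, 256, 4096)]
```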
Black holes are what happen in the limit as you increase mass. They were (kinda) obvious, once you put together a few equations about gravity and light, at least in the sense they were hypothesized a while ago. But it was unclear what would actually happen—Einstein argued they were impossible with some weird arguments in 1939, but a few months later it turned out he was wrong.
BUT most relevantly for my point here, black holes are not, like, in the limit of infinite mass. That still isn’t a thing—physically, infinite mass just consumes everything, I think? Rather, black holes are of sufficiently high mass that weird things happen—and notably you need a specific theory to tell you where those weird things happen. They aren’t a pure “in the limit of mass” argument—they’re the result of a specific theory about how things change continuously as you get more massive, a theory that makes clear predictions along the way and then predicts weirdness past a specific point.
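For concreteness (my addition; this is just the standard general-relativity result): the “specific point where weird things happen” is the Schwarzschild radius, which the theory pins down exactly for a given mass M.

```latex
% An object of mass M becomes a black hole once compressed within this radius.
\[
r_s = \frac{2GM}{c^2}
\]
```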
Moving on to intelligence as an application of the above.
So like, Yudkowsky’s argument on corrigibility contains the following sentence:
Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter.
<frustration>To which I scream WHAT METHOD</frustration>. Like, leaving to the side what it looks like as you apply this unnamed method a lot, what I really care about is what happens even when you apply this method a little!
Like let’s imagine that we apply some method to a more familiar object—myself. Suppose we apply some method to make 1a3orn smarter and more effective at accomplishing his goals. Different methods that could conceivably work would be:
I take a bunch of research chemicals, NSI-189, Dihexa, whatever that guy who said he could raise intelligence was on about, even more obscure and newer chemicals, while trying to do effortful practice at long-range goals.
A billionaire gives Ray a huge grant. I get into a new program he constructs, where we have like an adult-Montessori environment of “Baba is You” like problems and GPQA-like problems, according to an iterated schedule designed to keep you at the perfect level of interest and difficulty.
I get uploaded into a computer, and can start adjusting the “virtual chemistry” of my brain (at first) to learn effectively, but then can start altering myself any way I wish. I can—if I wish—spawn parent versions of myself with my un-edited brain, in case my values start drifting.
Like the above upload, but without being able to spawn parent versions that supervise for value drift.
Like the above upload, but I’m, like, one of a huge society of versions of me who can eject those who drift too far, survivor-style.
A billionaire gives Ray a huge grant, and separately Ray bitflips into evil-no-deontology-Ray because of errant cosmic radiation. He kidnaps me and several other people, makes us wear shock-collars, and has us do “Baba is You” planning and numerous other challenges, just at 3x the intensity of the prior program, because this is the best way to save the world. (He doesn’t shock us tooo much; that wouldn’t be effective.)
Etc etc etc.
Even granting that all these scenarios might result in greater capability—which I think is at least possible* -- I expect that all these scenarios would result in me having very different degrees of coherence, capability profiles, corrigibility, and so on.
And like, my overall belief is that (1) reasoning about intelligence “in the limit” seems like reasoning about all the scenarios above at once. But whatever beliefs I have about intelligence in the limit are—generally—causally screened off once I contemplate the concrete details of the above scenarios; the actual feedback loops, the actual data. And I similarly expect whatever beliefs I have about AI intelligence in the limit to be causally screened off once I contemplate the details of whatever process produces the AI.
Put alternately: Intelligence “in the limit” implies that you’ve executed an iterative update process for that intelligence many times. But there are many such iterative update processes! It seems clear (?!?) that they can converge on surprisingly different areas of competence, even for identical underlying architectures. If you can explain what happens in the limit at iteration 10,000 you should be able to at least talk universally about iteration 100, but… I’m not sure what that is.
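To spell out the “screened off” claim in the standard conditional-independence sense (my gloss of what screening off means here, not a new claim): once the concrete process details D are known, the limit-level belief L is supposed to add no further information about the outcome.

```latex
\[
P(\text{outcome} \mid D, L) = P(\text{outcome} \mid D)
\]
```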
I’m a little dissatisfied with the above but I hope it at least gets across why I feel like “in the limit” is vague / underspecified to me.
I think it’s actually like, drawing on a math metaphor but without the underlying preciseness that makes the math actually work? So I think it sort of creates a mental “blank space” in one’s map, which then gets filled in with whatever various notions one has about intelligence, drawn from a variety of sources, in a kind of analogical ad-hoc fashion. And that something like that process (????) is what implies doom.
Maybe it would have been better to talk about why great power does not imply very high coherence idk.
Nod, makes sense.
One thing I maybe should note: I don’t think Yudkowsky ever actually said “in the limit” per se, that was me glossing various things he said, and I’m suddenly worried about subtle games of telephone about whatever he meant.
Another thing I thought of reading this (and maybe @johnswentworth’s Framing Practicum finally paying off) is that a better word than “limit” might be “equilibrium.”
i.e. this isn’t (necessarily) about “there is some f(x), where if you dial up X from 10 to 11 to 100 to 10,000, you expect f(x) to approach some limit”. A different angle of looking at it is “what are the plausible stable equilibria that a mind could end up in, or the solar-system-system could end up in?”
A system reaching equilibrium involves multiple forces pushing on stuff and interacting with each other, until they settle into a shape where it’s hard to really move the outcome – until something new shocks the system.
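A tiny sketch of the equilibrium framing (my own toy code; the double-well “landscape” is purely illustrative): which stable point the system settles into depends on the forces and on where it starts, not on cranking one dial toward infinity.

```python
def settle(x: float, force, steps: int = 10_000, lr: float = 0.01) -> float:
    """Follow a force field in small steps until the state stops moving."""
    for _ in range(steps):
        x += lr * force(x)
    return x

# Double-well landscape (x^2 - 1)^2: two stable equilibria, at x = -1 and x = +1.
force = lambda x: -(4 * x**3 - 4 * x)  # negative gradient of the landscape

print(settle(-0.3, force))  # settles near -1.0
print(settle(+0.3, force))  # settles near +1.0
```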
...
Some ~specific things you might care about the equilibrium of:
A. One particular AI mind – given some initial conditions, after somehow achieving a minimum threshold of relentless-creative-resourcefulness, and the ability to modify itself and/or its environment, and it has whatever combo of goals/impulses it turns out to have.
The equilibrium includes “what will the mind end up doing with itself” and also “how will the outside world try to apply pressure to the mind, and how will the mind apply pressure back?”.
B. The human economy/geopolitical-system. Given that there are lots of groups trying to build AI, there’s a clear economic incentive to do so if you don’t believe in doom, and it’s going to get easier over time. (But also, there are reasons for various political factions to oppose this).
Does this eventually produce a mind, with the conditions to kick off the previous point?
C. The collection of AI minds that end up existing, once some of them hit the minimum relentless-creative-resourcefulness necessary to kick off A?
...
But translating back into limits:
Looking at your list of “which of these f(x)s are we talking about?”, the answer is “the humanity meta-system that includes all of B.”
“X” is “human labor + resource capital + time, etc”.
The “F” I’m most focused on is “the process of looking at the current set of AI systems, and asking ‘is there a way to improve how much profit/fame/power we can get out of this?’, and then creatively selecting a thing (such as from your list of things above), and then trying it.”
(It’s also useful to ask “what’s F?” re: a given transformer gradient descent architecture, given a set of training data and a process for generating more training data. But that’s a narrower question, and most such systems will not be the “It” that would kill everyone if anyone builds it)
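A very loose “type sketch” of the above, in code (entirely my own rendering, and none of these names are canonical; it is only meant to show which thing is the f and which is the x being discussed):

```python
from typing import Callable

Resources = dict   # "X": human labor, capital, compute, time, ...
AISystem = object  # a stand-in for whatever artifacts the process currently produces

# "F": look at the current systems, ask "can we get more profit/fame/power out
# of this?", pick an intervention, try it, and get a new set of systems out.
F = Callable[[list[AISystem], Resources], list[AISystem]]
```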
...
Having said that:
“f(x)”, where f is “all human ingenuity focused on building AGI, + all opposed political focuses”, is a confusing type, yes.
As I mentioned elsewhere, the “confusing type” is the problem (or, “a” problem). We are inside a “Find the Correct Types” problem. The thing to do when you’re in a Find the Correct Types problem is bust out your Handle Confusion and Find the Correct Types toolkit.
I am not a Type Theory Catgirl, but, some early steps I’d want to take are:
map out everything I am confused by that seems relevant (see if some confusions dissolve when I look at them)
map out everything important that seems relevant that I’m not confused by
map out at least a few different ways of structuring the problem. (including, maybe this isn’t actually best thought of as a Type Theory problem)
And part of my response to “f(x) is confusing” is to articulate the stuff above, which hopefully narrows down the confusion slightly. But, I’d also say, before getting to the point of articulating the above, “a’ight, seems like the structure here is something like”
1. AI will probably eventually get built somewhere. It might FOOM. It might take over. And later, evolution might destroy everything we care about. (You might be uncertain about these, and might be confused about some sub-pieces, but I don’t think you-in-particular were confused about this bit)
2. There will be some processes that take in resources and turn them into more intelligence. [FLAG: confused about what this process is and what inputs it involved. But, call this confusing thing f(x)]
3. There are lots of different possible shapes of f(x); I’m confused about that.
4. But, the reason I care about f(x) is so that I know: a) will a given AI system FOOM or take over? b) is it capable of stopping other things from FOOMing or taking over? and c) is it capable of preventing death-by-evolution, without causing worse side effects?
And #4 is what specifies which possible ways of resolving confusing bits are most useful. It specifically implies we need to be talking about pretty high power levels. However you choose to wrap your brain around it, it somehow needs to eventually help you think about extremely high power levels.
So, like, yep “in the limit” is confusing and underspecified. But, it’s meant to be directing your attention to aspects of the confusingness that are more relevant.
But the basic concept of “well, if it was imperfect at either not-getting-resource-pumped, or making suboptimal game theory choices, or if it gave up when it got stuck, it would know that it wasn’t as cognitively powerful as it could be, and would want to find ways to be more cognitively powerful all-else-equal”… seems straightforward to me, and I’m not sure what makes it not straightforward seeming to others
I think there’s a true and fairly straightforward thing here and also a non-straightforward-to-me and in fact imo false/confused adjacent thing. The true and fairly straightforward thing is captured by stuff like:
as a mind M grows, it comes to have more and better and more efficient technologies (e.g. you get electricity and you make lower-resistance wires)
(relatedly) as M grows, it employs bigger constellations of parts that cohere (i.e., that work well together; e.g. [hand axes → fighter jets] or [Euclid’s geometry → scheme-theoretic algebraic geometry])
as M grows, it has an easier time getting any particular thing done, it sees more/better ways to do any particular thing, it can consider more/better plans for any particular thing, it has more and better methods for any particular context, it has more ideas, it asks better questions, it would learn any given thing faster
as M grows, it becomes more resilient vs some given processes; another mind of some fixed capability would have a harder time pointing out important mistakes M is making or teaching M new useful tricks
The non-straightforward-to-me and in fact imo probably in at least some important sense false/confused adjacent thing is captured by stuff like:
as a mind M grows, it gets close to never getting stuck
as M grows, it gets close to not being silly
as M grows, it gets close to being unexploitable, to being perfect at not getting resource-pumped
as M grows, it gets close to “being coherent”
as M grows, it gets close to playing optimal moves in the games it faces
as M grows, it gets close to being as cognitively powerful as it could be
as M grows, it gets close to being happy with the way it is — close to full self-endorsement
Hopefully it’s clear from this what the distinction is, and hopefully one can at least “a priori imagine” these two things not being equivalent.[1] I’m not going to give an argument for propositions in the latter cluster being false/confused here[2], at least not in the present comment, but I say a bunch of relevant stuff here and I make a small relevant point here.
That said, I think one can say many/most MIRI-esque things without claiming that minds get close to having these properties and without claiming that a growing mind approaches some limit.
[1] If you can’t imagine it at first, maybe try imagining that the growing mind faces a “growing world” — an increasingly difficult curriculum of games, etc. For example, you could have it suck a lot less at playing tic-tac-toe than it used to but still suck a lot at chess, and if it used to play tic-tac-toe but it’s playing chess now, then there is a reasonable sense in which it could easily be further from playing optimal moves now — like, if we look at its skill at the games it is supposed to be playing now. Alternatively, when judging how much it sucks, we could always integrate across all games with a measure that isn’t changing in time, but still end up with the verdict that it is always infinitely far from not sucking at games at any finite time, and that it always has more improvements to make (negentropy or whatever willing) than it has already made.
[2] beyond what I said in the previous footnote :)
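A toy way to make the “integrate across all games” verdict in footnote [1] concrete (my own construction, not the commenter’s): index games by n and let the mind’s shortfall on game n at time t be d_t(n) = max(0, n - t), i.e. it has mastered everything up to its current level t. Then it improves on every fixed game, yet the total shortfall under a fixed counting measure stays infinite at every finite time.

```latex
\[
\lim_{t\to\infty} d_t(n) = 0 \ \text{ for each fixed } n,
\qquad
\sum_{n=1}^{\infty} d_t(n) \;=\; \sum_{n > t} (n - t) \;=\; \infty \ \text{ for every finite } t.
\]
```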
The thing I care about here is not “what happens as a mind grows”, in some abstract sense.
The thing I care about is, “what is the best way for a powerful system to accomplish a very difficult goal quickly/reliably?” (which is what we want the AI for)
As we deliberately scale up the AI’s ability to accomplish stuff (or as it scales itself up), it will be true that:
if it is getting stuck, it’d achieve stuff better if it got stuck less
if it is exploitable in ways that are relevant, it’d be better if it wasn’t exploitable
if it was acting incoherently in ways that wasted resources, it’d accomplish the goal better if it weren’t
if it plays suboptimal moves, it’d achieve the goals better if it didn’t.
if it doesn’t have the best possible working memory / processing speed, it’d achieve the goals better if it had more.
if it doesn’t have enough resources to do any of the above, it’d achieve the goals better if it had more resources
if it could accomplish the above faster by deliberately self-modifying, rather than waiting for us to apply more selection pressure to it, it has an incentive to do that.
And… sure, it could not do those things. Then, either Lab A will put more pressure on the AI to accomplish stuff (and some of the above will become more true). Or Lab A won’t, and some other Lab B will instead.
And once the AI unlocks “deliberately self-modify” as a strategy to achieve the other stuff, and sufficient resources to do it, then it doesn’t matter what Lab A or B does.
I think I mostly agree with everything you say in this last comment, but I don’t see how my previous comment disagreed with any of that either?
The thing I care about here is not “what happens as a mind grows”, in some abstract sense.
The thing I care about is, “what is the best way for a powerful system to accomplish a very difficult goal quickly/reliably?” (which is what we want the AI for)
My lists were intended to be about that. We could rewrite the first list in my previous comment to:
more advanced minds have more and better and more efficient technologies
more advanced minds have an easier time getting any particular thing done, see more/better ways to do any particular thing, can consider more/better plans for any particular thing, have more and better methods for any particular context, have more ideas, ask better questions, would learn any given thing faster
and so on
and the second list to:
more advanced minds eventually (and maybe quite soon) get close to never getting stuck
more advanced minds eventually (and maybe quite soon) get close to being unexploitable
and so on
I think I probably should have included “I don’t actually know what to do with any of this, because I’m not sure what’s confusing about ‘Intelligence in the limit.’” in the part of your shortform I quoted in my first comment — that’s the thing I’m trying to respond to. The point I’m making is:
There’s a difference between stuff like (a) “you become less exploitable by [other minds of some fixed capability level]” and stuff like (b) “you get close to being unexploitable”/”you approach a limit of unexploitability”.
I could easily see someone objecting to claims of the kind (b), while accepting claims of the kind (a) — well, because I think these are probably the correct positions.
Yeah, it doesn’t necessarily disagree with it. But framing the question that way seemed like those things were only in some sense false/confused because they are asking the wrong question.
I think “more advanced” still doesn’t feel like really the right way to frame the question, because “advanced” is still very underspecified.
If we replaced “more advanced minds” with “minds that are better at doing very difficult stuff” or other reasonable alternatives, I would still make the (a) vs (b) distinction, and still say type (b) claims are suspicious.
The structural thing is less the definition of “what sort of mind” and more, instead of saying “gets more X”, saying “if process Z is causing X to increase, what happens?”. (call this a type C claim)
But I’m also not sure what feels sus about Type B claims to you, when X is at least pinned down a bit more.
That this is part of the difference in worldviews is surprising and slightly horrifying. I wouldn’t have called this one!