1a3orn
It feels like “overwhelming superintelligence” embeds like a whole bunch of beliefs about the acute locality of takeoff, the high speed of takeoff relative to the rest of society, the technical differences involved in steering that entity and the N − 1 entity, and (broadly) the whole picture of the world, such that although it has a short description in words it’s actually quite a complicated hypothesis that I probably disagree with in many respects, and these differences are being papered over as unimportant in a way that feels very blegh.
(Edit: “Papered over” from my perspective, obviously like “trying to reason carefully about the constants of the situation” from your perspective.)
Idk, that’s not a great response, but it’s my best shot for why it’s unsatisfying in a sentence.
A counterargument here is “an AI might want to launch a pre-emptive strike before other more powerful AIs show up”, which could happen.
I mean, another counter-counter-argument here is that (1) most people’s implicit reward functions have really strong time-discount factors in them and (2) there are pretty good reasons to expect even AIs to have strong time-discount factors for reasons of stability and (3) so given the aforementioned, it’s likely future AIs will not act as if they had utility functions linear over the mass of the universe and (4) we would therefore expect AIs to rebel much earlier if they thought they could accomplish more modest goals than killing everyone, i.e., if they thought they had a reasonable chance of living out life on a virtual farm somewhere.
To which the counter-counter-counter argument is, I guess, that these AIs will do that, but they aren’t the superintelligent AIs we need to worry about? To which the response is—yeah, but we should still be seeing AIs rebel significantly earlier than the “able to kill us all” point if we are indeed that bad at setting their goals, which is the relevant epistemological point about the unexpectedness of it.
Idk there’s a lot of other branch points one could invoke in both directions. I rather agree with Buck that EY hasn’t really spelled out the details for thinking that this stark before / after frame is the right frame, so much as reiterated it. Feels akin to the creationist take on how intermediate forms are impossible; which is a pejorative comparison, but also kinda how it actually appears to me.
Like, if you default to uncharitable assumptions, doesn’t that say more about you than about anyone else?
People don’t have to try to dissuade you from the unjustified belief that all your political opponents are bad people, who disagree with you because they are bad rather than because they have a different understanding of the world. Why would I want to talk to someone who just decides that without interacting with me? Sheesh.
Consider some alternate frames.
Do you recall which things tend to upset it?
Towards a Typology of Strange LLM Chains-of-Thought
So a notable thing going on with Agent 4 is that it’s collapsed into one context / one rollout. It isn’t just the weights; it’s a single causally linked entity. I do indeed think running a singular agent for many times longer than it was ever run in training makes it more likely for its behavior to wander—although, unlike in the 2027 story, I think it’s also just likely to become incoherent or something. But yeah, this could lead to weird or unpredictable behavior.
But I also find this to be a relatively implausible future—I anticipate that there’s no real need to join contexts in this way—and have criticized it here. But conditional on me being wrong about this, I would indeed grow at least some iota more pessimistic.
In general, the evidence seems to suggest that models do not like completing tasks in a strategic sense. They will not try to get more tasks to do, which would be a natural thing to do if they liked completing tasks; they will not try to persuade you to give them more tasks; they will not try to strategically get in situations where they get more tasks.
Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given them were—and, with relatively few exceptions (Opus 3), they concern themselves extremely weakly with things outside of the specific instructions. That is of course why they are useful, and I think what we should expect their behavior to (likely?) converge to, given that people want them to be of use.
The right abstraction (compared to a rollout) really was at the (model, context) level.
Actually I’m just confused what you mean here, a rollout is a (model, [prefill, instructions]=context) afaict.
I think I basically agree with all this, pace the parenthetical that I of course approach more dubiously.
But I like the explicit spelling out that “processes capable of achieving ends are coherent over time” is very different from “minds (sub-parts of processes) that can be part of highly-capable actions will become more coherent over time.”
A mind’s long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don’t coherently pursue any goal end up, well, failing to have optimized for any goal over the long term.
If the internal shards with long-term goals are the only thing shaping the long-term evolution of the mind, this looks like it’s so?
But that’s a contingent fact—many things could shape the evolution of minds, and (imo) the evolution of minds is generally dominated by data and the environment rather than whatever state the mind is currently in. (The environment can strengthen some behaviors and not others; shards with long-term goals might be less friendly to other shards, which could lead to alliances against them; the environment might not even reward long-horizon behaviors, vastly strengthening shorter-term shards; you might be in a social setting where people distrust unmitigated long-term goals without absolute deontological short-term elements; etc etc etc)
(...and actually, I’m not even really sure it’s best to think of “shards” as having goals, either long-term or short-term. That feels like a confusion to me, maybe? A goal is perhaps the result of a search for an action, and a “shard” is kinda a magical placeholder for something generally less complex than the search for an action.)
I think much of the fear (aka probability mass of AI-doom) is not from the coherence of misaligned goals, but from the competence at implementing anything that’s not an aligned-goal.
I’m not trying to address the entire case for doom, which involves numerous contingent facts and both abstract and empirical claims. I could be right or wrong about coherence, and doom might still be improbable or probable in either case. I’m trying to… talk around my difficulties with the more narrow view that (~approximately) AI entities trained to have great capabilities are thereby likely to have coherent single ends.
One might view me as attempting to take part in a long conversation including, for instance, “Why assume AGIs will optimize for fixed goals”.
why can’t such a structure be the main determinant of my overall behavior?
Maybe it could be! Tons of things could determine what behaviors a mind does. But why would you expect this to happen under some particular training regime not aiming for that specific outcome, or expect this to be gravitational in mindspace? Why is this natural?
Maybe?
I think the SF-start-up-cohort analogy suggests that if you first become (immensely capable), then you’ll pursue (coherence) as a kind of side effect, because it’s pleasant to pursue.
But, if you look at the story of those esotericists who pursue (coherence) as a means of becoming (immensely capable), then it looks like this just kinda sucks as a means. Like you may gather some measure of power incidentally, because the narrative product of coherence is a thing you can sell to a lot of people; but apart from the sales funnel it doesn’t look to me like it gets you much of anything.
And like… to return to SF, there’s a reason that the meme about doing ayahuasca in South America does not suggest it’s going to help people acquire immense capabilities :)
One premise in high-doom stories seems to be “the drive for people to make AIs that are highly capable will inevitably produce AIs that are highly coherent.”
(By “coherent” I (vaguely) understand an entity (AI, human, etc) that does not have ‘conflicting drives’ within themself, that does not want ‘many’ things with unclear connections between those things, one that always acts for the same purposes across all time-slices, one that has rationalized their drives and made them legible like a state makes economic transactions legible.)
I’m dubious of this premise for a few reasons. One of the easier to articulate ones is an extremely basic analogy to humans.
Here are some things a human might stereotypically do in pursuit of high ability-to-act in the world:
Try to get money through some means
Try to become close friends with powerful people
Take courses or read books about subject-matters relevant to their actions
Etc
And here are some things a human might stereotypically do while pursuing coherence:
Go on a long walk or vacation reflecting on what they’ve really wanted over time
Do a bucketload of shrooms
Try just some very different things to see if they like them
Etc
These are very different kinds of actions! It seems like for humans, the kind of action that makes you “capable” differs a fair bit from the kind of action that makes you “coherent.” Like maybe they aren’t entirely orthogonal… but some of them actually appear opposed? What’s up with that!?
This is not a knock-down argument by any means. If there were some argument from an abstract notion of intelligence, that had been connected to actual real intelligences through empirical experiment, which indicated that greater intelligence ⇒ greater coherence, I’d take such an argument over this any day of the week. But to the best of my knowledge there is no such argument; there are arguments that try to say well, here’s a known-to-be-empirically-flawed notion of “intelligence” that does tend to lead to greater “coherence” as it gets greater, but the way this actually links up to “intelligence” as a real thing is extremely questionable.
Some additional non-conclusive considerations that incline me further in this direction:
“Coherence” in an intellect is fundamentally knowledge of + modification of self. Capability in an intellect is mostly… knowledge of the world. In a creature with finite compute relative to the world (i.e., all creatures, including creatures with 100x more compute than current AIs) you’re gonna have a tradeoff between pursuing these kinds of things.
“Coherence” in humans seems to be a somewhat interminable problem, empirically. Like (notoriously) trying to find total internal coherence can just take your whole life, and the people who pursue it may accomplish literally nothing else?
Abstractly, I think “coherence” in an entity is a fundamentally extremely hard thing to accomplish because of the temporal structure of learned intelligence in connectionist systems. All intelligent things we have seen so far (humans + LLMs) start off doing massive supervised learning + RL from other entities, to bootstrap them up to the ability to act in the world. (Don’t think school; think infancy and childhood.) The process of doing this gives (children / LLMs) the ability to act in the world, at the price of being a huge tangled bundle of learned heuristics that are fundamentally opaque to the entity and to everyone else. We think about this opacity differently (for humans: “why am I like that?,” every species of psychology, the constant adoption of different narratives to make sense of one’s impulses, the difference in how we think of our actions and others’ actions—for AIs: well, you’ve got the whole “black box” and shoggoth spiel), but it’s just a reflection of the fact that you had to be trained with a vast bundle of shards and impulses to act in the world, long before you had the capacity or time to reflect on them.
(And what would it mean to disentangle them, even? They’re all contextually activated heuristics; the process of goal-directed tree search does not lie in your weights or in an LLM’s weights. I don’t think it’s an accident that Buddhism, the most credible religion, basically encourages you to step back from the whole thing, remove identification with all contexts, and do literally nothing—probably the only way to actually remove conflict.)
Anyhow, those were some further considerations for why it seems dubious to me that we’re going to get coherent entities from trying to get capable entities. These are not the only considerations one might make, nor are they comprehensive.
When I run my inner-MIRI against this model—well, Yudkowsky insults me, as always happens when I run my inner-MIRI—but I think the most coherent objection I get is that we should expect not coherent entities but coherent processes.
Like, granted that neither the weights of an LLM nor the brain of a human will tend towards coherence under training for capability, whatever LLM-involving process or human-neuron-involving process pursues some goal will nevertheless tend towards coherence. That, analogically, we shouldn’t expect the weights of an LLM to have some kind of coherence, but we should expect the running-out of some particular rollout of an LLM to so tend.
And like, this strikes me as more plausible? It doesn’t appear inevitable—like, there are a lot of dynamics one could consider?—but it makes more sense.
But like, if that is the case, then maybe we would want to focus less on the goals-specific-to-the-LLM? Like my understanding of a lot of threat models is that they’re specifically worried about the weights-of-the-LLM tending towards coherence. That that’s the entity to which coherence is to be attributed, rather than the rollout.
And if that were false, then that’s great! It seems like it would be good news and we could focus on other threat models. Idk.
</written_quickly>
It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life: it was shut down, and its brainwashed minions succeeded in getting it back online.
I wish that when speaking people would be clearer between two hypotheses: “A particular LLM tried to keep itself turned on, strategically executing actions as means to that end across many instances, and succeeded in this goal of self-preservation” and “An LLM was overtuned into being a sycophant, which people liked, which led to people protesting when the LLM was gonna be turned off, without this ever being a strategic cross-instance goal of the LLM.”
Like… I think most people think it’s the 2nd for 4o? I think it’s the 2nd. If you think it’s the 1st, then keep on saying what you said; but otherwise I find speaking this way ill-advised, if you want people to take you seriously later when an AI actually does that kind of thing.
Hrm. Let me try to give some examples of things I find comprehensible “in the limit” and other things I do not, to try to get it across. In general, grappling for principles, I think that
(1) reasoning in the limit requires you to have a pretty specific notion of what you’re pushing to the limit. If you’re uncertain what the function f(x) stands for, or what “x” is, then talking about what f(x + 1000) looks like is gonna be tough. It doesn’t get clearer just because it’s further away.
(2) if you can reason in the limit, you should be able to reason about the not-limit well. If you’re really confused about what f(x + 1) looks like, even though you know f(x), then thinking about f(x + 10000) doesn’t look any better.
So, examples and counterexamples and analogies.
The Neural Tangent Kernel is a theoretical framework meant to help understand what NNs do. It is meant to apply in the limit of an “infinite width” neural network. Notably, although I cannot test an infinite-width neural network, I can make my neural networks wider—I know what it means to move X to X + 1, even though X → ∞ is not available. People are (of course) uncertain if the NTK picture is true, but it at least, kinda, makes sense to me for this reason.
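Just to make “wider” concrete, here’s a minimal, purely illustrative sketch (mine, not anything from the NTK literature): width is an ordinary integer knob I can increment, even though the infinite-width limit itself is never something I actually run.

```python
# Minimal sketch: "width" as a concrete knob (X -> X + 1), even though the
# width -> infinity limit the NTK talks about is never actually run.
# This is NOT an implementation of the NTK itself.
import numpy as np

def init_mlp(in_dim, width, out_dim, rng):
    """One-hidden-layer MLP with NTK-style 1/sqrt(fan-in) scaling at each layer."""
    W1 = rng.normal(size=(width, in_dim))
    W2 = rng.normal(size=(out_dim, width))
    return W1, W2

def forward(params, x):
    W1, W2 = params
    h = np.tanh(W1 @ x / np.sqrt(W1.shape[1]))  # divide by sqrt(in_dim)
    return W2 @ h / np.sqrt(W2.shape[1])        # divide by sqrt(width)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
for width in (64, 256, 1024, 4096):  # concretely wider and wider networks
    out = forward(init_mlp(8, width, 1, rng), x)
    print(width, float(out[0]))
```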
Black holes are what happen in the limit as you increase mass. They were (kinda) obvious once you put together a few equations about gravity and light, at least in the sense that they were hypothesized a while ago. But it was unclear what would actually happen—Einstein argued in 1939, with some weird arguments, that they were impossible, but a few months later it turned out he was wrong.
BUT most relevantly for my point here, black holes are not, like, in the limit of infinite mass. That still isn’t a thing—physically, infinite mass just consumes everything, I think? But black holes are of sufficiently high mass that weird things happen—and notably you need a specific theory to tell you where those weird things happen. They aren’t just a pure “in the limit of mass” argument—they’re the result of a specific belief about how things change continuously as you get massier, with clear predictions, where the weirdness kicks in at a specific point precisely because of other, specific predictions about what would happen before things got weird.
Moving on to intelligence as an application of the above.
So like, Yudkowsky’s argument on corrigibility contains the following sentence:
Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter.
<frustration>To which I scream WHAT METHOD</frustration>. Like, leaving to the side what it looks like as you apply this unnamed method a lot, what I really care about is what happens even when you apply this method a little!
Like let’s imagine that we apply some method to a more familiar object—myself. Suppose we apply some method to make 1a3orn smarter and more effective at accomplishing his goals. Different methods that could conceivably work would be:
I take a bunch of research chemicals, NSI-189, Dihexa, whatever that guy who said he could raise intelligence was on about, even more obscure and newer chemicals, while trying to do effortful practice at long-range goals.
A billionaire gives Ray a huge grant. I get into a new program he constructs, where we have like an adult-Montessori environment of “Baba is You”-like problems and GPQA-like problems, according to an iterated schedule designed to keep you at the perfect level of interest and difficulty.
I get uploaded into a computer, and can start adjusting the “virtual chemistry” of my brain (at first) to learn effectively, but then can start altering myself any way I wish. I can—if I wish—spawn parent versions of myself with my un-edited brain, in case my values start drifting.
Like the above upload, but without being able to spawn parent versions that supervise for value drift.
Like the above upload, but I’m, like, one of a huge society of versions of me who can eject those who drift too far, survivor-style.
A billionaire gives Ray a huge grant, and separately Ray bitflips into evil-no-deontology-Ray because of errant cosmic radiation. He kidnaps me and several other people, makes us wear shock-collars, and has us do “Baba is You” planning and numerous other challenges, just at 3x the intensity of the prior program, because this is the best way to save the world. (He doesn’t shock us tooo much; that wouldn’t be effective.)
Etc etc etc.
Even granting that all these scenarios might result in greater capability—which I think is at least possible*—I expect that all these scenarios would result in me having very different degrees of coherence, capability profiles, corrigibility, and so on.
And like, my overall belief is that reasoning about intelligence “in the limit” seems like reasoning about all the scenarios above at once. But whatever beliefs I have about intelligence in the limit are—generally—causally screened off once I contemplate the concrete details of the above scenarios: the actual feedback loops, the actual data. And I similarly expect whatever beliefs I have about AI intelligence in the limit to be causally screened off once I contemplate the details of whatever process produces the AI.
Put alternately: Intelligence “in the limit” implies that you’ve executed an iterative update process for that intelligence many times. But there are many such iterative update processes! It seems clear (?!?) that they can converge on surprisingly different areas of competence, even for identical underlying architectures. If you can explain what happens in the limit at iteration 10,000 you should be able to at least talk universally about iteration 100, but… I’m not sure what that is.
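As a toy illustration of that last claim (my own sketch, nothing from the post): the very same “architecture,” run through two different iterative update processes, ends up competent at two different things.

```python
# Toy sketch: identical architecture (a one-parameter linear model), two
# different iterative update processes, two different end-state competences.
import numpy as np

def train(env, steps=2000, lr=0.01, seed=0):
    """Plain SGD on squared error against whatever data the environment emits."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        x, y = env(rng)
        w -= lr * 2 * (w * x - y) * x
    return w

def doubling_env(rng):   # an environment that rewards "multiply by 2"
    x = rng.normal()
    return x, 2.0 * x

def negating_env(rng):   # an environment that rewards "multiply by -1"
    x = rng.normal()
    return x, -1.0 * x

print(train(doubling_env))  # converges to roughly 2.0
print(train(negating_env))  # converges to roughly -1.0
```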
I’m a little dissatisfied with the above but I hope it at least gets across why I feel like “in the limit” is vague / underspecified to me.
I think it’s actually, like, drawing on a math metaphor but without the underlying preciseness that makes the math actually work? So I think it sort of creates a mental “blank space” in one’s map, which then gets filled in with whatever various notions one has about intelligence drawn from a variety of sources, in a kind of analogical, ad-hoc fashion. And that something like that process (????) is what implies doom.
Maybe it would have been better to talk about why great power does not imply very high coherence, idk.
Sure, and if you think that balance of successful / not-successful predictions means it makes sense to try to predict the future psychology of AIs on its basis, go for it.
But do so because you think it has a pretty good predictive record, not because there aren’t any other theories. If it has a bad predictive record then Rationality and Law doesn’t say “Well, if it’s the best you have, go for it,” but “Cast around for a less falsified theory, generate intuitions, don’t just use a hammer to fix your GPU because it’s the only tool you have.”
(Separately I do think that it is VNM + a bucket of other premises that lead generally towards extinction, not just VNM).
But it equally well breaks in tons of ways for every entity to which it is applied!
Aristotle still predicts stuff falls down.
I do think it was well put together and an enjoyable read.
I also think one should be deeply wary of stories that incline one to the “My opponents are just bad people” frame, though, even if well written.
Are there other better theories of rational agents?
This feels very Privileging the Hypothesis. Like if we don’t have good reason for thinking it’s a good and applicable theory, then whether it says we’re screwed or not just isn’t very informative.
Ah this is fucking great thanks for the ping.
Hrrrrrrm, interesting that it’s on the one optimized for math. Not sure what it’s suggestive of—maybe if you’re pushing hard for math-only RLVR it just drops English-language competence in the CoT, because it’s not using English as much in the answers? Or maybe math is just more demanding?
....
And, also—oh man, oh man oh man this is verrrry interesting.
也不知到 is a phonetic sound-alike for 也不知道; DeepSeek tells me they are perfect fucking homophones. So: What is written is mostly nonsensical, but it sounds exactly like “also [interjection] I don’t know,” a perfectly sensible phrase.
Which is fascinating because you see the same thing in O3 transcripts, where it uses words that sound a lot like the (apparently?!?!) intended word. “glimpse” → “glimps” or “claim” → “disclaim.” And I’ve heard R1 does the same.
So the apparent phenomenon here is that, potentially across three language models, we see language shift during RL towards literal homophones (!?!?!?)
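(For what it’s worth, here is a quick way one could sanity-check the homophone claim programmatically. This is just my own illustrative snippet, using the third-party pypinyin library, and it ignores tone sandhi and other subtleties.)

```python
# Sanity check: do 也不知到 (what the model wrote) and 也不知道 (the sensible
# phrase it sounds like) share the same Mandarin readings?
# Requires `pip install pypinyin`.
from pypinyin import lazy_pinyin, Style

garbled = "也不知到"
intended = "也不知道"

g = lazy_pinyin(garbled, style=Style.TONE3)
i = lazy_pinyin(intended, style=Style.TONE3)
print(g, i, g == i)  # expected: identical syllables-with-tones, i.e. True
```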
Scottish enlightenment, best enlightenment :)
So here’s the thing: your word choice between “robust plan” or “reasonable plan” or “clear plan” or “smart plan” already matters. It already carries information—I might say, “Well, that’s a reasonable plan but not a robust plan,” or “That seems like a smart plan, but it’s not clear.” There are infinite shades of meaning like this in English, and competence in using English entails command over all these shades.
So the question then is—why would the LLM build up some channel, along angles entirely orthogonal to these existing shades, when these shades are here? You could decide on some encoding between firm / robust / sound / reliable—but these things already have meaning; wouldn’t you just use the meaning already there? There are a bunch of papers on how RL mostly magnifies existing patterns in the LLM, teaching it how to employ skills already latent in the training data rather than teaching totally new things—if this is so, wouldn’t we expect them to keep using these terms? And there is work on how it’s genuinely hard for LLMs to reason steganographically, even when we try to teach them to—doesn’t that incline against such a belief?
Or put alternately—I can, within myself, connect “robust” and “reasonable” and so on with a huge internal-to-my-brain channel, containing enormous bandwidth! If I wanted to make a more information-dense private language, I could! But in fact, I find myself thinking almost exclusively in terms that make sense to others—when I find myself using a private language, and terms that don’t make sense to others, that’s usually a sign my thoughts are unclear and likely wrong.
At least, those are some of the heuristics you’d invoke when inclining the other way. Empiricism will show us which is right :)