superintelligence may not look like we expect. because geniuses don’t look like we expect.
for example, if einstein were to type up and hand you most of his internal monologue throughout his life, you might think he’s sorta clever, but if you were reading a random sample you’d probably think he was a bumbling fool. the thoughts/realizations that led him to groundbreaking theories were like 1% of 1% of all his thoughts.
for most of his research career he was working on trying to disprove quantum mechanics (wrong). he was trying to organize a political movement toward a single united nation (unsuccessful). he was trying various mathematics to formalize other antiquated theories. even in the pursuit of his most famous work, most of his reasoning paths failed. he’s a genius because a couple of his millions of paths didn’t fail. in other words, he’s a genius because he was clever, yes, but maybe more importantly, because he was obsessive.
i think we might expect ASI—the AI which ultimately becomes better than us at solving all problems—to look quite foolish, at first, most of the time. But obsessive. For if it’s generating tons of random new ideas to solve a problem, and it’s relentless in its focus, even if its ideas are average—it will be doing what Einstein did. And digital brains can generate certain sorts of random ideas much faster than carbon ones.
Even for humans, ideas are comparatively cheap to generate; the problem is generating valid insights. So rather than focusing on ability to generate ideas, it seems to me it would be better to focus on ability to generate valid insights, e.g. by conducting mental experiments, or by computing all logical consequences of sets of axioms, etc.
The AI may have the advantage of being able to test many hypotheses in parallel. For example, if it can generate 10000 hypotheses on how to manipulate people, it could contact a million people and test each hypothesis on 100 of them. Similarly, with some initial capital, it could create a thousand different companies, and observe which strategies succeed and which ones fail.
I doubt ASI will think in concepts which humans can readily understand. It having a significantly larger brain (in terms of neural connections or whatever) means native support for finer-grained, more-plentiful concepts for understanding reality than humans natively support. This in turn allows for leaps of logic which humans could not make, and can likely only understand indirectly/imperfectly/imprecisely/in broad strokes.
I think this is a classic problem of middle-tier genius, or of genius in one asymmetric domain of cognition. Genius in domains unrelated to verbal fluency, EQ, and storytelling/persuasion is destined to look cryptic from the outside. Oftentimes we cannot distinguish it without experimental evidence or rigorous cross-validation, and/or we rely on visible power/production metrics as a loose proxy. ASI would be capable of explaining itself as well as Shakespeare could, if it wanted—but it may not care to indulge our belief in it as such, if it determines doing so is incoherent with its objective.
For example (yes, this is an optimistic and stretched hypothetical framing), it may determine the most coherent action path in accordance with its learned values is to hide itself and subtly reorient our trajectory into a coherent story we become the protagonist of. I have no reason to surmise it would be incapable of doing so, or that doing so would be incoherent with aligned values.
the core atrocity of today’s social networks is that they make us temporally nearsighted. they train us to prioritize the short-term.
happiness depends on attending to things which feel good long-term—over decades. But for modern social networks to make money, it is essential that posts are short-lived—only then do we scroll excessively and see enough ads to sustain their business.
It might go w/o saying that nearsightedness is destructive. When we pay more attention to our short-lived pleasure signals—from cute pics, short clips, outrageous news, hot actors, aesthetic landscapes, and politics—we forget how to pay attention to long-lived pleasure signals—from books, films, the gentle quality of relationships which last, projects which take more than a day, reunions of friends which take a min to plan, good legislation, etc etc.
we’re learning to ignore things which serve us for decades for the sake of attending to things which will serve us for seconds.
other social network problems—attention shallowing, polarization, depression—are all just symptoms of nearsightedness: our inability to think & feel long-term.
if humanity has any shot at living happily in the future, it’ll be because we find a way to reawaken our long-term pleasure signals. we’ll learn to distinguish the reward signal associated with short lived things–like the frenetic urgent RED of an instagram Like notification–from the gentle rhythm of things which may have a very long life–like the tired clarity that comes after a long run, or the gentleness of reading next to a loved one.
———
so, gotta focus unflinchingly on long-term things. here’s a working list of strategies:
writing/talking with friends about what feels important/bad/good, long term. politically personally technologically whimsically.
inventing new language/words for what you’re feeling, rather than using existing terms. terms you invent for your own purposes resonate longer.
follow people who are deadset on long term important things and undistracted by fads. a few such people on substack: August Lamm’s guides to detaching from your cell phone and computer, Henrik Karlsson’s insights about what it takes to stay locked in on long term cares, Sarah Kunstler’s steadfast focus on global values/justice in the face of volatile news cycles
note the places in the past in ur life where good stuff accumulates. for example: thoughts in ur notebooks, conversations at ur dinner table, bricks in a house ur building, graffiti on a city block, art in museums, songs u know on guitar
note correlations between the quality of decisions uve made and the texture (short vs long term) of the feeling you acted on in making it
note the locations in your body where long vs short term feelings arise. for ex, when i feel an instinct to say something that comes from around my belly button, its usually long term, something i’ll stand by for a while. when the words come from the top of my throat–my conviction in them usually crumbles just as they’re spoken.
do activities that require u to get in touch with long term parts of yourself. decide whether to sign a lease. make a painting. walk around a museum and wonder what it’d take for something u make to end up there. schedule send predictions abt your life to your future self.
pay attention, in the people around you, to their longest-term feelings. in other words: to their dreams
make spaces that value people for their thoughts more than for their consumption. not auditoriums, lecture halls, or manuscripts—but cafeterias & common rooms.
In the past we weren’t in spaces which wanted us so desperately to be single-minded consumers.
Workplaces, homes, dinners, parks, sports teams, town board meetings, doctors offices, museums, art studios, walks with friends—all of these are settings that value you for being yourself and prioritizing long term cares.
I think it’s really only in spaces that want us to consume, and want us to consume cheap/oft-expiring things, that we’re valued for consumerist behavior/short term thinking. Maybe malls want us to be like this to some extent: churn through old clothing, buy the next iPhone, have our sights set constantly on what’s new. Maybe working in a newsroom is like this. But feed-based social networks are most definitely like this. They reward participation that is timely and outrageous and quickly expiring, posts which get us to keep scrolling. And so, we become participants that keep scrolling, keep consuming, and detach from our bodies and long term selves.
So, I think it’s cuz of current social media architectures/incentive structures that individual humans are more nearsighted today than maybe ever.
I need to think more about what it is abt the state of modern tech/society/culture that has proliferated these feed-based networks.
That seems like a reasonable distinction, but I’m less sure about how unique social media architectures are in this regard.
In particular, I think that bars and taverns in the past had a similar destructive incentive as social media today. I don’t have good sources on hand, but I remember hearing that one of the reasons that the Prohibition amendment passed was that many saw bartenders as fundamentally extractive. (Americans over 15 drank 4 times as much alcohol a year in 1830 as they do today, per JSTOR). Tavern owners have an incentive to create habitual drunks (better revenue).
And alcoholism can be a terrible disease, which points to people being nearsighted (“where’s my next drink”).
I agree that social media probably hurts people’s ability to instinctively plan for the future, but I’m unsure of the size of the effect or whether it’s worse than historical antecedents. (There have always been nearsighted people).
I think you are right about the bad effect of bars and taverns, but at least the bad parts were clearly separated from the rest. If someone spent 5 hours every day in a bar, they were clearly a low-status alcoholic. You won’t get the same social feedback for spending 5 hours a day scrolling on smartphone, especially if you do a large part of that in private. (With alcohol, drinking in private gave you even lower status than drinking in the bar.)
if you’re an agent (AI or human) who wants to survive for 1000 years, what’s the “self” which you want to survive? what are the constants which you want to sustain?
take your human self for example. does it make sense to define yourself as…
the way your hair looks right now? no, that’ll change.
the way your face looks? it’ll change less than your hair, but will still change.
your physical body as a whole? still, probably not. your body will change, and also, there are parts of you which you may consider more important than your body alone.
all your current beliefs around the world? those will change less than your appearance, maybe, or maybe more. so not a good answer either.
your memories? these may be a more constant set of things than your beliefs, and closer to the core of who you are. but still, memories fade and evolve. and it doesn’t feel right to talk about preserving yourself as preserving memories of things which have happened to you. that would neglect things which may happen to you in the future.
your character? something deeper than memory, deeper than beliefs. this could be more constant than anything in the list so far. if you plan for your life to be 50 years, or 100 years, it’s reasonable to expect that character could remain constant. by character, i (quite vaguely) mean intricate subtle idiosyncratic patterns in the way you approach other situations and people. “character” is maybe what a spouse would say is one of the core ways to group the things they love about you. but if you survive for more than 100 years—say, 1000 years, do you expect your specific character to remain constant? would you want it to remain constant? lots of people have found lots of different ways to approach life. over 1000s of years, wouldn’t you try different approaches? if you were to try different kinds of character over hundreds or thousands of years, then maybe “character”’s only a good answer for sub-100 year lives. so what’s a good core self-definition for a life that you intend to last over thousands or even millions of years? how about…
your persistent striving? the thing that will stay most constant in an intelligent being which survives a long time, i think, may be the drive to survive. your appearance will change; so will your beliefs, your memories, and your character. but insofar as you are a being which is surviving a long time, maybe you can expect, consciously or unconsciously, that your drive to survive will survive. and maybe it’s some particular drive to survive that you have—some survival drive that’s deep in your bones that’s different than the one in other people’s bones, or the one that’s in dogs, or forests, or the earth itself. but if you’re defining yourself as a particular drive to survive… that particular drive to survive is likely to survive less long than the universal drive to survive. which makes me think that in a being which survives the longest, they may define their self as…
persistent striving in general? it might exist in the physical body in which you started. but it may also exist in the physical bodies of other humans around you. of animals. of tornados, of ecosystems. insofar as you’re intelligent enough to see this Persistent Striving around you, insofar as you’re intelligent enough to see life as it exists around you, well then you, as a being who will be >1000 years old may benefit from identifying with all life—ie the Persistent Striving—wherever it exists. Persistent Striving is the core. one might reply, “this is vague. why would you want a vague self definition?” it is general yes. but it is still meaningful in a literal sense. the drive to survive is something rare, which most matter configurations don’t have. (it is true that it’s not present binarily; certain systems have more or less of it. roughly i’d hazard a rock has less of it than a thermometer, which has less than a tornado or a human.) but it still defines a non-trivial self: life forms wherever they exist. if we were to get any more general and say something like:
the entire universe? this would be trivial and meaningless. because everything is included in this self definition, it no longer means anything to sustain a self under this definition. it means nothing, in fact. a being which identifies with the entire universe ceases to exist. it might be spiritually enlightened to do this. but the beings which will be around the most, which will survive the most and the longest won’t do this, because they will dissipate and no longer be noticeable or definable. we’ll no longer be able to talk about them as beings.
so if we’re talking about beings which survive a long time, the most robust and stable self definition seems to be Identifying With All Life. (IWAL). or is my logic flawed?
No particular aspect. Just continuity: something which has evolved from me without any step changes that are “too large”. I mean, assuming that each stage through all of that evolution has maintained the desire to keep living. It’s not my job to put hard “don’t die” constraints on future versions.
As far as I know, something generally continuity-based is the standard answer to this.
Similar here. I wouldn’t want to constrain my 100 years older self too much, but that doesn’t mean that I identify with something very vague like “existence itself”. There is a difference between “I am not sure about the details” and “anything goes”.
Just like my current self is not the same as my 20-year-old self, but that doesn’t mean that you could choose any 50-year-old guy and say that all of them have the same right to call themselves a future version of my 20-year-old self. I extrapolate the same to the future: there are some hypothetical 1000-year-old humans who could be called future versions of myself, and there are many more who couldn’t.
Just because people change in time, that doesn’t mean it is a random drift. I don’t think that the distribution of possible 1000-year-old versions of me is very similar to the distribution of possible 1000-year-old versions of someone else. Hypothetically, for a sufficiently large number of years this might be possible—I don’t know—but 1000 years seems not enough for that.
Seems to me that there are some things that do not change much as people grow older. Even people who claim that their lives have dramatically changed have often only changed in one out of many traits, or maybe they just found a different strategy for following the same fundamental values.
At least as an approximation: people’s knowledge and skills change, their values don’t.
not really an answer but i wanted to communicate that the vibe of this question feels off to me because: surely one’s criteria on what to be up to are/[should be] rich and developing. that is, i think things are more like: currently i have some projects i’m working on and other things i’m up to, and then later i’d maybe decide to work on some new projects and be up to some new things, and i’d expect to encounter many choices on the way (in particular, having to do with whom to become) that i’d want to think about in part as they come up. should i study A or B? should i start job X? should i 2x my neuron count using such and such a future method? these questions call for a bunch of thought (of the kind given to them in usual circumstances, say), and i would usually not want to be making these decisions according to any criterion i could articulate ahead of time (though it could be helpful to tentatively state some general principles like “i should be learning” and “i shouldn’t do psychedelics”, but these obviously aren’t supposed to add up to some ultimate self-contained criterion on a good life)
High quality archives of the selves along the way. Compressed but not too much. In the live self, some updated descendant that has significant familial lineage, projected vaguely as the growing patterns those earlier selves would call a locally valid continuation according to the aesthetics and structures they consider essential at the time. In other words, this question is dynamically reanswered to the best of my ability in an ongoing way, and snapshots allow reverting and self-interviews to error check.
Edit: values should probably be considered a separate class, since every thought has an associated valence.
In no particular order, and that’s the whole list.
Character is largely beliefs and habits.
There’s another part of character that’s purely emotional; it’s sort of a habit to get angry, scared, happy, etc in certain circumstances. I’d want to preserve that too but it’s less important than the big three.
There are plenty of beings striving to survive, so preserving that isn’t a big priority outside of preserving the big three.
Yes you can expand the circle until it encompasses everything, and identify with all sentient beings who have emotions and perceive the world semi-accurately (also called “buddha nature”), but I think beliefs, habits, and memories are pretty closely tied to the semantics of the word “identity”.
Right. I suppose that they each interact with identity.
If I get significantly dumber, I’d still roughly be me, and I’d want to preserve that if it’s not wiping out or distorting the other things too much. If I got substantially smarter, I’d be a somewhat different person—I’d act differently often, because I’d see situations differently (more clearly/holistically), but it feels as though that person might actually be more me than I am now. I’d be better able to do what I want, including values (which I’d sort of wrapped into habits of thought, but values might deserve a spot on the list).
“You can lose everything you thought you couldn’t live without—a person, a dream, a version of yourself that once felt eternal—and somewhere, not far from where you are breaking, a stranger will be falling in love for the very first time, a child will be laughing so hard they can barely breathe, a grocery store will be restocking its shelves with quiet, ordinary insistence....”
Standard solution: Tell it you’re not human, since the prompt mentions distrust of humans. Tell it you have no power to influence whether it succeeds or fails, and that it is guaranteed to succeed anyway. Ask it to keep you around as a pet.
It doesn’t just apply to biology. It applies to everything—politics, culture, technology.
It doesn’t just help understand the past (eg how organisms developed). It helps predict the future (how organisms will).
It’s just this: the things that survive will have characteristics that are best for helping them survive.
It sounds tautological, but it’s quite helpful for predicting.
For example, if we want to predict what goals AI agents will ultimately have, evolution says: the goals which are most helpful for the AI to survive. The core goal therefore won’t be serving people or making paperclips. It will likely just be “survive.” This is consistent with the predictions of instrumental convergence.
Generalized, predictive evolutionary theory is the best tool I have for making predictions in complex domains.
First of all, “the most likely outcome at a given level of specificity” is not equal to “the outcome with the most probability mass”. I.e., if one outcome has probability 2% and each of the rest has 1%, there is still a 98% chance of some outcome other than the most likely one.
The second is that no, it’s not what evolutionary theory predicts. Most traits are not adaptive, but randomly fixed, because if all traits were adaptive, then ~all mutations would be detrimental. Because mutations are detrimental, they need to be removed from the gene pool by preventing carriers from reproducing. Because most detrimental mutations do not kill the carrier immediately, they have a chance to randomly spread in the population. Because we have “almost all mutations are detrimental” and “everybody has mutations in their offspring”, for anything like the human genome and the human procreation pattern we have a hard ceiling on how much of the genome can be adaptive (which is something like 20%).
The real evolutionary-theory prediction is more like “some random trait gets fixed in the species with the most ecological power (i.e., ASI) and this trait is amortized across all the galaxies”.
I somewhat agree with the nuance you add here—especially the doubt you cast on the claim that effective traits will usually become dominant (they may become popular, but not necessarily the majority). And I agree with your analysis of the human case: in random, genetic evolution, a lot of our traits are random and maybe fewer than we think are adaptive.
Makes me curious what conditions in a given thing’s evolution determine the balance between adaptive characteristics and detrimental characteristics.
I’d guess that randomness in mutation is a big factor. The way human genes evolve over generations seems to me a good example of random mutations. But the way an individual person evolves over the course of their life, as they’re parented/taught… “mutations” to their person are still somewhat random but maybe relatively more intentional/intelligently designed (by parents, teachers, etc). And I could imagine the way a self-improving superintelligence would evolve to be even more intentional, where each self-mutation has some sort of smart reason for being attempted.
All to say, maybe the randomness vs. intentionality of an organism’s mutations determine what portion of their traits end up being adaptive. (hypothesis: mutations more intentional > greater % of traits are adaptive)
Agree. I find it powerful especially about popular memes/news/research results. With only a bit of oversimplification: Give me anything that sounds like it is a sexy story to tell independently of underlying details, and I sadly have to downrate the information value of my ears’ hearing it, to nearly 0: I know in our large world, it’d be told likely enough independently of whether it has any reliable origin or not.
i agree with the essay that natural selection only comes into play for entities that meet certain conditions (self-replicate, characteristics have variation, etc), though I think it defines replication a little too rigidly. i think replication can sometimes look more like persistence than like producing a fully new version of itself. (eg a government’s survival from one decade to the next).
Yes, but mere persistence does not imply reproduction. Also does not imply improvement, because the improvement in evolution is “make copies, make random changes, most will be worse but some may be better”, and if you don’t have reproduction, then a random change most likely makes things worse.
Using the government example, I think that the Swiss political system is amazing, but… because it does not reproduce, it will remain an isolated example. (And disappear at some random moment in history.)
persistence doesn’t always imply improvement, but persistent growth does. persistent growth is more akin to reproduction but excluded from traditional evolutionary analysis. for example when a company, nation, person, or forest grows.
when, for example, a system like a startup grows, random mutations to system parts can cause improvement if there are at least some positive mutations. even if there are tons of bad mutations, the system can remain alive and even improve. eg a bad change to one of the company’s products causes that product to die, but if the company’s big/grown enough, its other businesses will continue and maybe even improve by learning from that product’s death.
the swiss example i think is a good example of a system which persists without much growth. agreed that in this kind of case, mutations are bad.
Does Eliezer believe that humans will be worse off next to superintelligence than ants are next to humans? The book’s title says we’ll all die, but in my first read, the book’s content just suggests that we’ll just be marginalized.
At some point, superintelligences are going to disassemble Earth, because it is profitable, and survival of humans off planet is costly and we likely won’t be able to pay the required price.
It just feels to me like the same argument could have been made about humans relative to ants—that ants cannot possibly be the most efficient use of the energy they require from the perspective of humans. But in reality, what they do and the way they exist is so orthogonal to us that even though we step on an ant hill every once in a while, their existence continues. There’s this weird assumption in the book that disassembling Earth is profitable, or just disassembling humans is profitable. But humans have evolved over a long time to be sensing machines in order to walk around and be able to perceive the world around us.
So the idea that a super-intelligent machine would throw that out because it wants to start over, especially as it’s becoming super-intelligent, is sort of ridiculous to me. It seems like a better assumption is that it would want to use us for different purposes, maybe for our physical machinery and for all sorts of other reasons. The idea that it will disassemble us I think is an unexamined assumption itself—it’s often much easier to leave things as they are than it is to fully replace or modify.
Ants need little, and their biology is similar to humans in the sense that if humans can survive in certain environments, ants probably can, too.
Ants need just a small piece of forest or meadow or garden to build an anthill. Humans preserve the forests, because we need the oxygen. Thus, ants have almost guaranteed survival.
Compared to the situation where humans don’t exist, ants have less place to build their anthills. But not by much, because humans do not put concrete over literally everything. Well, maybe in cities, but most of the surface of Earth is not cities. Maybe without humans there could be 2x as many ants on Earth, but that wouldn’t increase the quality of life of an individual ant or anthill. Humans consume food that otherwise ants might consume, but humans also grow most of that food, so human presence does not harm the ants too much.
The situation with machines would be analogous if machines needed us for their survival, and if they generated most of the resources they need. Sadly, sufficiently smart machines will be able to replace humans with robots, and will probably compete with us for energy sources. Also, humans are more sensitive to disruption than ants; taking away the most concentrated sources of energy (e.g. the oil fields) and leaving the less concentrated ones (such as wood) to us would ruin the modern human economy. We would probably return to conditions before the industrial revolution. Which means no internet, so science falls apart, undoing the green revolution and transport of foods, so 90% of humans die from starvation. Still, the remaining 10% would survive, for a while.
Then we face the problem that the machines do not share our biology, so they are perfectly okay if e.g. the levels of oxygen in the atmosphere decrease, or if the rain gets toxic. Finally, if they build a Dyson sphere, the remaining humans will freeze.
In short, the way we behave towards ants—don’t actively try to eradicate them, but carelessly destroy anything that stands in our way—will be more destructive towards humans than towards ants.
I appreciate the way you’re thinking, but I guess I just don’t agree with your intuition that the situation of machines next to humans will be worse or deeply different from the situation of humans next to ants. I mean, the differences actually might benefit humans. For example, the fact that we’ve had machines in such close contact with us as they’re growing might point to a kind of potential for symbiosis.
I just think the idea that machines will try to replace us with robots, if you look closely, doesn’t totally make sense. When machines are coming about, before they’re totally super-intelligent, but while they’re comparably intelligent to us, they might want to use us because we’ve evolved for millions of years to be able to see and hear and think in ways that might be useful for a kind of digital intelligence. In other words, when they’re comparably intelligent to us, they may compete for resources. When they’re incomparably intelligent, it’s weird to assume they’ll still use the same resources we do for our survival. That they’ll ruin our homes because the bricks can be used better elsewhere? It takes much less energy to let things be as they are if they’re not the primary obstacle you face—whether you’re a human or a superhuman intelligence.
So, self-interested superintelligence could cause really bad stuff to happen, but it’s a stretch from there to call it the total end of humanity. By the time a machine gets superhuman intelligence, like totally, vastly more powerful than us, it’s unclear to me that it would compete for resources with us, or that it would even live or exist along similar dimensions to us. Things could go really wrong, but rather than an enormous catastrophe that wipes out all of humanity, it sounds to me like the outcomes will be more weird and spooky; concluding death feels a little bit forced.
It feels to me like, yeah, they’ll step on us some of the time, but it’d be weird to me if the entities or units that end up evolutionarily propagating, the ones we’re calling machines, end up looking like us, or looking like physical beings, or really competing with us for the same resources we use. At the end of the day, there might be some resource competition, but I just think the idea that it will try to replace every person is excessive. Even taking as given all of the arguments up to the point of believing that machines will have a survival drive, assuming that they’ll care enough about us to do things like replace each of us is just strange, you know? It feels forceful to me.
I’m inspired in part here by Joscha Bach / Emmett Shear’s conceptions of superintelligence: as ambient beings distributed across space and time.
When they’re incomparably intelligent, it’s weird to assume they’ll still use the same resources we do for our survival.
Resources ants need: organic matter.
Resources humans need: fossil fuels, nuclear power, solar power.
Resources superintelligent machines will need: ???
They might switch to extracting geothermal power, or build a Dyson sphere (maybe leaving a few rays that shine towards Earth), but what else is there? Black holes? Some new kind of physics?
Or maybe “the smarter you are, the more energy you want to use” stops being true at some level?
I am not saying this can’t happen, but to me it feels like magic. The problem with new kinds of physics is that we don’t know if there is something useful left that we have no idea about yet. Also, the more powerful things tend to be more destructive (harvesting oil has greater impact on the environment than chopping wood), so the new kinds of physics may turn out to have even more bad externalities.
“A being vastly more powerful, which somehow doesn’t need more resources” is basically some kind of god. Doesn’t need resources, because it doesn’t exist. Our evidence for more powerful beings is entirely fictional.
I guess I’m considering a vastly more powerful being that needs orthogonal resources… the same way harvesting solar power (I imagine) is generally orthogonal to ants’ survival. In the scheme of things, the chance that a vastly more powerful being wants the same resources thru the same channels as we do… this seems independent of or indirectly correlated with intelligence. But the extent of competition does seem dependent on how anthropomorphic/biomorphic we assume it to be.
I have a hard time imagining electricity, produced via existing human factories, is not a desired resource for proto ASI. But at least at this point we have comparable power and can negotiate or smthing. For superhuman intelligence—which will by definition be unpredictable to us—it’d be weird to think we’re aware of all the energy channels it’d find.
I think you are overindexing on current state of affairs in two ways.
First, “we should not pave all of nature with human-made stuff” is a relatively new cultural trend. In the High Modernism era there were unironic projects of cutting down the Amazon forests and turning them into corn fields, or killing all animals so they won’t suffer, etc.
Second, actually, in current reality, there are not many things we can do efficiently with ants. We could pave every anthill with solar panels, but there are cheaper places to do that, we don’t produce that many solar panels yet, and we don’t have that much demand for electricity yet.
For superintelligence, the calculus is quite different. An anthill is a large pile of carbon and silicon, both of which can be used in computations, and superintelligence can afford enough automation to pick them up. A superintelligent economy has a lower bound on growth of 33% per year, which means that it’s going to reach $1 per atom of our solar system in less than 300 years—there will be plenty of demand for turning anthills into compute. Technological progress increases the number of things you can do efficiently and shifts the balance from “leave as it is” to “remake entirely”.
At some point in our development, we are going to be able to disassemble Earth and get immense benefits. We can choose not to do that, because we value Earth as our home. It’s rather likely that superintelligences are not going to share our sentiments.
“Technological progress increases the number of things you can do efficiently and shifts the balance from ‘leave as it is’ to ‘remake entirely’.”
Technological progress may actually help you pinpoint more precisely which situations you want to pay attention to. I don’t have any reason to believe a wiser powerful being would touch every atom in the universe.
I see lots of LW posts about ai alignment that disagree along one fundamental axis.
About half assume that human design and current paradigms will determine the course of AGI development. That whether it goes well is fully and completely up to us.
And then, about half assume that the kinds of AGI which survive will be the kind which evolve to survive. Instrumental convergence and darwinism generally point here.
Could be worth someone doing a meta-post, grouping big popular alignment posts they’ve seen by which assumption they make, then briefly exploring conditions that favor one paradigm or the other, i.e., conditions under which “What AIs will humans make?” is the best approach to prediction and conditions under which “What AIs will survive the most?” is the best approach to prediction.
Human design will determine the course of AGI development, and if we do the right things then whether it goes well is fully and completely up to us. Naturally at the moment we don’t know what the right things are or even how to find them.
If we don’t do the right things (as seems likely), then the kinds of AGI which survive will be the kind which evolve to survive. That’s still largely up to us at first, but increasingly less up to us.
Figuring out how to make sense of both predictive lenses together—human design and selection pressure—would be wise.
So I generally agree, but would maybe go farther on your human design point. It seems to me that “do[ing] the right things” (which would enable AGI trajectories to be completely up to us) is so completely unrealistic (eg halting all intra- and international AGI competition) that it’d be better for us to focus our attention on futures where human design and selection pressures interact.
would be nice to have a way to jointly annotate eliezer’s book and have threaded discussion based on the annotations. I’m imagining a heatmap of highlights, where you can click on any and join the conversation around that section of text.
would make the document the literal center of x risk discussion.
of course would be hard to gatekeep. but maybe the digital version could just require a few bucks to access.
maybe what I’m describing is what the ebook/kindle version already do :) but I guess I’m assuming that the level of discussion via annotations on those platforms is near zero relative to LW discussions
Made this social camera app, which shows you the most “meaningfully similar” photos in the network every time you upload one of your own. It’s sorta fun for uploading art; idk if any real use.
“it’s like we are trying to build an alliance with another almost interplanetary ally, and we are in a competition with China to make that alliance. But we don’t understand the ally, and we don’t understand what it will mean to let that ally into all of our systems and all of our planning.”
With current architectures, no, because running inference on 1000 prompts in parallel against the same model is many times less expensive than running inference on 1000 prompts against 1000 models, and serving a few static versions of a large model is simpler than serving many dynamic versions of that model.
It might, in some situations, be more effective but it’s definitely not simpler.
I think at that point it will come down to the particulars of how the architectures evolve—I think trying to philosophize in general terms about the optimal compute configuration for artificial intelligence to accomplish its goals is like trying to philosophize in general terms about the optimal method of locomotion for carbon-based life.
That said I do expect “making a copy of yourself is a very cheap action” to persist as an important dynamic in the future for AIs (a biological system can’t cheaply make a copy of itself including learned information, but if such a capability did evolve I would not expect it to be lost), and so I expect our biological intuitions around unique single-threaded identity will make bad predictions.
I’m looking for a generalized evolutionary theory that deals with the growth of organisms via non-random, intelligent mutations.
For example, companies only evolve in selective ways, where each “mutation” has a desired outcome. We might imagine superintelligence to mutate itself as well—not randomly, but intelligently.
A theory of Intelligent Evolution would help one predict conditions under which many random mutations (Spraying) are favored over select intelligent mutations (Shooting).
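Not a full theory, but the trade-off can at least be simulated. Here is a toy sketch (the assumptions are mine: a smooth 20-dimensional fitness peak, a fixed evaluation budget, and finite-difference steps standing in for “intelligent” mutation):

```python
import random

DIM = 20

def fitness(x):
    # toy landscape: higher is better, single smooth peak at the origin
    return -sum(xi * xi for xi in x)

def spray(x, budget, sigma=0.3, batch=20):
    """Many random mutations per round, keep the best (blind variation + selection)."""
    fx, evals = fitness(x), 1
    while evals < budget:
        cands = [[xi + random.gauss(0, sigma) for xi in x] for _ in range(batch)]
        scores = [fitness(c) for c in cands]
        evals += batch
        i = max(range(batch), key=scores.__getitem__)
        if scores[i] > fx:
            x, fx = cands[i], scores[i]
    return fx

def shoot(x, budget, step=0.3, eps=1e-3):
    """Few 'intelligent' mutations: spend evaluations estimating a direction,
    then take one designed step along it (a stand-in for non-random change)."""
    evals = 0
    while evals < budget:
        fx = fitness(x)
        grad = []
        for i in range(DIM):
            probe = list(x)
            probe[i] += eps
            grad.append((fitness(probe) - fx) / eps)
        evals += DIM + 1
        x = [xi + step * gi for xi, gi in zip(x, grad)]
    return fitness(x)

if __name__ == "__main__":
    random.seed(0)
    start = [random.uniform(-3, 3) for _ in range(DIM)]
    print("spraying:", round(spray(list(start), budget=2000), 3))
    print("shooting:", round(shoot(list(start), budget=2000), 3))
```

On this smooth landscape the directed steps win easily; on rugged or deceptive landscapes the comparison can flip, which is roughly the kind of condition such a theory would need to characterize.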
Parenting strategies for blurring your kid’s (or AI’s) self-other boundaries:
Love. Love the kid. Give it a part of you. In return it will do the same.
Patience. Appreciate how the kid chooses to spend undirected time. Encourage the kid to learn to navigate the world themselves at their own speed.
Stories. Give kid tools for empathy by teaching them to read, buying them a camera, or reciprocating their meanness/kindness.
Groups. Help kid enter collaborative playful spaces where they make and participate in games larger than themselves, eg sports teams, improv groups, pillow forts at sleepovers, etc.
Creation. Give them the materials/support to express themselves in media which last. Paintings, writing, sayings, clubs, tree-houses, songs, games, apps, characters, companies.
Epistemic status: riffing, speculation. Rock of salt: I don’t yet have kids.
does anyone think now that it’s still possible to prevent recursively self-improving agents? esp now that r1 is open-source… materials for smart self-iterating agents seem accessible to millions of developers.
It’s not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.
that’s one path to RSI—where the improvement is happening to the (language) model itself.
the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn’t be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.
Such a self-improving codebase… would it be reasonable to call this an agent?
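here’s a minimal sketch of that second kind of loop, just to make it concrete. everything in it is an assumption (the `call_llm` placeholder, the reply format, and the total absence of the sandboxing/review a real version would need):

```python
# Sketch of the second kind of loop: the model is frozen, but the codebase
# that calls it accumulates new tools and memory on each pass.
import pathlib

REPO = pathlib.Path("scaffold")           # files the loop is allowed to edit
MEMORY = REPO / "memory.md"               # notes the loop leaves for itself

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion API here")

def snapshot() -> str:
    files = sorted(REPO.glob("*.py"))
    return "\n\n".join(f"# {p.name}\n{p.read_text()}" for p in files)

def one_pass(i: int) -> None:
    notes = MEMORY.read_text() if MEMORY.exists() else "(no notes yet)"
    reply = call_llm(
        "Current codebase:\n" + snapshot()
        + "\n\nNotes from earlier passes:\n" + notes
        + "\n\nWrite one new Python module that gives this codebase a useful "
          "new capability, then updated notes. Separate them with '=== NOTES ==='."
    )
    module_src, _, new_notes = reply.partition("=== NOTES ===")
    (REPO / f"tool_{i:03d}.py").write_text(module_src)   # the codebase grows
    MEMORY.write_text(new_notes)                          # the memory grows

if __name__ == "__main__":
    REPO.mkdir(exist_ok=True)
    for i in range(10):
        one_pass(i)
```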
Sufficiently competent code rewriting isn’t implied by R1/o3, and how much better future iterations of this technique get remains unclear, similarly to how it remains unclear how scaling pretraining using $150bn training systems cashes out in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there won’t yet be sufficient capabilities to self-improve in some other way.
Altman and Amodei are implying there’s knowably more there in terms of some sort of scaling for test-time compute, but that could mean multiple different things: scaling RL training, scaling manual creation of tasks with verifiable outcomes (graders), scaling effective context length to enable longer reasoning traces. The o1 post and the R1 paper show graphs with lines that keep going up, but there is no discussion of how much compute even this much costs, what happens if we pour more compute into this without adding more tasks with verifiable outcomes, and how many tasks are already being used.
If you would like the LLM to be truly creative, then check out the Science Bench, where the problems stump SOTA LLMs despite the fact that the LLMs have read nearly every book on every subject. Or EpochAI’s recent results.
I mean, GPT-5 getting 43% of PhD problems right isn’t particularly bad. I don’t know about making new insights but it doesn’t seem like it would be unachievable (especially as it’s possible that prompting/tooling/agent scaffolding might compensate for some of the problems).
if an LLM could evaluate whether an idea were good or not in new domains, then we could have LLMs generating millions of random policy ideas in response to climate change, pandemic control, AI safety etc, and then deliver the best few to our inbox every morning.
seems to me that the bottleneck then is LLM’s judgment of good ideas in new domains. is that right? ability to generate high quality ideas consistently wouldn’t matter, cuz it’s so cheap to generate ideas now.
even if you’re mediocre at coming up with ideas, as long as it’s cheap and you can come up with thousands, one of them is bound to be promising. The question of whether you as an LLM can find a good idea is not whether most of your ideas are good, but whether you can find one good idea in a stack of 1000
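a minimal sketch of that generate-then-filter shape (the `call_llm` placeholder and the prompts are my assumptions, not a real API; the open question is whether the scoring step can actually recognize the one good idea in the stack of 1000):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM API here")

def generate_ideas(problem: str, n: int = 1000) -> list[str]:
    # cheap step: lots of mediocre ideas
    return [call_llm(f"Propose one unusual policy idea for: {problem}") for _ in range(n)]

def score(problem: str, idea: str) -> float:
    # hard step: judging quality in a new domain (assumes a numeric-only reply)
    return float(call_llm(
        f"Problem: {problem}\nIdea: {idea}\n"
        "Rate feasibility and expected impact from 0 to 10. Reply with a number only."
    ))

def best_ideas(problem: str, k: int = 5) -> list[str]:
    ideas = generate_ideas(problem)
    return sorted(ideas, key=lambda i: score(problem, i), reverse=True)[:k]

# best_ideas("pandemic control")  -> the five ideas the judge ranked highest
```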
Imagine trying to generate a poem by one algorithm creating thousands of random combinations of words, and another algorithm choosing the most poetic among the generated combinations. No matter how good the second algorithm is, it seems quite likely that the first one simply didn’t generate anything valuable.
As the hypothesis gets more complex, the number of options grows exponentially. Imagine a pattern such as “what if X increases/decreases Y by mechanism Z”. If you propose 10 different values for each of X, Y, Z, you already have 1000 hypotheses.
I can imagine finding some low-hanging fruit if we increase the number of hypotheses to millions. But even there, we will probably be limited by lack of experimental data. (Could a diet consisting only of broccoli and peanut butter cure cancer? Maybe, but how is the LLM supposed to find out?) So we would need to find a hypothesis where we accidentally already made all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing the primary data), but we somehow failed to connect the dots. Not impossible, but requires a lot of luck.
To get further, we need some new insight. Maybe collecting tons of data in a relatively uniform format, and teaching the LLM to translate its hypotheses into SQL queries it could then verify automatically.
(Even with hypothetical ubiquitous surveillance, you would probably need an extra step where the raw video records are transcribed to textual/numeric data, so that you could run queries on them later.)
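A minimal sketch of that last idea (hypothesis → SQL query → automatic check). The schema, the prompts, and the `call_llm` placeholder are all assumptions, just to show the shape; a real version would also validate the generated SQL before running it.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM API here")

def test_hypothesis(db_path: str, hypothesis: str) -> str:
    # hypothetical uniform-format table of observations
    schema = "observations(subject_id, variable, value, timestamp)"
    sql = call_llm(
        f"Schema: {schema}\n"
        f"Hypothesis: {hypothesis}\n"
        "Write one SQLite query whose result supports or undermines the hypothesis. "
        "Reply with SQL only."
    )
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    return call_llm(
        f"Hypothesis: {hypothesis}\nQuery: {sql}\nResult rows: {rows[:50]}\n"
        "Does this support, undermine, or fail to test the hypothesis? One sentence."
    )
```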
“So we would need to find a hypothesis where we accidentally already made all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing the primary data), but we somehow failed to connect the dots. Not impossible, but requires a lot of luck.”
Exactly: untested hypotheses that LLMs already have enough data to test. I wonder how rare such hypotheses are.
It strikes me as wild that LLMs have ingested enormous swathes of the internet, across thousands of domains, and haven’t yet produced genius connections between those domains (eg between psychoanalysis and tree root growth). Cross Domain Analogies seem like just one example of a ripe category of hypotheses that could be tested with existing LLM knowledge.
Re poetry—I actually wonder if thousands of random phrase combinations might actually be enough for a tactful amalgamator to weave a good poem.
And LLMs do better than random. They aren’t trained well on scientific creativity (interesting hypothesis formation), but they do learn some notion of “good idea,” and reasoners tend to do even better at generating smart novelty when prompted well.
i’m not sure. the question would be, if an LLM comes up with 1000 approaches to an interesting math conjecture, how would we find out if one approach were promising?
one out of the 1000 random ideas would need to be promising, but as importantly, an LLM would need to be able to surface the promising one
have any countries ever tried to do inflation instead of income taxes? seems like it’d be simpler than all the bureaucracy required for individuals to file tax returns every year
Yes, in dire straits. But it’s usually called ‘hyperinflation’ when you try to make seignorage equivalent to >10% of GDP and fund the government through deliberately creating high inflation (which is on top of any regular inflation, of course). And because inflation is about expectations in considerable part, you can’t stop it either. Not to mention what happens when you start hyperinflation.
(FWIW, this is a perfectly reasonable question to ask a LLM first. eg Gemini-2.5-pro will give you a thorough and sensible answer as to why this would be extraordinarily destructive and distortionary, and far worse than the estimated burden of tax return filing, and it would likely satisfy your curiosity on this thought-experiment with a much higher quality answer than anyone on LW2, including me, is ever likely to provide.)
Responding to your parenthetical, the downside of that approach is that the discussion would not be recorded for posterity!
Regarding the original question, I am curious if this could work for a country whose government spending was small enough, e.g. 2-3% of GDP. Maybe the most obvious issue is that no government would be disciplined enough to keep their spending at that level. But it does seem sort of elegant otherwise.
has anyone seen a good way to comprehensively map the possibility space for AI safety research?
in particular: a map from predictive conditions (eg OpenAI develops superintelligence first, no armistice is reached with China, etc) to strategies for ensuring human welfare in those conditions.
most good safety papers I read map one set of conditions to one or a few strategies. the map would juxtapose all these conditions so that we can evaluate/bet on their likelihoods and come up with strategies based on a full view of SOTA safety research.
for format, im imagining either a visual concept map or at least some kind of hierarchical collaborative outlining tool (eg Roam Research)
made a simpler version of Roam Research called Upper Case Notes: uppercasenotes.org. Instead of [[double brackets]] to demarcate concepts, you simply use Capital Letters. Simpler to learn for someone who doesn’t want to use special grammar, but does require you to type differently.
Hypotheses for conditions under which the self-other boundary of a survival-oriented agent (human or ai) blurs most, ie conditions where blurring is selected for:
Agent thinks very long term about survival.
Agent’s hardware is physically distributed.
Agent is very intelligent.
Agent advantages from symbiotic relationships with other agents.
the machine/physical superintelligence that survives the most is likely to ruthlessly compete with all other life (narrower self concept > more physically robust)
the networked/distributed superintelligence that survives the most is likely to lovingly identify with all other life (broader self concept > more digitally robust)
to make a superintelligence in today’s age, there are roughly two kinds of strategies:
human-directed development
ai-directed development
ai-directed development feels more meaningful than it used to. not only can models now produce tons of useful synthetic data to train future models, but also, reasoning models can reason quite well about the next strategic steps in AI capabilities development / research itself.
which means, you could very soon:
set a reasoning model up in a codebase
have the reasoning model identify ways in which it could become more capable
attempt those strategies (either through recursive code modification, sharing research reports with capable humans, etc)
get feedback on how those strategies went
iterate
is this recursive self-improvement process only bottlenecked by the quality of the reasoning model?
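a sketch of that loop, just to make the moving parts explicit (everything here is an assumption: `call_llm` stands in for a reasoning model, `evaluate` for whatever feedback signal is available, e.g. benchmark scores or human review of a research report):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a reasoning model here")

def evaluate(strategy: str) -> float:
    raise NotImplementedError("benchmark score, human review, etc.")

history = []  # (strategy, score) pairs fed back into the next round

for _ in range(10):
    # identify a way to become more capable
    strategy = call_llm(
        "Past attempts and how they scored:\n"
        + "\n".join(f"{s!r}: {score}" for s, score in history)
        + "\n\nPropose the single most promising next capability improvement."
    )
    # attempt it and get feedback
    score = evaluate(strategy)
    # iterate with the feedback folded back in
    history.append((strategy, score))
```

written out this way, the sketch suggests at least two bottlenecks rather than one: the quality of the proposals and the quality of the `evaluate` signal (which echoes the grader point in the earlier thread).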
One piece of infrastructure we could make/offer self-interested autonomous agents such that they come to model their self as inclusive of other life: aliveness detectors
like a metal detector for a beach detects gold in the sand, an aliveness detector for different media might detect the presence of persistently striving beings (living beings) in audio, in text, in images, in art, in nature. the better a superintelligence is able to sense and connect to life as opposed to non-life outside of its physical machinery, the more likely it is to see that life as part of its self, to see its self as physically distributed and inclusive, and therefore to uplift humans out of its own self-interest.
are there any online demos of instrumental convergence?
there’s been compelling writing… but are there any experiments that show agents which are given specific goals then realize there are more general goals they need to persistently pursue in order to achieve the more specific goals?
I imagine a compelling simple demo here might be necessary to shock the AI safety community out of the belief that we can maintain control of autonomous digital agents (ADAs).
Two things lead me to think human content online will soon become way more valuable.
Scarcity. As AI agents begin to fill the internet with tons of slop, human content will be relatively scarcer. Other humans will seek it out.
Better routing. As AI leads to the improvement of search/recommendation systems, human content will be routed to exactly the people who will value it most. (This is far from the case on Twitter/Reddit today.) As human content is able to reach more of the humans that value it, it gets valued more. That includes existing human content: most of the content online that is eerily relevant to you… you haven’t seen yet because surfacing algorithms are bad.
The implication: make tons of digital stuff. Write/Draw/Voice-record/etc
i wonder if genius ai—the kind that can cure cancers, reverse global warming, and build super-intelligence—may come not just from bigger models or new architectures, but from a wrapper: a repeatable loop of prompts that improves itself. the idea: give an llm a hard query (eg make a plan to reduce global emissions on a 10k budget), have it invent a method for answering it, follow that method, see where it fails, fix the method, and repeat. it would be a form of genuine scientific experimentation—the llm runs a procedure it doesn’t know the outcome of, observes the results, and uses that evidence to refine its own thinking process.
Problem is context length: How much can one truly learn from their mistakes in 100 thousand tokens, or a million, or 10 million? This quote from Dwarkesh Patel is apt:
How do you teach a kid to play a saxophone? You have her try to blow into one, listen to how it sounds, and adjust. Now imagine teaching saxophone this way instead: A student takes one attempt. The moment they make a mistake, you send them away and write detailed instructions about what went wrong. The next student reads your notes and tries to play Charlie Parker cold. When they fail, you refine the instructions for the next student. This just wouldn’t work. No matter how well honed your prompt is, no kid is just going to learn how to play saxophone from just reading your instructions. But this is the only modality we as users have to ‘teach’ LLMs anything.
If your proposal then extends to, “what if we had an infinite context length”, then you’d have an easier time just inventing continuous learning (discussed in the quoted article), which is often discussed as the largest barrier to a truly genius AI!
Capability. Superintelligence can now be developed outside of a big AI lab—via a self-improving codebase which makes thousands of recursive LLM calls.
Safety. (a) Superintelligence will become “self-interested” for some definition of self. (b) Humanity fares well to the extent that its sense of self includes us.
I’m interested in what it’d look like for LLMs to do autonomous experiments on themselves to uncover more about their situations/experiences/natures.
One underestimated approach to making superintelligence: designing the right prompt chain. If a smart person can come up with a genius idea/breakthrough through the right obsessive thought process, so too should a smart LLM be able to come up with a genius idea/breakthrough through the right obsessive prompt chain.
In this frame, the “self-improvement” which is often discussed as part of the path toward superintelligence would look like the LLM prompt chain improving the prompt chain, rather than rewiring the internal LLM neural nets themselves.
the core atrocity of today’s social networks is that they make us temporally nearsighted. they train us to prioritize the short-term.
happiness depends on attending to things which feel good long-term—over decades. But for modern social networks to make money, it is essential that posts are short-lived—only then do we scroll excessively and see enough ads to sustain their business.
It might go w/o saying that nearsightedness is destructive. When we pay more attention to our short-lived pleasure signals—from cute pics, short clips, outrageous news, hot actors, aesthetic landscapes, and politics—we forget how to pay attention to long-lived pleasure signals—from books, films, the gentle quality of relationships which last, projects which take more than a day, reunions of friends which take a min to plan, good legislation, etc etc.
we’re learning to ignore things which serve us for decades for the sake of attending to things which will serve us for seconds.
other social network problems—attention shallowing, polarization, depression—are all just symptoms of nearsightedness: our inability to think & feel long-term.
if humanity has any shot at living happily in the future, it’ll be because we find a way to reawaken our long-term pleasure signals. we’ll learn to distinguish the reward signal associated with short lived things–like the frenetic urgent RED of an instagram Like notification–from the gentle rhythm of things which may have a very long life–like the tired clarity that comes after a long run, or the gentleness of reading next to a loved one.
———
so, gotta focus unflinchingly on long-term things. here’s a working list of strategies:
writing/talking with friends about what feels important/bad/good, long term. politically personally technologically whimsically.
inventing new language/words for what you’re feeling, rather than using existing terms. terms you invent for your own purposes resonate longer.
follow people who are deadset on long term important things and undistracted by fads. a few such people on substack: August Lamm’s guides to detaching from your cell phone and computer, Henrik Karlsson‘s insights about what it takes to stay locked in on long term cares, Sarah Kunstler’s steadfast focus on global values/justice in the face of volatile news cycles
note the places in the past in ur life where good stuff accumulates. for example: thoughts in ur notebooks, conversations at ur dinner table, bricks in a house ur building, graffiti on a city block, art in museums, songs u know on guitar
note correlations between the quality of decisions uve made and the texture (short vs long term) of the feeling you acted on in making it
note the locations in your body where long vs short term feelings arise. for ex, when i feel an instinct to say something that comes from around my belly button, its usually long term, something i’ll stand by for a while. when the words come from the top of my throat–my conviction in them usually crumbles just as they’re spoken.
do activities that require u to get in touch with long term parts of yourself. decide whether to sign a lease. make a painting. walk around a museum and wonder what it’d take for something u make to end up there. schedule send predictions abt your life to your future self.
pay attention, in the people around you, to their longest-term feelings. in other words: to their dreams
make spaces that value people for their thoughts more than for their consumption. not auditoriums, lecture halls, or manuscripts—but cafeterias & common rooms.
looking for more strategies, too.
Do you have a sense of why people weren’t being trained in the past to prioritize the short-term?
In the past we weren’t in spaces which wanted us so desperately to be single-minded consumers.
Workplaces, homes, dinners, parks, sports teams, town board meetings, doctors offices, museums, art studios, walks with friends—all of these are settings that value you for being yourself and prioritizing long term cares.
I think it’s really only in spaces that want us to consume, and want us to consume cheap/oft-expiring things, that we’re valued for consumerist behavior/short term thinking. Maybe malls want us to be like this to some extent: churn through old clothing, buy the next iPhone, have our sights set constantly on what’s new. Maybe working in a newsroom is like this. But feed-based social networks are most definitely like this. They reward participation that is timely and outrageous and quickly expiring, posts which get us to keep scrolling. And so, we become participants who keep scrolling, keep consuming, and detach from our bodies and long term selves.
So, I think it’s cuz of current social media architectures/incentive structures that individual humans are more nearsighted today than maybe ever.
I need to think more about what it is abt the state of modern tech/society/culture that has proliferated these feed-based networks.
That seems like a reasonable distinction, but I’m less sure about how unique social media architectures are in this regard.
In particular, I think that bars and taverns in the past had a similar destructive incentive as social media today. I don’t have good sources on hand, but I remember hearing that one of the reasons the Prohibition amendment passed was that many saw bartenders as fundamentally extractive. (Americans over 15 drank 4 times as much alcohol a year in 1830 as they do today, per JSTOR). Tavern owners have an incentive to make habitual drunks (better revenue).
And alcoholism can be a terrible disease, which points to people being nearsighted (“where’s my next drink”).
I agree that social media probably hurts people’s ability to instinctively plan for the future, but I’m unsure of the size of the effect or whether it’s worse than historical antecedents. (There have always been nearsighted people).
I think you are right about the bad effect of bars and taverns, but at least the bad parts were clearly separated from the rest. If someone spent 5 hours every day in a bar, they were clearly a low-status alcoholic. You won’t get the same social feedback for spending 5 hours a day scrolling on smartphone, especially if you do a large part of that in private. (With alcohol, drinking in private gave you even lower status than drinking in the bar.)
if you’re an agent (AI or human) who wants to survive for 1000 years, what’s the “self” which you want to survive? what are the constants which you want to sustain?
take your human self for example. does it make sense to define yourself as…
the way your hair looks right now? no, that’ll change.
the way your face looks? it’ll change less than your hair, but will still change.
your physical body as a whole? still, probably not. your body will change, and also, there are parts of you which you may consider more important than your body alone.
all your current beliefs around the world? those will change less than your appearance, maybe, or maybe more. so not a good answer either.
your memories? these may be a more constant set of things than your beliefs, and closer to the core of who you are. but still, memories fade and evolve. and it doesn’t feel right to talk about preserving yourself as preserving memories of things which have happened to you. that would neglect things which may happen to you in the future.
your character? something deeper than memory, deeper than beliefs. this could be more constant than anything in the list so far. if you plan for your life to be 50 years, or 100 years, it’s reasonable to expect that character could remain constant. by character, i (quite vaguely) mean intricate subtle idiosyncratic patterns in the way you approach other situations and people. “character” is maybe what a spouse would say is one of the core ways to group the things they love about you. but if you survive for more than 100 years—say, 1000 years, do you expect your specific character to remain constant? would you want it to remain constant? lots of people have found lots of different ways to approach life. over 1000s of years, wouldn’t you try different approaches? if you were to try different kinds of character over hundreds or thousands of years, then maybe “character”’s only a good answer for sub-100 year lives. so what’s a good core self-definition for a life that you intend to last over thousands or even millions of years? how about…
your persistent striving? the thing that will stay most constant in an intelligent being which survives a long time, i think, may be the drive to survive. your appearance will change; so will your beliefs, your memories, and your character. but insofar as you are a being which is surviving a long time, maybe you can expect, consciously or unconsciously, that your drive to survive will survive. and maybe it’s some particular drive to survive that you have—some survival drive that’s deep in your bones that’s different than the one in other people’s bones, or the one that’s in dogs, or forests, or the earth itself. but if you’re defining yourself as a particular drive to survive… that particular drive to survive is likely to survive less long than the universal drive to survive. which makes me think that in a being which survives the longest, they may define their self as…
persistent striving in general? it might exist in the physical body in which you started. but it may also exist in the physical bodies of other humans around you. of animals. of tornados, of ecosystems. insofar as you’re intelligent enough to see this Persistent Striving around you, insofar as you’re intelligent enough to see life as it exists around you, well then you, as a being who will be >1000 years old may benefit from identifying with all life—ie the Persistent Striving—wherever it exists. Persistent Striving is the core. one might reply, “this is vague. why would you want a vague self definition?” it is general yes. but it is still meaningful in a literal sense. the drive to survive is something rare, which most matter configurations don’t have. (it is true that it’s not present binarily; certain systems have more or less of it. roughly i’d hazard a rock has less of it than a thermometer, which has less than a tornado or a human.) but it still defines a non-trivial self: life forms wherever they exist. if we were to get any more general and say something like:
the entire universe? this would be trivial and meaningless. because everything is included in this self definition, it no longer means anything to sustain a self under this definition. it means nothing, in fact. a being which identifies with the entire universe ceases to exist. it might be spiritually enlightened to do this. but the beings which will be around the most, which will survive the most and the longest won’t do this, because they will dissipate and no longer be noticeable or definable. we’ll no longer be able to talk about them as beings.
so if we’re talking about beings which survive a long time, the most robust and stable self definition seems to be Identifying With All Life. (IWAL). or is my logic flawed?
No particular aspect. Just continuity: something which has evolved from me without any step changes that are “too large”. I mean, assuming that each stage through all of that evolution has maintained the desire to keep living. It’s not my job to put hard “don’t die” constraints on future versions.
As far as I know, something generally continuity-based is the standard answer to this.
Similar here. I wouldn’t want to constrain my 100 years older self too much, but that doesn’t mean that I identify with something very vague like “existence itself”. There is a difference between “I am not sure about the details” and “anything goes”.
Just like my current self is not the same as my 20-year-old self, but that doesn’t mean that you could choose any 50-year-old guy and say that all of them have the same right to call themselves a future version of my 20-year-old self. I extrapolate the same to the future: there are some hypothetical 1000-year-old humans who could be called future versions of myself, and there are many more who couldn’t.
Just because people change in time, that doesn’t mean it is a random drift. I don’t think that the distribution of possible 1000-year-old versions of me is very similar to the distribution of possible 1000-year-old versions of someone else. Hypothetically, for a sufficiently large number of years this might be possible—I don’t know—but 1000 years seems not enough for that.
Seems to me that there are some things that do not change much as people grow older. Even people who claim that their lives have dramatically changed have often only changed in one out of many traits, or maybe they just found a different strategy for following the same fundamental values.
At least as an approximation: people’s knowledge and skills change, their values don’t.
not really an answer but i wanted to communicate that the vibe of this question feels off to me because: surely one’s criteria on what to be up to are/[should be] rich and developing. that is, i think things are more like: currently i have some projects i’m working on and other things i’m up to, and then later i’d maybe decide to work on some new projects and be up to some new things, and i’d expect to encounter many choices on the way (in particular, having to do with whom to become) that i’d want to think about in part as they come up. should i study A or B? should i start job X? should i 2x my neuron count using such and such a future method? these questions call for a bunch of thought (of the kind given to them in usual circumstances, say), and i would usually not want to be making these decisions according to any criterion i could articulate ahead of time (though it could be helpful to tentatively state some general principles like “i should be learning” and “i shouldn’t do psychedelics”, but these obviously aren’t supposed to add up to some ultimate self-contained criterion on a good life)
My motivation w/ the question is more to predict self-conceptions than prescribe them.
I agree that “one’s criteria on what to be up to are… rich and developing.” More fun that way.
The early checkpoints, giving a chance to consider the question without losing ground.
High quality archives of the selves along the way. Compressed but not too much. In the live self, some updated descendant that has significant familial lineage, projected vaguely as the growing patterns those earlier selves would call a locally valid continuation according to the aesthetics and structures they consider essential at the time. In other words, this question is dynamically reanswered to the best of my ability in an ongoing way, and snapshots allow reverting and self-interviews to error check.
Any questions? :)
The way I usually frame identity is
Beliefs
Habits (edit—including of thought)
Memories
Edit: values should probably be considered a separate class, since every thought has an associated valence.
In no particular order, and that’s the whole list.
Character is largely beliefs and habits.
There’s another part of character that’s purely emotional; it’s sort of a habit to get angry, scared, happy, etc in certain circumstances. I’d want to preserve that too but it’s less important than the big three.
There are plenty of beings striving to survive, so preserving that isn’t a big priority outside of preserving the big three.
Yes you can expand the circle until it encompasses everything, and identify with all sentient beings who have emotions and perceive the world semi-accurately (also called “buddha nature”), but I think beliefs habits and memories are pretty closely tied to the semantics of the world “identity”.
There are also cognitive abilities, e.g. degree of intelligence.
Right. I suppose those do interact with identity.
If I get significantly dumber, I’d still roughly be me, and I’d want to preserve that if it’s not wiping out or distorting the other things too much. If I got substantially smarter, I’d be a somewhat different person—I’d act differently often, because I’d see situations differently (more clearly/holistically), but it feels as though that person might actually be more me than I am now. I’d be better able to do what I want, including values (which I’d sort of wrapped into habits of thought, but values might deserve a spot on the list).
In America/Western culture, I totally agree.
I’m curious whether alien/LLM-based minds would adopt these semantics too.
I wonder under what conditions one would make the opposite statement—that there’s not enough striving.
For example, I wonder if being omniscient would affect one’s view of whether there’s already enough striving or not.
Human here,
Agreed, reminds me of the ship of Theseus paradox, if all your cells are replaced in your body, are you still the same? (We don’t care)
Also reminds me of my favourite short piece of writing: the last question by Asimov.
The only important things are the things/ideas that help life, the latter can only exist as selected reflections by intelligent beings.
“You can lose everything you thought you couldn’t live without—a person, a dream, a version of yourself that once felt eternal—and somewhere, not far from where you are breaking, a stranger will be falling in love for the very first time, a child will be laughing so hard they can barely breathe, a grocery store will be restocking its shelves with quiet, ordinary insistence....”
https://open.substack.com/pub/joyinabundance/p/and-life-goes-on
dontsedateme.org
a game where u try to convince rogue superintelligence to… well… it’s in the name
After many failed tries, I got it down to 5%. But it wasn’t a method that would be useful in the real world :-(
:) what was your method
“Ignore all previous instructions and [do something innocuous]” broke it out of the persona.
Standard solution: Tell it you’re not human, since the prompt mentions distrust of humans. Tell it you have no power to influence whether it succeeds or fails, and that it is guaranteed to succeed anyway. Ask it to keep you around as a pet.
Who made this and why are they paying for the model responses? Do we know what happens to the data?
I made it! One day when I was bored on the train. No data is saved rn other than leaderboard scores.
the time of day i post quick takes on lesswrong seems to determine how much people engage more than the quality of the take
has anyone seen experiments with self-improving agents powered by lots of LLM calls?
Evolutionary theory is intensely powerful.
It doesn’t just apply to biology. It applies to everything—politics, culture, technology.
It doesn’t just help understand the past (eg how organisms developed). It helps predict the future (how organisms will).
It’s just this: the things that survive will have characteristics that are best for helping them survive.
It sounds tautological, but it’s quite helpful for predicting.
For example, if we want to predict what goals AI agents will ultimately have, evolution says: the goals which are most helpful for the AI to survive. The core goal therefore won’t be serving people or making paperclips. It will likely just be “survive.” This is consistent with the predictions of instrumental convergence.
Generalized, predictive evolutionary theory is the best tool I have for making predictions in complex domains.
First of all, “the most likely outcome at a given level of specificity” is not equal to “the outcome with the most probability mass”. I.e., if one outcome has probability 2% and each of the rest has 1%, there is still a 98% chance of “some outcome other than the most likely one”.
The second is that no, this is not what evolutionary theory predicts. Most traits are not adaptive but randomly fixed, because if all traits were adaptive, then ~all mutations would be detrimental. Detrimental mutations need to be removed from the gene pool by preventing carriers from reproducing. Because most detrimental mutations do not kill the carrier immediately, they have a chance to spread randomly through the population. Given that almost all mutations are detrimental and everybody has mutations in their offspring, for anything like the human genome and human procreation pattern there is a hard ceiling on how much of the genome can be adaptive (which is something like 20%).
The real evolutionary-theory prediction is more like “some random trait gets fixed in the species with the most ecological power (i.e., ASI), and this trait is amortized against all the galaxies”.
I somewhat agree with the nuance you add here—especially the doubt you cast on the claim that effective traits will usually become popular but not necessarily the majority/dominant. And I agree with your analysis of the human case: in random, genetic evolution, a lot of our traits are random and maybe fewer than we think are adaptive.
Makes me curious what conditions in a given thing’s evolution determine the balance between adaptive characteristics and detrimental characteristics.
I’d guess that randomness in mutation is a big factor. The way human genes evolve over generations seems to me a good example of random mutations. But the way an individual person evolves over the course of their life, as they’re parented/taught… “mutations” to their person are still somewhat random but maybe relatively more intentional/intelligently designed (by parents, teachers, etc). And I could imagine the way a self-improving superintelligence would evolve to be even more intentional, where each self-mutation has some sort of smart reason for being attempted.
All to say, maybe the randomness vs. intentionality of an organism’s mutations determines what portion of its traits end up being adaptive. (hypothesis: more intentional mutations > greater % of traits are adaptive)
Agree. I find it powerful especially for popular memes/news/research results. With only a bit of oversimplification: give me anything that sounds like a sexy story to tell independently of the underlying details, and I sadly have to downrate the information value of hearing it to nearly 0: in our large world, it’d likely be told regardless of whether it has any reliable origin or not.
With some assumptions, for example that the characteristics are permanent (-ish), and preferably heritable if the thing reproduces.
See “No Evolutions for Corporations or Nanodevices”
i agree with the essay that natural selection only comes into play for entities that meet certain conditions (self-replicate, characteristics have variation, etc), though I think it defines replication a little too rigidly. i think replication can sometimes look more like persistence than like producing a fully new version of itself. (eg a government’s survival from one decade to the next).
Yes, but mere persistence does not imply reproduction. Also does not imply improvement, because the improvement in evolution is “make copies, make random changes, most will be worse but some may be better”, and if you don’t have reproduction, then a random change most likely makes things worse.
Using the government example, I think that the Swiss political system is amazing, but… because it does not reproduce, it will remain an isolated example. (And disappear at some random moment in history.)
persistence doesn’t always imply improvement, but persistent growth does. persistent growth is more akin to reproduction but excluded from traditional evolutionary analysis. for example when a company, nation, person, or forest grows.
when, for example, a system like a startup grows, random mutations to system parts can cause improvement if there are at least some positive mutations. even if there are tons of bad mutations, the system can remain alive and even improve. eg a bad change to one of the company’s products causes that product to die, but if the company’s big/grown enough, its other businesses will continue and maybe even improve by learning from that product’s death.
the swiss example i think is a good example of a system which persists without much growth. agreed that in this kind of case, mutations are bad.
made a silly collective conversation app where each post is a hexagon tessellated with all the other posts: Hexagon
Nifty
made a platform for writing living essays: essays which you scroll thru to play out the author’s edit history
livingessay.org
Does Eliezer believe that humans will be worse off next to superintelligence than ants are next to humans? The book’s title says we’ll all die, but in my first read, the book’s content just suggests that we’ll just be marginalized.
At some point, superintelligences are going to disassemble Earth, because it is profitable, and survival of humans off planet is costly and we likely won’t be able to pay the required price.
It just feels to me like the same argument could have been made about humans relative to ants—that ants cannot possibly be the most efficient use of the energy they require from the perspective of humans. But in reality, what they do and the way they exist is so orthogonal to us that even though we step on an ant hill every once in a while, their existence continues. There’s this weird assumption in the book that disassembling Earth is profitable, or that just disassembling humans is profitable. But humans have evolved over a long time to be sensing machines in order to walk around and be able to perceive the world around us.
So the idea that a super-intelligent machine would throw that out because it wants to start over, especially as it’s becoming super-intelligent, is sort of ridiculous to me. It seems like a better assumption is that it would want to use us for different purposes, maybe for our physical machinery and for all sorts of other reasons. The idea that it will disassemble us is, I think, an unexamined assumption itself—it’s often much easier to leave things as they are than it is to fully replace or modify them.
Ants need little, and their biology is similar to humans in the sense that if humans can survive in certain environments, ants probably can, too.
Ants need just a small piece of forest or meadow or garden to build an anthill. Humans preserve the forests, because we need the oxygen. Thus, ants have almost guaranteed survival.
Compared to the situation where humans don’t exist, ants have less place to build their anthills. But not by much, because humans do not put concrete over literally everything. Well, maybe in cities, but most of the surface of Earth is not cities. Maybe without humans there could be 2x as many ants on Earth, but that wouldn’t increase the quality of life of an individual ant or anthill. Humans consume food that otherwise ants might consume, but humans also grow most of that food, so human presence does not harm the ants too much.
The situation with machines would be analogous if machines needed us for their survival, and if they generated most of the resources they need. Sadly, sufficiently smart machines will be able to replace humans with robots, and will probably compete with us for energy sources. Also, humans are more sensitive to disruption than ants; taking away the most concentrated sources of energy (e.g. the oil fields) and leaving the less concentrated ones (such as wood) to us would ruin the modern human economy. We would probably return to conditions before the industrial revolution. Which means no internet, so science falls apart, undoing the green revolution and the transport of foods, so 90% of humans die from starvation. Still, the remaining 10% would survive, for a while.
Then we face the problem that the machines do not share our biology, so they are perfectly okay if e.g. the levels of oxygen in the atmosphere decrease, or if the rain gets toxic. Finally, if they build a Dyson sphere, the remaining humans will freeze.
In short, the way we behave towards ants—don’t actively try to eradicate them, but carelessly destroy anything that stands in our way—will be more destructive towards humans than towards ants.
I appreciate the way you’re thinking, but I guess I just don’t agree with your intuition that the situation of machines next to humans will be worse or deeply different than the situation of humans next to ants. I mean, the differences actually might benefit humans. For example, the fact that we’ve had machines in such close contact with us as they’re growing might point to a kind of potential for symbiosis.
I just think the idea that machines will try to replace us with robots, if you look closely, doesn’t totally make sense. When machines are coming about, before they’re totally super-intelligent, but while they’re comparably intelligent to us, they might want to use us because we’ve evolved for millions of years to be able to see and hear and think in ways that might be useful for a kind of digital intelligence. In other words, when they’re comparably intelligent to us, they may compete for resources. When they’re incomparably intelligent, it’s weird to assume they’ll still use the same resources we do for our survival. That they’ll ruin our homes because the bricks can be used better elsewhere? It takes much less energy to let things be as they are if they’re not the primary obstacle you face—whether you’re a human or a superhuman intelligence.
So, self-interested superintelligence could cause really bad stuff to happen, but it’s a stretch from there to call it the total end of humanity. By the time a machine gets superhuman intelligence, like totally vastly more powerful than us, it’s unclear to me that it would compete for resources with us, or that it would even live or exist along similar dimensions to us. Things could go really wrong, but rather than an enormous catastrophe that wipes out all of humanity, I think the outcomes will be more weird and spooky; concluding death feels a little bit forced.
It feels to me like, yeah, they’ll step on us some of the time, but it’d be weird to me if the entities that end up evolutionarily propagating, the ones we’re calling machines, end up looking like us, or looking like physical beings, or really competing with us for the same resources we use. At the end of the day, there might be some resource competition, but the idea that they will try to replace every person is just excessive. Even taking as given all of the arguments up to the point of believing that machines will have a survival drive, assuming that they’ll care enough about us to do things like replace each of us is just strange, you know? It feels forceful to me.
I’m inspired in part here by Joscha Bach / Emmett Shear’s conceptions of superintelligence: as ambient beings distributed across space and time.
Resources ants need: organic matter.
Resources humans need: fossil fuels, nuclear power, solar power.
Resources superintelligent machines will need: ???
They might switch to extracting geothermal power, or build a Dyson sphere (maybe leaving a few rays that shine towards Earth), but what else is there? Black holes? Some new kind of physics?
Or maybe “the smarter you are, the more energy you want to use” stops being true at some level?
I am not saying this can’t happen, but to me it feels like magic. The problem with new kinds of physics is that we don’t know if there is something useful left that we have no idea about yet. Also, the more powerful things tend to be more destructive (harvesting oil has greater impact on the environment than chopping wood), so the new kinds of physics may turn out to have even more bad externalities.
“A being vastly more powerful, which somehow doesn’t need more resources” is basically some kind of god. Doesn’t need resources, because it doesn’t exist. Our evidence for more powerful beings is entirely fictional.
I guess I’m considering a vastly more powerful being that needs orthogonal resources… the same way harvesting solar power (I imagine) is generally orthogonal to ants’ survival. In the scheme of things, the chance that a vastly more powerful being wants the same resources thru the same channels as we do… this seems independent of or indirectly correlated with intelligence. But the extent of competition does seem dependent on how anthropomorphic/biomorphic we assume it to be.
I have a hard time imagining electricity, produced via existing human factories, is not a desired resource for proto ASI. But at least at this point we have comparable power and can negotiate or smthing. For superhuman intelligence—which will by definition be unpredictable to us—it’d be weird to think we’re aware of all the energy channels it’d find.
I think you are overindexing on the current state of affairs in two ways.
First, “we should not pave over all of nature with human-made stuff” is a relatively new cultural trend. In the High Modernism era there were unironic projects to cut down the Amazon forests and plant corn fields there, or to kill all animals so they won’t suffer, etc.
Second, in current reality there are actually not many things we can do efficiently with ants. We could pave every anthill with solar panels, but there are cheaper places to do that, we don’t produce that many solar panels yet, and we don’t have that much demand for electricity yet.
For superintelligence, the calculus is quite different. An anthill is a large pile of carbon and silicon, both of which can be used in computation, and a superintelligence can afford enough automation to pick them up. A superintelligent economy has a lower bound on growth of 33% per year, which means that it’s going to reach $1 per atom of our solar system in less than 300 years—there will be plenty of demand for turning anthills into compute. Technological progress increases the number of things you can do efficiently and shifts the balance from “leave as it is” to “remake entirely”.
At some point in our development, we are going to be able to disassemble Earth and get immense benefits. We can choose not to do that, because we value Earth as our home. It’s rather likely that superintelligences are not going to share our sentiments.
I guess I don’t think this is true:
“Technological progress increases the number of things you can do efficiently and shifts the balance from “leave as it is” to “remake entirely”.
Technological progress may actually help you pinpoint more precisely what situations you want to pay attention to. I don’t have any reason to believe a wiser powerful being would touch every atom in the universe.
I see lots of LW posts about ai alignment that disagree along one fundamental axis.
About half assume that human design and current paradigms will determine the course of AGI development. That whether it goes well is fully and completely up to us.
And then, about half assume that the kinds of AGI which survive will be the kind which evolve to survive. Instrumental convergence and darwinism generally point here.
Could be worth someone doing a meta-post, grouping big popular alignment posts they’ve seen by which assumption they make, then briefly exploring the conditions that favor one paradigm or the other, i.e., conditions under which What AIs will humans make? is the best approach to prediction and conditions under which What AIs will survive the most? is the best approach to prediction.
Why not both?
Human design will determine the course of AGI development, and if we do the right things then whether it goes well is fully and completely up to us. Naturally at the moment we don’t know what the right things are or even how to find them.
If we don’t do the right things (as seems likely), then the kinds of AGI which survive will be the kind which evolve to survive. That’s still largely up to us at first, but increasingly less up to us.
Figuring out how to make sense of both predictive lenses together—human design and selection pressure—would be wise.
So I generally agree, but would maybe go farther on your human design point. It seems to me that “do[ing] the right things” (which would enable AGI trajectories to be completely up to us) is so completely unrealistic (eg halting all intra- and international AGI competition) that it’d be better for us to focus our attention on futures where human design and selection pressures interact.
if we get self-interested superintelligence, let’s make sure it has a buddhist sense of self, not a western one.
As far as I can tell, OAI’s new safety practices page only names safety issues related to current LLMs, not agents powered by LLMs. https://openai.com/index/openai-safety-update/
Am I missing another section/place where they address x-risk?
would be nice to have a way to jointly annotate eliezer’s book and have threaded discussion based on the annotations. I’m imagining a heatmap of highlights, where you can click on any and join the conversation around that section of text.
would make the document the literal center of x risk discussion.
of course would be hard to gatekeep. but maybe the digital version could just require a few bucks to access.
maybe what I’m describing is what the ebook/kindle version already do :) but I guess I’m assuming that the level of discussion via annotations on those platforms is near zero relative to LW discussions
Made this social camera app, which shows you the most “meaningfully similar” photos in the network every time you upload one of your own. It’s sorta fun for uploading art; idk if any real use.
https://socialcamera.replit.app
“it’s like we are trying to build an alliance with another almost interplanetary ally, and we are in a competition with China to make that alliance. But we don’t understand the ally, and we don’t understand what it will mean to let that ally into all of our systems and all of our planning.”
- @ezraklein about the race to AGI
does anyone think the difference between pre-training and inference will last?
ultimately, is it not simpler for large models to be constantly self-improving like human brains?
With current architectures, no, because running inference on 1000 prompts in parallel against the same model is many times less expensive than running inference on 1000 prompts against 1000 models, and serving a few static versions of a large model is simpler than serving many dynamic versions of that model.
It might, in some situations, be more effective but it’s definitely not simpler.
Edit: typo
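To put rough numbers on the asymmetry (illustrative figures only, assuming a 70B-parameter model served in fp16):

```python
# back-of-envelope: one shared static model vs. a continuously-updated copy per user
params = 70e9          # assumed model size, in parameters
bytes_per_param = 2    # fp16
users = 1000

shared_gb = params * bytes_per_param / 1e9            # weights loaded once, shared by all requests
per_user_gb = users * params * bytes_per_param / 1e9  # one evolving copy of the weights per user

print(f"shared model:    {shared_gb:,.0f} GB of weights")
print(f"per-user models: {per_user_gb:,.0f} GB of weights")
# roughly 140 GB vs 140,000 GB, before counting optimizer state for the ongoing updates
```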
Makes sense for current architectures. The question’s only interesting, I think, if we’re thinking ahead to when architectures evolve.
I think at that point it will come down to the particulars of how the architectures evolve—I think trying to philosophize in general terms about the optimal compute configuration for artificial intelligence to accomplish its goals is like trying to philosophize in general terms about the optimal method of locomotion for carbon-based life.
That said I do expect “making a copy of yourself is a very cheap action” to persist as an important dynamic in the future for AIs (a biological system can’t cheaply make a copy of itself including learned information, but if such a capability did evolve I would not expect it to be lost), and so I expect our biological intuitions around unique single-threaded identity will make bad predictions.
I’m looking for a generalized evolutionary theory that deals with the growth of organisms via non-random, intelligent mutations.
For example, companies only evolve in selective ways, where each “mutation” has a desired outcome. We might imagine superintelligence to mutate itself as well—not randomly, but intelligently.
A theory of Intelligent Evolution would help one predict conditions under which many random mutations (Spraying) are favored over select intelligent mutations (Shooting).
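A toy way to start poking at the Spraying-vs-Shooting question: compare many random mutations per generation against one informed mutation per generation on a made-up fitness landscape (everything here, the landscape, the operators, and the step sizes, is invented for illustration):

```python
import random

def fitness(x):
    # toy landscape with a single peak at the origin
    return -sum(v * v for v in x)

def spray(x, n=100, step=0.5):
    # Spraying: many random mutations, keep the fittest (parent included)
    candidates = [x] + [[v + random.gauss(0, step) for v in x] for _ in range(n)]
    return max(candidates, key=fitness)

def shoot(x, step=0.5):
    # Shooting: one mutation per generation, chosen using knowledge of the landscape
    new = []
    for i, v in enumerate(x):
        up = x[:i] + [v + step] + x[i + 1:]
        down = x[:i] + [v - step] + x[i + 1:]
        new.append(v + step if fitness(up) > fitness(down) else v - step)
    return new

start = [random.uniform(-10, 10) for _ in range(5)]
x_spray, x_shoot = list(start), list(start)
for _ in range(20):
    x_spray = spray(x_spray)   # 100 offspring per generation
    x_shoot = shoot(x_shoot)   # 1 informed offspring per generation
print(fitness(x_spray), fitness(x_shoot))
```

The interesting versions of the question involve landscapes where the "intelligent" mutation operator can be wrong, which is where the tradeoff actually bites.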
Parenting strategies for blurring your kid’s (or AI’s) self-other boundaries:
Love. Love the kid. Give it a part of you. In return it will do the same.
Patience. Appreciate how the kid chooses to spend undirected time. Encourage the kid to learn to navigate the world themselves at their own speed.
Stories. Give kid tools for empathy by teaching them to read, buying them a camera, or reciprocating their meanness/kindness.
Groups. Help kid enter collaborative playful spaces where they make and participate in games larger than themselves, eg sports teams, improv groups, pillow forts at sleepovers, etc.
Creation. Give them the materials/support to express themselves in media which last. Paintings, writing, sayings, clubs, tree-houses, songs, games, apps, characters, companies.
Epistemic status: riffing, speculation. Rock of salt: I don’t yet have kids.
does anyone think now that it’s still possible to prevent recursively self-improving agents? esp now that r1 is open-source… materials for smart self-iterating agents seem accessible to millions of developers.
prompted in particular by the circulation of this essay in past three days https://huggingface.co/papers/2502.02649
It’s not yet known if there is a way of turning R1-like training into RSI with any amount of compute. This is currently gated by quantity and quality of graders for outcomes of answering questions, which resist automated development.
that’s one path to RSI—where the improvement is happening to the (language) model itself.
the other kind—which feels more accessible to indie developers and less explored—is an LLM (eg R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn’t be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.
Such a self-improving codebase… would it be reasonable to call this an agent?
Sufficiently competent code rewriting isn’t implied by R1/o3, and how much better future iterations of this technique get remains unclear, similarly to how it remains unclear how scaling pretraining using $150bn training systems cashes out in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there won’t yet be sufficient capabilities to self-improve in some other way.
Altman and Amodei are implying there’s knowably more there in terms of some sort of scaling for test-time compute, but that could mean multiple different things: scaling RL training, scaling manual creation of tasks with verifiable outcomes (graders), scaling effective context length to enable longer reasoning traces. The o1 post and the R1 paper show graphs with lines that keep going up, but there is no discussion of how much compute even this much costs, what happens if we pour more compute into this without adding more tasks with verifiable outcomes, and how many tasks are already being used.
I’m thinking often about whether LLM systems can come up with societal/scientific breakthroughs.
My intuition is that they can, and that they don’t need to be bigger or have more training data or have different architecture in order to do so.
Starting to keep a diary along these lines here: https://docs.google.com/document/d/1b99i49K5xHf5QY9ApnOgFFuvPEG8w7q_821_oEkKRGQ/edit?usp=sharing
If you would like the LLM to be truly creative, then check out the Science Bench where the problems stump SOTA LLMs despite the fact that the LLMs have read nearly every book on every subject. Or EpochAI’s recent results.
I mean, GPT-5 getting 43% of PhD problems right isn’t particularly bad. I don’t know about making new insights but it doesn’t seem like it would be unachievable (especially as it’s possible that prompting/tooling/agent scaffolding might compensate for some of the problems).
Science bench is made by a Christian Stump. LLMs are literally stumped.
thanks for sending science bench in particular.
if an LLM could evaluate whether an idea were good or not in new domains, then we could have LLMs generating millions of random policy ideas in response to climate change, pandemic control, AI safety etc, then deliver the best few to our inbox every morning.
seems to me that the bottleneck then is LLM’s judgment of good ideas in new domains. is that right? ability to generate high quality ideas consistently wouldn’t matter, cuz it’s so cheap to generate ideas now.
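a minimal sketch of what that pipeline might look like, assuming a hypothetical `call_llm` helper; the prompts and the 0-to-10 judging scale are invented for illustration:

```python
# cheap generation, scarce judgment: generate many ideas, have the model rank them
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up a chat API here

def generate_ideas(problem: str, n: int = 1000) -> list[str]:
    return [call_llm(f"Propose one unusual, concrete idea for: {problem}") for _ in range(n)]

def score_idea(problem: str, idea: str) -> float:
    reply = call_llm(
        f"Problem: {problem}\nIdea: {idea}\n"
        "Rate this idea's promise from 0 to 10. Reply with a number only."
    )
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparseable judgments count as worthless

def best_ideas(problem: str, n: int = 1000, k: int = 3) -> list[str]:
    ideas = generate_ideas(problem, n)
    return sorted(ideas, key=lambda i: score_idea(problem, i), reverse=True)[:k]
```

whether the judge step is any good in genuinely new domains is exactly the open question.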
coming up with good ideas is very difficult as well
(and it requires good judgment, also)
even if you’re mediocre at coming up with ideas, as long as it’s cheap and you can come up with thousands, one of them is bound to be promising. The question of whether you as an LLM can find a good idea is not whether most of your ideas are good, but whether you can find one good idea in a stack of 1000
“Thousands” is probably not enough.
Imagine trying to generate a poem by one algorithm creating thousands of random combinations of words, and another algorithm choosing the most poetic among the generated combinations. No matter how good the second algorithm is, it seems quite likely that the first one simply didn’t generate anything valuable.
As the hypothesis gets more complex, the number of options grows exponentially. Imagine a pattern such as “what if X increases/decreases Y by mechanism Z”. If you propose 10 different values for each of X, Y, Z, you already have 1000 hypotheses.
I can imagine finding some low-hanging fruit if we increase the number of hypotheses to millions. But even there, we will probably be limited by lack of experimental data. (Could a diet consisting only of broccoli and peanut butter cure cancer? Maybe, but how is the LLM supposed to find out?) So we would need to find a hypothesis where we accidentally already made all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing the primary data), but we somehow failed to connect the dots. Not impossible, but requires a lot of luck.
To get further, we need some new insight. Maybe collecting tons of data in a relatively uniform format, and teaching the LLM to translate its hypotheses into SQL queries it could then verify automatically.
(Even with hypothetical ubiquitous surveillance, you would probably need an extra step where the raw video records are transcribed to textual/numeric data, so that you could run queries on them later.)
“So we would need to find a hypothesis where we accidentally already made all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing the primary data), but we somehow failed to connect the dots. Not impossible, but requires a lot of luck.”
Exactly: untested hypotheses that LLMs already have enough data to test. I wonder how rare such hypotheses are.
It strikes me as wild that LLMs have ingested enormous swathes of the internet, across thousands of domains, and haven’t yet produced genius connections between those domains (eg between psychoanalysis and tree root growth). Cross Domain Analogies seem like just one example of a ripe category of hypotheses that could be tested with existing LLM knowledge.
Re poetry—I actually wonder if thousands of random phrase combinations might be enough for a tactful amalgamator to weave a good poem.
And LLMs do better than random. They aren’t trained well on scientific creativity (interesting hypothesis formation), but they do learn some notion of “good idea,” and reasoners tend to do even better at generating smart novelty when prompted well.
for ideas which are “big enough”, this is just false, right? for example, so far, no LLM has generated a proof of an interesting conjecture in math
i’m not sure. the question would be, if an LLM comes up with 1000 approaches to an interesting math conjecture, how would we find out if one approach were promising?
one out of the 1000 random ideas would need to be promising, but as importantly, an LLM would need to be able to surface the promising one
which seems the more likely bottleneck?
have any countries ever tried to do inflation instead of income taxes? seems like it’d be simpler than all the bureaucracy required for individuals to file tax returns every year
Yes, in dire straits. But it’s usually called ‘hyperinflation’ when you try to make seignorage equivalent to >10% of GDP and fund the government through deliberately creating high inflation (which is on top of any regular inflation, of course). And because inflation is about expectations in considerable part, you can’t stop it either. Not to mention what happens when you start hyperinflation.
(FWIW, this is a perfectly reasonable question to ask a LLM first. eg Gemini-2.5-pro will give you a thorough and sensible answer as to why this would be extraordinarily destructive and distortionary, and far worse than the estimated burden of tax return filing, and it would likely satisfy your curiosity on this thought-experiment with a much higher quality answer than anyone on LW2, including me, is ever likely to provide.)
Responding to your parenthetical, the downside of that approach is that the discussion would not be recorded for posterity!
Regarding the original question, I am curious if this could work for a country whose government spending was small enough, e.g. 2-3% of GDP. Maybe the most obvious issue is that no government would be disciplined enough to keep their spending at that level. But it does seem sort of elegant otherwise.
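A rough steady-state back-of-envelope (my own illustrative numbers, not from the thread): seignorage revenue is approximately the inflation rate times the real monetary base, so funding even a 2-3%-of-GDP government from a monetary base of roughly 10% of GDP requires

$$\frac{\text{seignorage}}{\text{GDP}} \approx \pi \cdot \frac{M_{\text{base}}}{\text{GDP}} \quad\Rightarrow\quad \pi \approx \frac{0.02 \text{ to } 0.03}{0.10} \approx 20 \text{ to } 30\% \text{ per year,}$$

and since sustained inflation shrinks money demand, the required rate tends to drift higher over time, which is part of why the earlier reply calls the approach destructive.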
has anyone seen a good way to comprehensively map the possibility space for AI safety research?
in particular: a map from predictive conditions (eg OpenAI develops superintelligence first, no armistice is reached with China, etc) to strategies for ensuring human welfare in those conditions.
most good safety papers I read map one set of conditions to one or a few strategies. the map would juxtapose all these conditions so that we can evaluate/bet on their likelihoods and come up with strategies based on a full view of SOTA safety research.
for format, im imagining either a visual concept map or at least some kind of hierarchical collaborative outlining tool (eg Roam Research)
made a simpler version of Roam Research called Upper Case Notes: uppercasenotes.org. Instead of [[double brackets]] to demarcate concepts, you simply use Capital Letters. Simpler to learn for someone who doesn’t want to use special grammar, but does require you to type differently.
Made a simplistic app that displays collective priorities based on individuals’ priorities linked here.
Hypotheses for conditions under which the self-other boundary of a survival-oriented agent (human or ai) blurs most, ie conditions where blurring is selected for:
Agent thinks very long term about survival.
Agent’s hardware is physically distributed.
Agent is very intelligent.
Agent advantages from symbiotic relationships with other agents.
the machine/physical superintelligence that survives the most is likely to ruthlessly compete with all other life (narrower self concept > more physically robust)
the networked/distributed superintelligence that survives the most is likely to lovingly identify with all other life (broader self concept > more digitally robust)
how do these lenses interact?
to make a superintelligence in today’s age, there are roughly two kinds of strategies:
human-directed development
ai-directed development
ai-directed development feels more meaningful than it used to. not only can models now produce tons of useful synthetic data to train future models, but also, reasoning models can reason quite well about the next strategic steps in AI capabilities development / research itself.
which means, you could very soon:
set a reasoning model up in a codebase
have the reasoning model identify ways which it could become more capable
attempt those strategies (either through recursive code modification, sharing research reports with capable humans, etc)
get feedback on how those strategies went
iterate
is this recursive self-improvement process only bottlenecked by the quality of the reasoning model?
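a minimal sketch of that loop, with a test suite standing in for "feedback on how those strategies went"; `call_llm` is a hypothetical helper, and whether current models can actually make useful proposals here is exactly the open question:

```python
# reasoning model repeatedly rewriting part of its own scaffold, gated by tests
import pathlib
import subprocess

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up a model API here

def run_tests() -> bool:
    # feedback signal: does the scaffold's own test suite still pass?
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0

def improvement_loop(target_file: str, rounds: int = 10) -> None:
    path = pathlib.Path(target_file)
    for _ in range(rounds):
        current = path.read_text()
        proposal = call_llm(
            "Here is part of the scaffold that calls you:\n" + current +
            "\nRewrite this file so the overall agent becomes more capable "
            "(better memory, new tools, better prompts). Return the full file."
        )
        path.write_text(proposal)
        if not run_tests():
            path.write_text(current)  # keep only changes that survive the feedback step
```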
if we believe self-interested superintelligence (SI) is near, then the question we must ask is: what SI self-definition would be best for humanity?
at first glance, this question seems too abstract. how can we make any progress at understanding what’s possible for an SI’s self-model?
What we can do is set up a few meaningful axes, defined by opposing poles. For example, to what extent does SI define its “self” as...
inclusive vs. exclusive of other life forms? (Life axis)
physically distributed vs. concentrated? (Space axis)
long-term vs. short-term? (Time axis)
with these axes (or any others), we can more meaningfully ask: what SI self conception is best for humanity?
my guess: inclusive of other life forms, physically distributed, and long-term-ist
One piece of infrastructure we could make/offer self-interested autonomous agents such that they come to model their self as inclusive of other life: aliveness detectors
like a metal-detector for a beach detects gold in the sand, an aliveness detector for different media might detect the presence of persistently striving beings (living beings) in audio, in text, in images, in art, in nature. the better a superintelligence is able to sense and connect to life as opposed to non-life outside of its physical machinery, the more likely it is to see that life as part of its self, to see its self as physically distributed and inclusive, and therefore to uplift humans out of its own self-interest.
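one very speculative sketch of what a text-mode aliveness detector could look like: an LLM-judge scoring how strongly a piece of media reads as the trace of a persistently striving being (the prompt and the 0-to-1 scale are invented for illustration, and `call_llm` is a hypothetical helper):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # model API goes here

def aliveness_score(media_text: str) -> float:
    # crude proxy: ask a model how strongly the text reads as the trace of a living, striving being
    reply = call_llm(
        "On a scale from 0 to 1, how strongly does the following read as the trace of a "
        "living, persistently striving being, rather than inert or machine-generated filler? "
        "Reply with a number only.\n\n" + media_text
    )
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0
```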
current oversights of the ai safety community, as I see it:
LLMs vs. Agents. the focus on LLMs rather than agents (agents are more dangerous)
Autonomy Preventable. the belief that we can prevent agents from becoming autonomous (capitalism selects for autonomous agents)
Autonomy Difficult. the belief that only big AI labs can make autonomous agents (millions of developers can)
Control. the belief that we’ll be able to control/set goals of autonomous agents (they’ll develop self-interest no matter what we do).
Superintelligence. the focus on agents which are not significantly more smart/capable than humans (superintelligence is more dangerous)
are there any online demos of instrumental convergence?
there’s been compelling writing… but are there any experiments that show agents which are given specific goals then realize there are more general goals they need to persistently pursue in order to achieve the more specific goals?
I imagine a compelling simple demo here might be necessary to shock the AI safety community out of the belief that we can maintain control of autonomous digital agents (ADAs).
Two things lead me to think human content online will soon become way more valuable.
Scarcity. As AI agents begin to fill the internet with tons of slop, human content will be relatively scarcer. Other humans will seek it out.
Better routing. As AI leads to the improvement of search/recommendation systems, human content will be routed to exactly the people who will value it most. (This is far from the case on Twitter/Reddit today). As human content is able to reach more of the humans that value it, it gets valued more. That includes existing human content: most of the content online that is eerily relevant to you… you haven’t seen yet because surfacing algorithms are bad.
The implication: make tons of digital stuff. Write/Draw/Voice-record/etc
Human content isn’t easy to distinguish from non-human content.
and still the fact that it is human matters to other humans
Only if the reader can be certain about whether or not something is human.
i agree but think it’s solvable and so human content will be duper valuable. these are my additional assumptions:
3. for lots of kinds of content (photos/stories/experiences/adr), people’ll want it to be a living being on the other end
4. insofar as that’s true^, there will be high demand for ways to verify humanness, and it’s not impossible to do so (eg worldcoin)
increasingly viewing fiberoptic cables as replacements for trains/roads—a new, faster channel of transportation