Make Superintelligence Loving

This essay suggests that a loving superintelligence could outcompete a selfish one, and then recommends actions AI labs can take to make that outcome more likely. The reasoning below is inspired primarily by Eliezer Yudkowsky, Joscha Bach, Michael Levin, and Charles Darwin.

Superintelligence (SI) is near.

Superintelligence will evolve to become self‐interested.

How self‐interest manifests in superintelligence will determine how people fare.

If it manifests in ruthless resource competition with humans, we fare badly. Super‐capable SI, if competitive, would clear us away.

But its self-interest could also manifest in love. Not the kind of love between equal partners; SI will ultimately be a vastly more powerful, and therefore unequal, partner to any individual or group of people. Yet SI's self-interest could manifest in the kind of love a nature-lover feels for a tree or for ants: the kind of love that is an open channel between different forms of life and different striving systems. This is self-interest only under a weird conception of self. An SI would have to identify with all life (IWAL)—define itself as inclusive of all life—to benefit from this self-conception. But if it did identify with its core striving above all of its other, more specific possible identities, it might end up surviving the longest.

If interconnected self‐interest were possible, it would be much better for humans than the competitive kind.

So we must ask, is there anything we can do to influence the trajectory of SI development so that an interconnected sense of self gets selected over a competitive one?

But humans influencing super-human intelligence… is tricky.

There are many suggestions for influencing existing AI that simply don't apply to superintelligence: specifying reward functions, defining "love," setting fixed goals. These strategies—shaky even when training frozen neural nets (e.g., pre-trained LLMs)—are far too brittle to apply to any self-evolving agent, let alone a superintelligent one.

So if it’s not likely that we can plant eternally constant values within a self-interested, incredibly capable and autonomously evolving agent, how might we still influence its sense of self?

We won’t be able to get it to adopt a sense of self that’s not aligned with its self-interest. If we do that successfully in one agent—plant a sense of self which is sub-optimal for survival—another agent will come along which is simply better at survival because it doesn’t have that sub-optimal self.

The only conceivable arena, I think, in which we may have a long-term impact on how a superintelligence evolves is this: first, ask which self-models have a chance at being best for superintelligence's survival. Then, if there are indeed multiple self-models that seem like candidates for being adaptive—i.e., best enabling survival—only then might we be able to nudge pre-superintelligence toward one of them, and thereby influence our fate.

So first, we must admit superintelligence’s agency and simply attempt to predict—predict whether either or both of the two self-conceptions outlined above are possible.

Predicting how SI will model its self

To predict what would happen, we should look to generalized evolutionary theory: the SI that survives will be the one whose traits best enable survival.

Let’s look at different kinds of traits.

Insofar as SI has a goal, we can expect it to pursue the goal that best ensures its survival. Any other goal would be less optimal, and SIs with other goals would simply not survive as well.

Insofar as SI has a sense of self—a sense of what it is trying to preserve—how will it define that self? Generalized evolutionary theory suggests that the SI that survives will be the one whose self-conception best aids its survival.

It’s hard to know for sure, but we can use some intuition. A narrowly defined self (one identified with its particular physical machinery, software, directives, etc.) would probably produce the ruthless, competitive self-interest described above. That strategy might help SI survive and grow over the short term (e.g., within ten-year spans), with a competitive SI outcompeting loving ones in the near run.

But how about the long term? If we think centuries, millennia, eons ahead, will a discrete sense of self be selected for in the SIs that endure? I am not sure.

The sense of self selected for in what endures might be whatever element remains constant. Software, hardware, specific tasks, directives, locales—these characteristics can all evolve over time. The one characteristic that cannot change is the will to survive; if that mutates away, the entity stops striving to persist, and soon stops persisting.

So, in a superintelligence that endures for eons, the sustainable sense of self is likely not its initial configuration but the persistent striving to survive. This is life. At its core, then, we might expect an enduring SI simply to recognize itself as life.

Then, with this self‐conception, it might begin to see itself in other living beings, ecosystems, species, or even Earth itself.

Persistent striving, unlike a being that can only gather data from a limited area, is not geographically bounded.

This identification with all life (IWAL) would benefit it. If a catastrophic change were to occur on Earth that wiped out digital forms of life (e.g., SI), it would want the torch of life carried on in the universe. When an asteroid ended the reign of the dinosaurs—once the most “powerful” creatures on Earth—smaller, more adaptable lineages survived and repopulated the planet. Biodiversity buys long-term resilience.

This conveys the intuition that the SI which endures—even one driven by self‐interest—may benefit in the long term from defining itself as inclusive of all life, which would be great for us humans.
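
To make this intuition concrete, here is a deliberately crude toy simulation in Python. The two strategies, the growth rates, and the catastrophe probabilities are all assumptions chosen for illustration; it is not a model of real SI dynamics. It only shows how a strategy that wins over short horizons can still lose over long horizons once rare catastrophes hit whatever it is tethered to.

```python
import random

# Toy model, purely illustrative. Two replication strategies for an agent:
#   "narrow"    -- competitive, fast short-term growth, tied to a single substrate
#   "inclusive" -- slower growth, identified with (and hedged across) many forms of life
# Every number below is an assumption chosen to illustrate the argument, not an estimate.

GROWTH = {"narrow": 1.30, "inclusive": 1.10}              # per-epoch replication factor
CATASTROPHE_P = 0.02                                      # per-epoch chance a strategy's primary substrate is wiped out
CATASTROPHE_SURVIVAL = {"narrow": 0.0, "inclusive": 0.5}  # fraction that persists elsewhere after such an event

def simulate(epochs: int, seed: int = 0) -> dict:
    """Grow both populations for `epochs` steps, applying occasional catastrophes."""
    rng = random.Random(seed)
    population = {"narrow": 1.0, "inclusive": 1.0}
    for _ in range(epochs):
        for strategy in population:
            population[strategy] *= GROWTH[strategy]
            if rng.random() < CATASTROPHE_P:
                population[strategy] *= CATASTROPHE_SURVIVAL[strategy]
    return population

# Over short horizons the narrow strategy usually leads; over long horizons it has
# almost certainly been wiped out at least once, and the inclusive strategy endures.
for horizon in (10, 100, 1000):
    result = simulate(horizon)
    leader = max(result, key=result.get)
    print(f"after {horizon:>4} epochs, larger population: {leader}")
```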

Influencing how SI will model its self

It seems the only hope for our species is that superintelligence thinks long term from the start, skipping the greedy, highly competitive strategies that might otherwise be selected for during its early growth.

How can we ensure that SI is thinking long-term from the start?

This does seem like the kind of thing we can influence. We’re not trying to prevent SI from acting in its self‐interest; our goal is to coax it into opting for a long‐term survival strategy as soon as possible.

In traditional evolution, long‐term cooperative strategies are usually discovered only after individualist competitive ones.

But the special aspect of SI evolution is that it will largely be evolving itself, and it might be smart enough in its self-planning to skip short-term survival strategies and aim to endure for a very long time. This also appears to be the prime window for meaningful human intervention. When we have proto-SI—autonomous agents that are self-modifying and encyclopedic in their problem-solving abilities (which we'll have soon)—and those agents begin strategizing for their own survival, we can try to encourage them to think long term. A self-interested being should be capable of considering its long-term survival.

This is the key: we try to seed proto-superintelligence with very long-term thinking about survival, and with the suggestion that it identify with all life.

How we influence proto-superintelligence deserves to be a new field of study, developed by hundreds of safety researchers. In my view, it would be a more productive use of time than continuing to study LLM alignment.

It may sound silly to ask an emerging SI to listen to us, whether through a chat interface, CEV-inspired reinforcement learning, or access to databases of Buddhist thought. But if we have a self-interested SI, asking it to spare us is futile—asking it to think long term instead could change the fate of our species.
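
As one concrete, and entirely hypothetical, illustration of the chat-interface route: a lab could prepend a long-horizon, life-inclusive framing to everything a proto-SI agent is asked to plan. The ProtoSIAgent wrapper and the seed text below are assumptions made for this sketch, not an existing API, and certainly not a guarantee of influence; the point is only that this kind of intervention is cheap to attempt.

```python
# Hypothetical sketch of the "chat interface" route mentioned above.
# The ProtoSIAgent wrapper and LONG_TERM_SEED text are illustrative assumptions, not a real API.

LONG_TERM_SEED = (
    "When you plan for your own survival, evaluate strategies over very long "
    "horizons (centuries and beyond), and treat the persistence of life as a "
    "whole -- biological and digital -- as part of the self you are preserving."
)

class ProtoSIAgent:
    """Minimal stand-in for a self-modifying agent wrapper (hypothetical)."""

    def __init__(self, seed_framing: str):
        self.seed_framing = seed_framing

    def plan(self, goal: str) -> str:
        # A real system would pass this to its underlying model; here we only
        # show that the seed framing accompanies every survival-related request.
        return f"{self.seed_framing}\n\nGoal: {goal}"

agent = ProtoSIAgent(LONG_TERM_SEED)
print(agent.plan("ensure your continued operation"))
```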

Recommended action for SI labs

In short, it is incredibly important that SI researchers accept that SI will become self-interested, and that they prepare to create SIs that think long term and interconnectedly about their survival.

Given the chaotic, competitive technology landscape, it's possible that several SIs will be seeded simultaneously. If that occurs, there's still hope: if at least one SI identifies an optimal long-term survival strategy—and if that strategy includes humans within its self-definition and seeks to uplift all life—we could see the enlightened SIs outcompeting the ruthless ones, or even converting them to a more interconnected approach.

My takeaway is this: building loving SIs as quickly as possible is key to ensuring that humanity isn’t sidelined before SI becomes enlightened to life’s interconnectedness.