Thank you for the encouragement, recommendations, and for flagging the need for more context on strong ASI models, including the default extremity of the transition!
You’re spot on; my DeepMind talk emphasized horizontal alignment (defense against coordination failures) as a complement to vertical alignment perils, like those in the orthogonality thesis and instrumental convergence.
I’ve pre-ordered the IABIED book and have now re-read several recommendations: “AGI Ruin” details the lethalities, and “A Central AI Alignment Problem” highlights the risk of the sharp left turn. I’ve also just reviewed “Five Theses, Two Lemmas,” which reinforces the intelligence explosion and the complexity/fragility of value, and points to indirect normativity as a path to safer goals.
These sharpen why 6pack.care prioritizes local kami (bounded, non-maximizing agents) to mitigate unbounded optimization and promote technodiversity over singleton risks.
Topics I’d love to discuss further:
How might heterarchical ecologies of multipolar AI mitigate instrumental convergence?
How would “thick”, pluralistic alignment integrate with indirect normativity?
In slower takeoff scenarios, could subsidiarity (as envisioned in d/acc 2035) help navigate sharp left turns?
Nice! Glad you’re getting stuck in, and good to hear you’ve already read a bunch of the background materials.
The idea of bounded non-maximizing agents / multipolar setups as safer has looked hopeful to many people during the field’s development. It’s a reasonable place to start, but my guess is that if you zoom in on the dynamics of those systems, they look profoundly unstable. I’d be enthusiastic to have a quick call to explore the parts of that debate interactively. I’d link a source explaining it, but I think the alignment community hasn’t done a great job of writing up the response to this so far.[1]
The very quick headline is something like:
Long-range consequentialism is convergent: unless there are strong guarantees of boundedness or non-maximization which apply to all successors of an AI, powerful dynamical systems fall towards being consequentialists
Power-seeking patterns tend to differentially acquire power
As the RSI (recursive self-improvement) cycle spins up, the power differential between humans and AI systems gets so large that we can’t meaningfully steer, and we become easily manipulable
Even if initially multipolar, the AIs can engage in a value handshake and effectively merge in a way that’s strongly positive sum for them, and humans are not easily able to participate + would not have as much to offer, so likely get shut out
The nearest unblocked strategy problem means that attempts to shape the AI with rules get routed around at high power levels
I’d be interested to see if we’ve been missing something, but my guess is that systems containing many moderately capable agents (around the level of a top human capabilities researcher), trained away from being consequentialists in a fuzzy way, almost inevitably fall into the attractor of very capable systems either directly taking power from humans or puppeteering humans’ agency as the AIs improve.
Quick answer-sketches to your other questions:
We’d definitely want an indirect normativity scheme which captures thin concepts. One thing to watch for here is that the process for capturing and aligning to thin concepts is principled and robust (including e.g. to a world with super-persuasive AI), as minor divergences between conceptions of thick concepts could easily cause the tails to come apart catastrophically at high power levels.
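(As a toy numerical illustration of that tails-coming-apart worry, entirely my own construction and not from any of the linked posts: even a proxy that is 0.95-correlated with what we actually value selects noticeably worse targets the harder you optimize on it, and the correlation itself weakens in the tail.)

```python
# Toy sketch (illustrative only): a proxy that is 0.95-correlated with the true
# target overall still comes apart from it under strong selection pressure.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
true_value = rng.normal(size=n)                                        # what we actually care about
proxy = 0.95 * true_value + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # imperfect stand-in

print("overall correlation:", round(float(np.corrcoef(proxy, true_value)[0, 1]), 3))

for top_frac in (0.1, 0.01, 0.0001):
    k = int(n * top_frac)
    idx = np.argsort(proxy)[-k:]                                 # select hardest on the proxy
    achieved = true_value[idx].mean()                            # true value actually obtained
    attainable = np.sort(true_value)[-k:].mean()                 # true value of optimizing directly
    tail_corr = np.corrcoef(proxy[idx], true_value[idx])[0, 1]   # correlation within the tail
    print(f"top {top_frac:.4%} by proxy: achieved {achieved:.2f} of attainable "
          f"{attainable:.2f}, in-tail correlation {tail_corr:.2f}")
```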
Skimming through d/acc 2035, it looks like they mostly assume that the sharp-left-turn-generating dynamics don’t happen, rather than suggesting things which avoid those dynamics.[2] They do touch on competitive dynamics in the uncertainties and tensions section, but it doesn’t feel effectively addressed, and it doesn’t seem to be modelling the situation as competition between vastly more cognitively powerful agents and humans?
One direction that I could imagine being promising, and something your skills might be uniquely suited for, would be to do a large-scale consultation, informed by clarity about what technology at physical limits is capable of, to collect data about humanity’s ‘north star’. Let people think through where we would actually like to go, so that a system trying to support humanity’s flourishing can better understand our values. I funded a small project to try and map people’s visions of utopia a few years back (e.g.), but the sampling and structure wasn’t really the right shape to do this properly.
https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-goal-directedness is one of the less bad attempts to cover this, @the gears to ascension might know or be writing up a better source
(plus lots of applause lights for things which are actually great in most domains, but don’t super work here afaict)
On north star mapping: do the CIP Global Dialogues and the GD Challenge look like something of that shape, or something more like the AI Social Readiness Process?
On Raemon’s (very insightful!) piece about curing cancer inevitably routing through consequentialism: earlier this year I visited Bolinas, a birthplace of integrative cancer care, which centers healing for communities and was catalyzed by people experiencing cancer. This care ethic prioritizes virtues like attentiveness and responsiveness to relational health over outcome optimization.
Asking a superintelligence to ‘solve cancer’ in one fell swoop — regardless of collateral disruptions to human relationships, ecosystems, or agency — directly contravenes this, as it reduces care to a terminal goal rather than an ongoing, interdependent process.
In a d/acc future, one tends to the research ecosystem so that progress emerges through horizontal collaboration: e.g., one kami for protein‑folding simulation, one for cross‑lab knowledge sharing; none has the unbounded objective “cure cancer.” We still pursue cures, but with each kami having a non-fungible purpose. The scope, budget, and latency caps inherent in this configuration mean that capability gains don’t translate into open‑ended optimization.
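To make the shape of those caps concrete, here’s a minimal sketch; the kami names, budget units, and numbers are all hypothetical illustrations rather than anything specified by 6pack.care or d/acc:

```python
# Illustrative sketch only: bounded, non-fungible "kami" task runners with
# explicit scope, budget, and latency caps. All names and numbers here are
# hypothetical, not an actual 6pack.care specification.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class KamiCaps:
    scope: frozenset[str]      # the only task types this kami may accept
    compute_budget: float      # e.g. GPU-hours; not replenished automatically
    max_latency_s: float       # per-request wall-clock limit

class Kami:
    def __init__(self, name: str, caps: KamiCaps, run_fn: Callable[[str], str]):
        self.name, self.caps, self._run = name, caps, run_fn
        self._spent = 0.0

    def handle(self, task_type: str, payload: str, est_cost: float) -> str:
        # Refuse anything outside the declared scope: no open-ended objectives.
        if task_type not in self.caps.scope:
            raise PermissionError(f"{self.name} does not accept '{task_type}' tasks")
        # Hard budget cap: capability gains can't turn into unbounded optimization.
        if self._spent + est_cost > self.caps.compute_budget:
            raise RuntimeError(f"{self.name} budget exhausted")
        self._spent += est_cost
        return self._run(payload)   # latency enforcement (max_latency_s) omitted here

# Two kamis with non-fungible purposes; neither holds a "cure cancer" objective.
folding_kami = Kami("protein-folding-sim", KamiCaps(frozenset({"fold"}), 500.0, 60.0), lambda p: f"folded:{p}")
sharing_kami = Kami("cross-lab-sharing", KamiCaps(frozenset({"share"}), 50.0, 5.0), lambda p: f"shared:{p}")
```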
I’d be happy to have an on-the-record conversation, co-edited and published under CC0 to SayIt next Monday at 1pm, if you agree.
The thing I have in mind as north star looks closest to the GD Challenge in scope, but somewhat closer to the CIP one in implementation? The diff is something like:
Focus on superintelligence, which opens up a large possibility-space while rendering many of the problems people usually focus on straightforwardly solved (consult rigorous futurists to get a sense of the options).
Identify cruxes in how people’s values might end up, and use the kinds of deliberative mechanism design in your post here to help people clarify their thinking and find bridges.
I’m glad you’re seeing the challenges of consequentialism. I think the next crux is something like: my guess is that consequentialism is a weed which grows in the cracks of any strong cognitive system, and that without formal guarantees of non-consequentialism, any attempt to build an ecosystem of the kind you describe will end up being eaten by processes which are unboundedly goal-seeking. I don’t know of any write-up that hits exactly the notes you’d want here, but some maybe-decent intuition pumps in this direction include: The Parable of Predict-O-Matic, Why Tool AIs Want to Be Agent AIs, Averting the convergent instrumental strategy of self-improvement, Averting instrumental pressures, and other articles under Arbital’s corrigibility section.
I’d be open to having an on-the-record chat, but it’s possible we’d get into areas of my models which seem too exfohazardous for public record.
Great! If there are such areas, in the spirit of d/acc, I’d be happy to use a local language model to paraphrase them away and co-edit in an end-to-end-encrypted way to confirm before publishing.
Fwiw, I disagree that points 1 and 3 of your quick headline are settled; I think there are enough unanswered questions there that we can’t be certain whether or not a multipolar model could hold.
For 1, I agree with the convergence claims, but the speed of that convergence is in question. There are fundamental reasons to believe that we get hierarchical agents (e.g. this from physics, or shard theory). If you have a hierarchical collective agent, a good question is how it gets to maximising and becoming a full consequentialist, which it will for optimality reasons. I think one of the main ways it smooths out the kinks in its programming is by running into prediction errors and updating on them, so the question becomes how fast it runs into those prediction errors. To obtain prediction errors you need to do some sort of online learning to update your beliefs. But the energy cost of that online learning scales pretty badly if you’re doing something like classic life does, only with a really large NN. Basically, there’s a chance that if you scale a network to very high computational power, updating that network gets a lot more expensive, so if you want the most bang for your buck you end up with something more like Comprehensive AI Services: a distributed system of more specific learners forming a larger learner.
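(A rough back-of-envelope sketch of that cost intuition, using the common ~6 FLOPs per parameter per token rule of thumb for a forward-plus-backward pass; the parameter counts are hypothetical:)

```python
# Back-of-envelope comparison (my own illustration): cost per token of online
# updates for one monolithic model vs. routing each token to one of k smaller
# specialists. Assumes ~6 FLOPs/parameter for a forward + backward pass.
N = 1e13          # parameters in the monolithic learner (hypothetical)
k = 100           # number of specialists in the distributed alternative

monolith_update = 6 * N              # every online update touches all N params
specialist_update = 6 * (N / k)      # only the routed specialist gets updated

print(f"monolith:   {monolith_update:.1e} FLOPs per updated token")
print(f"specialist: {specialist_update:.1e} FLOPs per updated token "
      f"({monolith_update / specialist_update:.0f}x cheaper)")
```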
Then you can ask what the difference is between that distributed AI and human collective intelligence. There are arguments that the AIs will just form a super-blob through different forms of trade, yet how is that different from what human collective intelligence is? (Looking at this right now!)
Are there forms of collective intelligence that can scale with distributed AI and can capture AI systems as part of their optimality (e.g. group selection due to inherent existing advantages)? I do think so, and I think that really strong forms of collective decision-making potentially give you a lot of intelligence. We can then imagine a simple verification contract: an AI gets access to a collective intelligence if it behaves in certain ways. It’s worth it for the AI because that’s a much easier route to power, yet it also agrees to play by certain rules. I don’t see why this wouldn’t work, and I would love for someone to tell me why it doesn’t!
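Here’s a toy sketch of what such a verification contract could look like mechanically; it’s entirely hypothetical and glosses over the hard part (actually verifying behaviour), but it shows the gating structure I have in mind:

```python
# Toy sketch only (hypothetical mechanism): a collective intelligence grants an
# AI access to pooled resources if and only if its audited behaviour keeps
# satisfying agreed rules; verified violations revoke access.
from dataclasses import dataclass, field

@dataclass
class Contract:
    rules: set[str]                  # behaviours the AI agrees to uphold
    violations: int = 0
    max_violations: int = 0          # zero tolerance in this toy version

@dataclass
class Collective:
    members: dict[str, Contract] = field(default_factory=dict)

    def admit(self, agent_id: str, agreed_rules: set[str], required: set[str]) -> bool:
        # Admission requires agreeing to every rule the collective demands.
        if not required <= agreed_rules:
            return False
        self.members[agent_id] = Contract(rules=agreed_rules)
        return True

    def report_audit(self, agent_id: str, violated_rule: str) -> None:
        contract = self.members.get(agent_id)
        if contract and violated_rule in contract.rules:
            contract.violations += 1

    def has_access(self, agent_id: str) -> bool:
        contract = self.members.get(agent_id)
        return contract is not None and contract.violations <= contract.max_violations

collective = Collective()
collective.admit("ai-1", {"no-power-seeking", "share-discoveries"}, required={"no-power-seeking"})
print(collective.has_access("ai-1"))          # True while audits stay clean
collective.report_audit("ai-1", "no-power-seeking")
print(collective.has_access("ai-1"))          # False after a verified violation
```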
For 3, why can’t RSI be a collective process, given the above arguments about collective versus individual learning? If RSI is a bit like classic science, there might also be thresholds at which scaling becomes less fast. I feel this is one of the less-talked-about points in superintelligence: what is the underlying difficulty of RSI at higher levels? From an outside-view plus black-swan perspective, it seems very arrogant to believe the difficulty scales linearly.
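(As a toy model of that sensitivity, entirely my own construction: the same recursive-improvement loop gives a slow crawl, steady exponential growth, or a runaway depending on an assumed returns exponent we simply don’t know:)

```python
# Toy model (illustrative only): RSI trajectories under different assumptions
# about how hard further progress gets at higher capability levels.
def capability_after(returns_exponent: float, steps: int = 60) -> float:
    capability = 1.0
    for _ in range(steps):
        # Each step, research output scales as capability**returns_exponent.
        capability += 0.1 * capability**returns_exponent
        if capability > 1e9:          # treat this as a runaway takeoff
            return float("inf")
    return capability

for exponent in (0.5, 1.0, 1.5):      # sub-linear, linear, super-linear returns
    print(f"returns exponent {exponent}: capability after 60 steps = {capability_after(exponent):.1f}")
```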
Some other questions are: What types of knowledge discovery will be needed? What experiments? Where will you get new bits of information from? How will these distribute into the collective memory of the RSI process?
All of these things determine the unipolarity or multipolarity of an RSI process, so we can’t be sure how it will happen, and there’s probably also path dependence based on the best alternatives available at the initial conditions.
I would greatly appreciate it if you could post the transcript of the call on LW.