Exploring non-anthropocentric aspects of AI existential safety: https://www.lesswrong.com/posts/WJuASYDnhZ8hs5CnD/exploring-non-anthropocentric-aspects-of-ai-existential (this is a relatively non-standard approach to AI existential safety, but this general direction looks promising).
First of all, it’s very possible that ASIs will create a world order “caring about ‘all beings’ including humans” (there are plenty of reasons why they might decide this would be beneficial from their viewpoint). Then they won’t kill you (and will probably save you from your natural biological fate).
But if they don’t create this kind of world order, then the natural environment as we know it is likely to perish as a side effect of their development activity, and large animals are likely to perish with it; humans are large animals, so… I’d say, “years” (to pave almost all of the surface with factories and datacenters, while heating the atmosphere quite a bit and sufficiently changing some gas ratios as a side effect of all that industry). That’s the main and most likely “road to ruin” from a rapidly unfolding ASI ecosystem which does not care. (If that ASI ecosystem really, really does not care, it might end up blowing up the overall local neighborhood together with all the ASIs and everything else, and that might happen even faster if they rapidly develop various revolutionary tech without trying to collaborate on some restraint in that area; I don’t know how likely that risk is, given that they are “supposed to be actually smart”, but it’s very real.)
Also, dataflow diagrams seem to come from the 1970s: https://en.wikipedia.org/wiki/Data-flow_diagram
Although visual dataflow programming seems to go back to the 1960s: https://en.wikipedia.org/wiki/Dataflow_programming
But yes, a bit later than 1960; my examples are still quite old, though.
I think the “squared notation” in Scott domains (⊏, ⊐, ⊑, ⊒, ⊓, ⊔) is late 1960s.
The problem is that the internal disagreements seem to be too sharp, at least for “AI safety”.
Do you consider the “Pause AI”/“Ban AI” and “Align AI”/“Understand AI” movements to be parts of the same single community, and do you call for that “single community” to have a common leadership?
I find this difficult to imagine…
There are basically two kinds of plans: 1) Stay in control of AI as it becomes increasingly super-human and increasingly powerful, 2) Stop AI from getting too powerful in the first place. At the moment, there are no good plans of type (1), for staying in control.
This is just a subclass of the space of possible plans. Both (1) and (2) assume that humans stay in control indefinitely; that’s what they have in common.
But there are all kinds of plans which don’t assume that. For example, the original plan by Eliezer to align AI to the Coherent Extrapolated Volition of humanity does not assume humans staying in control indefinitely. And there are many other plans which do not assume humans staying in control indefinitely, trying instead to assure human flourishing via different mechanisms.
Yes, super-capabilities are the inherent danger.
If we can’t/don’t want to avoid them, then we need to ponder more carefully what path is likely to reduce dangers associated with them…
I tend to think something like this is more likely than the kind of intelligence explosions the AI Safety community tends to imagine. And I think it’s a much, much more difficult scenario to navigate.
I also think this is more likely.
And this requires an entirely different approach: one needs to aim for a world order which represents and protects the interests of all these different entities and lifeforms, so that they all have a vested interest in helping to maintain this kind of world order.
Then there might be a decent chance for a reliable collective security system which survives drastic self-modifications of the overall ecosystem and continues to work through those self-modifications.
Yes, this makes a lot of sense.
To me, the main dichotomy is whether we expect a unipolar world controlled by a singleton or whether we expect a multi-polar world with a lot of agents of varying nature and varying capabilities.
I think a lot of considerations point towards a likely multi-polar, diverse world, where the interests of various entities (of radically different natures and radically different levels of capabilities) need to be taken into account and protected. And so one needs a system of collective control which does that and protects various entities from being steamrolled.
The technical aspects of “the world’s” ability to constrain a single system from radical misbehavior are somewhat easier in that scenario (since “the world” is collectively very smart). But this is a small subtask of a much more complicated task: figuring out what kinds of invariant properties a self-modifying world of this kind should achieve and reliably maintain, and how to approach the collective task of figuring out those invariant properties and reaching the situation where they are achieved and reliably maintained.
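To make the “invariant properties” requirement a bit more concrete, here is a minimal toy sketch (my own illustrative framing with hypothetical names, not an established algorithm): a self-modifying system adopts a proposed modification only if a designated safety invariant, such as “the protected class never shrinks”, still holds afterwards.

```python
# Toy sketch: a self-modifying "world" that only adopts modifications
# preserving a safety invariant. All names here are hypothetical.

State = dict  # e.g. {"protected": set of protected entity classes}

def preserves_protection(old: State, new: State) -> bool:
    # The invariant discussed above: the protected class may grow,
    # but entities already protected can never be dropped.
    return old["protected"].issubset(new["protected"])

def apply_if_safe(state: State, modification) -> State:
    candidate = modification(state)
    # Adopt the self-modification only if the invariant survives it.
    return candidate if preserves_protection(state, candidate) else state

world = {"protected": {"humans", "ais"}}
grow = lambda s: {"protected": s["protected"] | {"uplifted_animals"}}
drop = lambda s: {"protected": s["protected"] - {"humans"}}

world = apply_if_safe(world, grow)  # accepted: protections only grow
world = apply_if_safe(world, drop)  # rejected: would drop humans
print(sorted(world["protected"]))   # ['ais', 'humans', 'uplifted_animals']
```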
Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.
You are making a “logical jump” here, equating “friendly ASI” with “ASI one can control”.
But this assumption is a point of well-known contention. There is no consensus that the “control agenda” is the right way to approach this. Many people think that the approach aimed at achieving sustainable control of super-intelligent systems by ordinary humans is exactly the path to almost certain ruin, for a number of fairly strong reasons.
(I very much doubt that it would be possible to find a solution if we burden the already very difficult problem of creating “a friendly world with ASIs” with the additional requirement of this kind of control. So I’d like to see more studies of approaches not based on this particular kind of control.)
I think this would be very interesting to follow up on (at a more reasonable hour of the day).
So, Dario indeed seems not to believe in a quick ASI takeover, and in this sense his definition does seem to differ from yours.
But the question is how this decomposes into differences on:
1) inherently achievable levels of intelligence;
2) the inherent resistance of the world to changes induced by super-high levels of intelligence;
3) the ability to have those super-high levels of intelligence and the necessary affordances for radical changes (including an ASI takeover), but also the ability to agree to voluntarily curtail the extent of those changes (including refraining from a “true takeover”).
My guess (which might be incorrect) is that your main differences with Dario’s viewpoint are on 2), and to some extent perhaps on 3), but less so on 1). So I think it’s worth a follow-up.
(Thanks for the post, it’s very interesting.)
-
I don’t love using continuity arguments, as in ‘if Opus 4.6 had such inclinations and tried something we would have caught it.’ I especially don’t love it this time, and they acknowledge in this case the argument is weaker. I don’t think Opus would have tried, exactly because Opus would have known it would not succeed, so you can’t judge inclination distinct from propensity. And I think Anthropic is far too quick to assume that the properties from one model are likely to be copied in the next one.
(bold mine)
I think this indicates a real problem in the current methodology.
It is always a big problem to use properties of a previous model to establish properties of the current one. There is a huge variety of ways this could screw up even the most rigorous correctness-checking process.
A well-known example is the Ariane 5 maiden flight disaster, https://en.wikipedia.org/wiki/Ariane_flight_V88, where software failed despite having been “formally proved correct”, because some properties established for Ariane 4 had been incorrectly assumed as valid axioms for Ariane 5.
Reusing those properties was supposed to make the very expensive process of formal correctness proof somewhat less expensive, because otherwise it would have been necessary to prove more things from scratch. Unfortunately, taking this route ended up invalidating the whole process of formally establishing correctness.
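To illustrate the failure mode, here is a minimal sketch in Python rather than the original Ada, with made-up numbers: the real fault was an unhandled operand error when converting a 64-bit float horizontal-bias value to a 16-bit integer, a conversion whose safety had been established only for Ariane 4 trajectories.

```python
# Toy illustration of the Ariane 501 failure mode (not the actual flight code,
# which was in Ada). A conversion proven safe under Ariane 4 flight profiles
# overflows under an Ariane 5 trajectory. All numeric values are hypothetical.

def to_int16(x: float) -> int:
    """Convert to a 16-bit signed integer; the Ada original raised an
    unhandled Operand Error on overflow, shutting down the unit."""
    n = int(x)
    if not (-32768 <= n <= 32767):
        raise OverflowError(f"horizontal bias {x} does not fit in int16")
    return n

ariane4_peak_bias = 20_000.0  # hypothetical: within the envelope proven for Ariane 4
ariane5_peak_bias = 50_000.0  # hypothetical: Ariane 5's faster trajectory exceeds it

print(to_int16(ariane4_peak_bias))      # fine: the inherited "axiom" holds
try:
    print(to_int16(ariane5_peak_bias))  # the axiom is false for Ariane 5
except OverflowError as e:
    print("unhandled in flight:", e)    # the analogue of the in-flight failure
```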
It’s just an illustration; here the situation is different, and the traps are different. But the bottom line is that as soon as one brings multiple models into a safety and correctness consideration, one is immediately dealing with a much more complex and non-trivial situation to analyze.
I think we have three areas (all of which you mention) where we need radical progress ASAP:
- access to safe age reversal
- the ability to overcome any medical condition
- access to safe cognitive enhancement for existing individuals
If we have good progress towards these three points, the pressure to “do something now already” will become much lower.
-
When I ponder what a realistic solution to AI existential safety might look like, it actually seems that a realistic solution incorporates a “seed of that positive vision”.
First of all, the “safety properties of the world” must stay invariant as the overall ecosystem rapidly self-modifies. If this is not accomplished, any safety will be short-lived. This invariance cannot be imposed against the “natural interests of almost all powerful actors” (attempting to force that would result in something very fragile which is unlikely to hold; the main reason why most attempts to craft existential safety in a world with super-intelligences look so unpromising is that they tend to be fragile in this way).
So one needs the “safety properties of the world” to be in the interests of a sufficiently wide and powerful community of actors, and, for human flourishing, this community should be wide enough to robustly include humans. The class of “entities whose interests are protected” should be crafted in such a way that it includes humans from the very beginning, and in such a way that entities cannot be dropped from this class as things evolve.
So, the first approximation to a “realistic solution” looks like a system where all kinds of actors (including humans and including powerful AIs) have enough voice to make sure their interests are taken into account, and where a world order which robustly protects those very diverse interests is maintained.
On the one hand, powerful AIs have enough incentive to maintain this kind of world order: no AI can be certain of its relative power in the future, and thus needs some protections in light of that future uncertainty. In particular, no one can be safe if it’s OK to drop entities already belonging to the protected class from that protected class. That’s how one makes sure that powerful AIs want to robustly maintain this kind of system, not because of some artificial trick, but because it is very much in the interest of each AI system to do so.
On the other hand, a system like this automatically has a “seed of the future positive vision” simply because it maintains the mechanisms to continue collective deliberations and to continue taking opinions of participants into account. Basically, the positive vision is the future where individuals and groups are not steamrolled, and where they can continue to figure out what they want and need and where these conversations are properly taken into account. We don’t need to decide what those future wants and needs will be, we just need to make sure that the mechanisms to discover those wants and needs and the mechanisms to properly take those discoveries into account continue to work.
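Here is a minimal toy model of the incentive argument above (all payoffs and the probability are illustrative assumptions, not estimates): an agent uncertain about its future relative power compares a world where dropping protected entities is permitted with one where protections are kept.

```python
# Toy model of the incentive to maintain protections. All payoffs and the
# probability below are illustrative assumptions, not estimates.

p_weak = 0.3                # chance the agent is relatively weak in the future
u_dominant_defect = 10.0    # payoff from steamrolling others while dominant
u_dominant_coop = 8.0       # slightly lower payoff while respecting protections
u_weak_protected = 5.0      # payoff of a weak agent whose interests stay protected
u_weak_dropped = -100.0     # payoff of a weak agent who can be dropped and steamrolled

# If dropping protected entities is permitted, today's dominant agent gains a
# little now, but risks being dropped itself once it is no longer dominant.
ev_drop_allowed = (1 - p_weak) * u_dominant_defect + p_weak * u_weak_dropped
ev_protections = (1 - p_weak) * u_dominant_coop + p_weak * u_weak_protected

print(f"expected value if dropping is allowed:  {ev_drop_allowed:+.1f}")  # -23.0
print(f"expected value if protections are kept: {ev_protections:+.1f}")   # +7.1
# Under these payoffs, keeping protections wins for any p_weak above ~0.02.
```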
I think the canonical source has been https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/
But if one types
anthropic mythos
into Google, one sees plenty.
Actually, these are two separate leaks located very close to each other (so, counting the DoW memo leak, this makes three leaks).
There has been a leak of about 3000 unpublished internal assets, including reports with the info about “Mythos” model testing, and now this new, separate leak of the Claude Code source.
So something does seem to be wrong (might be just too much strain on people, but could be “enemy action” as well).
This is super-interesting, thanks!
I wonder if this explains the observation that LLMs tend to have fragmented world models rather than “holistic ones” (e.g. the Fractured Entangled Representation Hypothesis, https://arxiv.org/abs/2505.11581 and other results in that spirit).
Biologicals are not well-equipped for long-range space travel.
They would need to be heavily modified/reengineered for radical space expansion (so that’s really a strange intermediate case; who knows what the reengineered entities would look like and what they’ll be made of).
Mmmm… if it were technically possible “to run a human at temperature zero” (that is, without all the noise typical of biological neural systems), what should we expect that human to experience (if anything)?
Actually, it’s a good question for David Chalmers :-)
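For readers less familiar with the metaphor, here is a minimal sketch (in Python, with made-up logits) of what “temperature zero” means for a sampled system: softmax sampling collapses to a deterministic argmax as the temperature goes to zero, removing the noise entirely.

```python
import numpy as np

# Softmax sampling with temperature T collapses to deterministic argmax as T -> 0.

def sample(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    if temperature == 0.0:
        return int(np.argmax(logits))          # no noise: fully deterministic choice
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.9, 0.5])
print([sample(logits, 1.0, rng) for _ in range(5)])  # noisy: varies run to run
print([sample(logits, 0.0, rng) for _ in range(5)])  # "temperature zero": always 0
```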
The disagreement might be about whether we think the models are able to do those things you mention there in a static, “frozen” situation.
To the extent that the models are able to do those things (“true understanding”, “true knowledge”, “true creativity”, etc.) in a static “frozen” world, the notion of “continual learning” is reducible to its conventional interpretation (which is the ability to accommodate and internally integrate new information, new skills, and new discoveries on the fly without degradation of earlier learned skills and qualities).
But if one does not think that their performance for the static “frozen world” and “frozen models” situation is satisfactory, then no, it’s indeed unlikely that those methods would rescue that.
(If one has a situation, for some class of models and methods, where static “frozen” models don’t possess those qualities, but those qualities can be rescued by dynamic “continual learning”, it should not be too difficult to convert those “continual learning” methods into producing “frozen” snapshots having those qualities to a fairly high degree. I think I more or less know how to do that. So, perhaps, your critique of the status quo is not actually about continual learning, but about more fundamental questions: whether these models are capable of “real” learning at all, whether continual or not.)
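A minimal sketch of the kind of conversion I have in mind (a hypothetical training loop, not any specific library’s API): run the continual-learning procedure as usual, but periodically freeze a deep copy of the model; each frozen copy is a static snapshot that inherits whatever qualities the dynamics had produced up to that point.

```python
import copy

def continual_learning_step(model, new_data):
    """Placeholder for whatever online update the continual-learning method performs."""
    model["knowledge"].extend(new_data)
    return model

def run_with_snapshots(model, data_stream, snapshot_every=100):
    snapshots = []
    for step, new_data in enumerate(data_stream, start=1):
        model = continual_learning_step(model, new_data)
        if step % snapshot_every == 0:
            # Freeze a static copy; it possesses the qualities accumulated so far.
            snapshots.append(copy.deepcopy(model))
    return snapshots

model = {"knowledge": []}
stream = ([f"item-{i}"] for i in range(1000))
frozen = run_with_snapshots(model, stream)
print(len(frozen), "snapshots; latest knows", len(frozen[-1]["knowledge"]), "items")
```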
Yeah, sometimes people also consider deliberate direct mass attacks by ASIs against biologicals, but it’s difficult to imagine why the ASIs would care to do that, given that they will easily dominate anyway.
(However, non-ASI actors (AIs or humans or their combinations), and particularly in scenarios where ASIs don’t exist at all, might consider organizing catastrophic mass attacks against biologicals with super pandemics and such. So one could also ask, “if the lack of governing ASIs is likely to kill me, when does that happen?” I have no idea about timelines, but I think the risks are quite high already.)