Designing for perpetual control
We don't have static software. We have a system which is dynamically learning, changing, rewriting code indefinitely. It's a perpetual motion problem we're trying to solve. In physics, you cannot create [a] perpetual motion device. But in AI, in computer science, we're saying we can create [a] perpetual safety device which will always guarantee that the new iteration is just as safe.
— Roman Yampolskiy, 2024
What kind of machine keeps modifying its internals based on inputs[1] from the world[2], enough for it to keep running autonomously in that world? A perpetual learning machine.
This machine keeps learning, and so keeps modifying itself into new states. But what if only a tiny fraction of the available modification trajectories[3] that allow this machine to keep running are safe? Given enough time, won’t it[4] end up on an unsafe trajectory – in states[5] that, expressed as outputs propagating as outside effects, lead to the end of 𝑥[6]?
How do you try to set the initial state of the machine such that it never moves into even a single modified state (any change in configuration, at any level) that precipitates the end of 𝑥?
Achieving this requires building in perpetual control. It would be like trying to build a perpetual motion machine, in some ways: initially, you need to build in components with some associated functioning, in hopes that this functioning will be conserved over time.
Consider that:
The motion machine stops running because it loses its energy (potential) over time through its interactions with its surroundings.[7] Engineers have tried to somehow redirect any energy expended by the machine back into the machine, but given real entropy/chaos that cannot be corrected for by causal mechanisms, this never works.[8]
The learning machine would not, in effect, stop running.[9] Instead, it would lose its potential to run as intended through its interactions with the world. Engineers can try to build in control mechanisms that correct for modifications that would otherwise erode intended functioning – moving the system back in line with how it was intended to run (at a minimum, not ending 𝑥). But given the complex evolutionary dynamics involved, this would not work.
“So what?” you might ask. “How exactly is this hypothetical relevant? No one is trying to build a perpetual learning machine, and no one is trying to build in perpetual control.”
Many readers are concerned about a machine that, if built, could take over so much of human work that it would effectively run itself. Once this machine runs by itself and effectively removes any threats, it could keep running for a long time.
Many also identify as longtermists. They worry about how such a machine would impact humanity over the very long term. They want machine learning engineers to build in highly reliable mechanisms that, at least, prevent the machine from causing humanity’s extinction.
But the longer the term you solve for, the more the problem resembles perpetual control.[10]
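To make that intuition concrete with a rough toy model (a deliberately crude sketch, assuming independent and identical per-step chances): if at each of $n$ successive modification steps there is a probability $\epsilon > 0$ that the machine shifts onto an unsafe trajectory, then the probability that it is still on a safe trajectory after $n$ steps is

$$P_{\text{safe}}(n) = (1-\epsilon)^{n} \;\to\; 0 \quad \text{as } n \to \infty.$$

The independence assumption is of course a simplification; the point is only that control has to hold at every step, not just once, and that any fixed per-step shortfall compounds over a long enough term.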
- ^
Inputs (any effects) received from the world will in turn depend on changes happening in that world, including the changes downstream from the machine’s previous outputs.
- ^
The process of modification can also be extrinsic – where the world’s effects cause changes to the machine’s connected components, in ways that still allow the machine to keep running autonomously (i.e. the implicit learning that results from evolutionary selection).
- ^
The number of available trajectories is potentially infinite, given that a partial trajectory splits into more at every point of nondeterministic transition (covering each combinatorially possible modification to the machine’s configurations that is reachable next over space and time).
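As a rough lower bound (a crude illustration, assuming a fixed branching factor for simplicity): if each nondeterministic transition offers at least $b \geq 2$ reachable modifications, then after $n$ transitions the number of distinct partial trajectories $N(n)$ satisfies

$$N(n) \;\geq\; b^{\,n} \;\to\; \infty \quad \text{as } n \to \infty.$$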
- ^
This machine could be set up to optimise for certain predefined outcomes, as constrained by its capacity to internally track and externally actuate the world toward those outcomes. But it cannot break physics, nor escape the physical dynamics that in general apply to it (only hypothetically, in its idealised form, would this machine perpetuate itself beyond the heat death of the universe).
- ^
The complete state of the machine can be made up of different kinds of configurations. For example, at a higher level there could be computable code, some of which could be crucial for e.g. representing control structures, reference values, or other concepts relevant to safe control. Such crucial code may initially change relatively rarely, or change mostly by getting more finely tuned. Meanwhile, other configurations would undergo more rapid or greater changes that learn from, and adapt to, their changing physical surroundings.
The question here is whether the initial stability of certain configurations – those seen as particularly important for the implementation of control – is enough to prevent any other (hidden, distributed, cascading) modifications from causing the machine to shift onto an unsafe trajectory.
h/t to Robert Kralisch for pointing out this question!
- ^
Here 𝑥 is something complex, and highly contingent on preexisting conditions in the world that the machine’s existence is not contingent on. It could be all humans on Earth. For the sake of this argument, the exact choice of 𝑥 does not really matter.
- ^
A perpetual motion machine must keep moving and (in the traditional sense of the term) perform work in the world (and thus lose energy to the world) given its initial internal energy, without drawing energy from any external source. This is impossible under the law of conservation of energy.
- ^
A perpetual motion machine is impossible under the second law of thermodynamics (which defines entropy). It is also impossible under the more fundamental law of conservation of energy, without which symmetry (i.e. actual logic) does not hold up in physics.
The term ‘impossible’, though, tends to raise debates about epistemics on this forum. Impossible means that something is 100% certain to be false. Many rationalists assume Bayesian statistics to be more fundamental to epistemics than the law of conservation of energy is (even though without conservation of energy, you cannot have consistent observations of a world in the first place, and materialism no longer works). They posit that you could always gain new observations that make you update your prior probability, and therefore could never soundly be 100% certain even about laws of the physical world.
For example, Eliezer Yudkowsky claimed there is some tiny probability that there are laws of physics we have yet to learn, based on which one could reach immortality (e.g. as a perpetual learning machine). This would require escaping the heat death of the universe, which in turn requires beating entropy.
When it comes to the existential problem of ‘could fully autonomous AI be controlled to stay safe?’, this kind of wishful workaround is not relevant. We cannot stake humanity on a “shred of hope” that a fundamental physical law or process turns out to be false – one that entire disciplines of science rely on being true and have never falsified. Here, ‘you cannot be 100% certain that it is impossible’ is a nitpick, which ends up distracting from working through relevant reasoning.
If you derive that there is a control inequality that is impossible to solve for, and moreover simulate over the long term (in theory, into perpetuity; in practice, over hundreds of years) that this results in the end of humans – based on fundamental, never-falsified premises – then other researchers need to try to verify this result. If, under rigorous verification, the logic is consistent and the premises are empirically sound, that should be enough to divert resources from AGI control research to stopping AGI development.
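As a minimal sketch of what such a long-term simulation could look like – a toy model in which the per-year failure probabilities are assumptions chosen purely for illustration, not results derived here:

```python
# Toy model: cumulative probability that at least one uncorrectable,
# x-ending failure occurs over a long horizon, assuming independent and
# identically distributed per-year failure chances (an assumption).

def cumulative_failure(per_year_probability: float, years: int) -> float:
    """Probability of at least one failure within `years`."""
    return 1.0 - (1.0 - per_year_probability) ** years

for p in (0.001, 0.01, 0.05):          # assumed per-year failure rates
    for horizon in (100, 500, 1000):   # 'hundreds of years'
        print(f"p={p}, years={horizon}: "
              f"P(failure by then) = {cumulative_failure(p, horizon):.3f}")
```

Even an assumed 0.1% yearly failure rate accumulates to roughly 63% over 1,000 years; the exact numbers matter far less than the monotonic accumulation.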
- ^
A perpetual learning machine is an idealisation of a learning machine that runs for a very, very long time. No learning machine will actually run into perpetuity, given entropy (i.e. it cannot beat the heat death of the universe), just as no ideally efficient Carnot engine will ever exist. But the Carnot cycle is a useful conceptual tool for thinking about engineering an actual engine (in a way that the concept of perpetual motion is not). Similarly, the perpetual learning machine is a useful tool for thinking about actual fully autonomous AI.
- ^
A good friend replied earlier on the draft:
I think this line of thinking is so incredibly important. I think the realization that controllability is impossible is the real “galaxy brain” epiphany – as it were – in AI safety. Many people talk about the control or alignment problem as a kind of one-off, single-step process: we have to align AGI. But, um, no you don’t! You have to align AGI, and AGI+, and AGI++, etc. etc. etc. And you’re telling me (here I’m speaking to the Yudkowskians) that at no point in the infinite iterative process yielding increasingly advanced AI systems will this result in an AI that’s not “aligned” with our values?
One notices an ambiguity here. Is the control in question “control of the ASI ecosystem by humans” (which can’t realistically be feasible – it’s impossible to maintain this kind of control for long, since less intelligent entities don’t have the competence to control much more intelligent entities) or “control of the ASI ecosystem by itself”?
“Control of the ASI ecosystem by itself” is tricky, but is it different from “control of humanity by itself”? The ecosystem of humans also seems to be a perpetual learning machine. So the same logic applies.
(The key existential risk for the ASI ecosystem is the ASI ecosystem destroying itself completely, together with its neighborhood, via various misuses of very advanced tech – a risk very similar to our own existential risk.)
That’s the main problem: more powerful intelligence ⇒ more powerful risks and more powerful capabilities to address risks. The trade-offs here are very uncertain.
One often focuses on the intermediate, asymmetric situation where the ASI ecosystem destroys humans but not itself. That intermediate situation needs to be analyzed and addressed; it is a risk which is very important for us.
But the main risk case needs to be solved first: the accumulating probability of the ASI ecosystem completely destroying itself and everything around it, and the accumulating probability of humanity completely destroying itself (and a lot around it). The asymmetric risk of the previous paragraph can then be addressed conditional on the risk of “self-destruction with collateral super-damage” being solved (this condition being satisfied should make the remaining asymmetric risk much more tractable).
The risks seem high regardless of the route we take, unfortunately. The perpetual learning machine (humanity) does not want to stop learning (and with good reason).