Is Constructor Theory a useful tool for AI alignment?

This is an attempt to make sense of some vague ideas and explain some intuitions I’ve had recently. Written in the spirit of How To Write Quickly While Maintaining Epistemic Rigor.

“Any headline that ends in a question mark can be answered by the word ‘no’.”—Betteridge’s Law of Headlines

Introduction

Recently, I have been thinking about constructor theory—a broad philosophical approach to providing explanations in physics. It was developed by David Deutsch, who claims that it has the potential to solve problems in physics that are hard to solve in other frameworks. When I read Deutsch’s paper, I had the intuition that some of the problems in physics that constructor theory was meant to solve were analogous to problems associated with AI alignment. I thought that if constructor theory was able to solve these problems in physics, it might be helpful in the field of AI alignment.

I have been thinking about this on and off for the last few months and now believe that constructor theory probably does not have much to offer the field of AI alignment. I came to this conclusion broadly for two reasons. Firstly, constructor theory has few (zero?) concrete tools that could be imported into the study of AI. Despite a lot of media coverage, constructor theory has had only limited success in solving the problems in physics that it set out to solve. It has produced a couple of interesting results, but they are very much results about physics that (as far as I can see) cannot be easily imported for studying AI. Secondly, and maybe more importantly, the analogies between constructor theory and AI alignment break down in some important ways once one starts digging a bit deeper.

In my estimation, this is a bad alignment idea. John Wentworth writes ‘it takes most newcomers ~5 years to go from their initial Bad Idea to actually working on something plausibly useful’ and suggests that ‘progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans’. In this spirit, in this piece, I will try to steelman the intuitions that constructor theory might be useful for AI alignment, and then point out why I now think that these intuitions are wrong.

The ‘Dynamical Laws’ Approach to Physics

Often, in physics, we describe systems using dynamical laws and initial conditions. The initial conditions specify the state of a system at a particular time, and the dynamical laws specify how that state evolves. For example, in classical mechanics, one can completely describe a system using the positions and momenta of each particle at a time t (the initial conditions) and then use dynamical laws (such as Newton’s laws, or Hamilton’s or Lagrange’s equations) to find out what the positions and momenta of each particle will be at any time in the future. Similarly, in quantum mechanics, the initial conditions are specified by a quantum state vector (or density operator) and the dynamical law which governs its evolution is the Schrödinger equation (or the von Neumann equation). David Deutsch calls this approach to physics ‘the prevailing conception of physics’. This terminology has the potential to cause confusion, so I will simply call it ‘the dynamical laws approach to physics’ or ‘the dynamical laws picture’. This mode of explanation has been hugely successful and widely used in physics, including in quantum mechanics, classical mechanics and general relativity [1].
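To make the picture concrete, here is a minimal numerical sketch (my own illustration, not from Deutsch): initial conditions plus a dynamical law determine the state at every later time. The example uses Hamilton’s equations for a simple harmonic oscillator, stepped forward with the symplectic Euler method.

```python
import math

# Initial conditions: position and momentum of one particle at t = 0.
x, p = 1.0, 0.0
m, k = 1.0, 1.0          # mass and spring constant
dt, steps = 0.001, 1000  # integrate up to t = 1

# Dynamical law: Hamilton's equations dx/dt = p/m, dp/dt = -k*x,
# stepped forward with the (symplectic) Euler method.
for _ in range(steps):
    p -= k * x * dt
    x += (p / m) * dt

# Exact solution for these initial conditions: x(t) = cos(t).
print(x, math.cos(1.0))  # the numerical state closely tracks the exact one
```

Given the initial conditions and the law, nothing else is needed: the whole trajectory follows.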

Deutsch claims that there are certain problems in physics which are difficult or impossible to solve using the dynamical laws approach. I’ll list a few of these here. My treatment of each will be brief and is not meant to be definitive.

Example Problem 1: Loschmidt’s Paradox

All known microscopic dynamical laws (such as those that underpin classical and quantum mechanics) are time-reversible (meaning that taking a physical evolution and ‘reversing’ the direction of time leads to an equally valid physical evolution). However, the second law of thermodynamics is time-irreversible. Since the macroscopic systems which obey the second law are composed of microscopic components, all of which must obey the reversible dynamical laws, we have reached a paradox: it should not be possible to derive an irreversible process from time-symmetric dynamics. This is known as Loschmidt’s Paradox. Some proposals claim to have solved this paradox by coarse-graining or averaging over physical states. For example, one can describe the second law in terms of the increase of entropy, with entropy understood as a measure of an observer’s uncertainty (see eg. this piece). These solutions are very elegant, but they also make thermodynamics into a claim about knowledge. Some people are satisfied with this kind of solution to Loschmidt’s paradox (I am!) but others, like Deutsch, are not, since it makes the second law somewhat subjective. Either way, it is important to note that these solutions rely on adding other elements (such as averaging, or some kind of observer) that are not present when purely describing systems using dynamical laws. The dynamical laws approach, on its own, is inadequate to address this problem.

Example Problem 2: Describing Information, Computability, and other Abstract and Emergent Properties

If the complete characterisation of a system consists of knowing its dynamical laws and evolution, then how do we account for emergent properties, such as information? If we wish to incorporate concepts like information into physics, the dynamical laws approach makes this hard. The dynamical laws picture can perfectly describe the evolution of each transistor in a computer without ever referring to concepts like ‘bits’ or ‘information’, yet these abstract concepts are often useful to us, allow us to understand the computer in a different way, and seem to have implications for the physical behaviour of the computer. This is analogous to Douglas Hofstadter’s thought experiment from I Am a Strange Loop. In it, Hofstadter asks us to imagine a domino computer (which he calls a ‘domino chainium’), set up to compute whether or not the number 641 is prime. A particular domino will fall if and only if the computation finds a divisor of 641, meaning that it is not prime. Hofstadter then poses the question ‘Why does this domino not fall over?’ and examines some possible answers:

Let me contrast two very different types of answer that someone might give. The first type of answer — myopic to the point of silliness — would be, “Because its predecessor never falls, you dummy!” To be sure, this is correct as far as it goes, but it doesn’t go very far. It just pushes the buck to a different domino, and thus begs the question.

The second type of answer would be, “Because 641 is prime.” Now this answer, while just as correct (indeed, in some sense it is far more on the mark), has the curious property of not talking about anything physical at all. Not only has the focus moved upwards to collective properties of the chainium, but those properties somehow transcend the physical and have to do with pure abstractions, such as primality.

...

The point of this example is that 641’s primality is the best explanation, perhaps even the only explanation, for why certain dominos did fall and certain other ones did not fall.

Furthermore, because the 641-primality argument does not refer specifically to the dynamical laws of the situation, it can easily be extended to explain different physical instantiations of computers. One could look for the prime factors of 641 using a computer made out of dominos, transistors, or soldier crabs. These computers would all have very different descriptions in terms of dynamical laws, but some version of the “because 641 is prime” argument would apply to all of them. There seem to be some properties of these systems that are ‘substrate independent’, meaning that they apply regardless of the material (substrate) which is being used. Note that one could completely describe the behaviour of the domino computer in terms of its dynamical laws and initial conditions, which would explain why the domino does not fall, but one would miss out on the ‘deeper’ explanation in terms of the primality of 641.
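To make the substrate-independence point concrete, here is a minimal sketch (my own illustration, not Hofstadter’s): the abstract computation that any of these substrates implements, with the ‘signal domino’ standing in for the output.

```python
def is_prime(n: int) -> bool:
    """Trial division: the abstract computation that any substrate
    (dominos, transistors, soldier crabs) could implement."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# The signal domino falls iff the computation finds a divisor of 641.
signal_domino_falls = not is_prime(641)
print(signal_domino_falls)  # False -- the domino stays standing because 641 is prime
```

Whatever the substrate, the explanation for the output is the same: 641 is prime.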

Deutsch worries that we are making a similar mistake more generally in physics. Despite being abstract, emergent and substrate-independent, information seems to be important in physics (for example, in the black-hole information paradox, or in describing the properties of a physically-instantiated computer). In the dynamical laws approach, information is relegated to an epiphenomenon, or emergent property. Deutsch is concerned that, by treating information in this way, we might be missing out on important ways of explaining the behaviour of the physical world.

Example Problem 3: Reasoning about Hybrid Systems

Our best theory of gravity (general relativity) is incompatible with our best theory of everything else (quantum mechanics/​quantum field theory). For practical purposes, general relativity provides a set of laws governing the evolution of large-scale systems (where the features of quantum mechanics which contradict it become negligible), and quantum mechanics provides the laws governing small-scale systems (where gravity, and hence the influence of general relativity, is negligible). Soon, it may be possible to create ‘hybrid systems’, where both gravity and quantum effects are non-negligible. In the dynamical laws approach, we do not know how such systems will behave, because we do not know what the dynamical laws will be. Is there a way to reason about such systems, in the absence of dynamical laws?

Example Problem 4: The Initial State Problem

In the dynamical laws approach, the state of any system can be described as the result of an evolution, according to dynamical laws, starting from an initial state. The initial state of a system must be provided by fiat. If we wish to explain why the initial state is how it is using the dynamical laws approach, we can only explain it as the result of evolution from a previous initial state. If we wish to explain that state, we again must explain it in terms of evolution from an even earlier state. Tracing this process back leads us to the initial state of the universe (or alternatively, there is no initial state of the universe and we keep tracing back forever). But what is the initial state of the universe, and why is it the way that it is? In the dynamical laws approach, the initial state must be plugged in as a parameter of the explanation. It cannot be explained in terms of evolution from an earlier state, as there was no earlier state. If we wish to explain the initial state of the universe, Deutsch claims that we need a mode of explanation different from the dynamical laws approach.

Other Problems

Deutsch lists two other features of reality which he claims cannot be explained using the dynamical laws picture: the ‘Principle of Testability’ and the ‘Turing Principle’. I don’t understand the subtleties of these very well, but I’ll briefly describe them here. The Principle of Testability is the assumption, often made in science, that it is possible to test the laws of nature (or, as Deutsch puts it, ‘laws of nature do not make their own testing, or each other’s, an impossible task’). As far as I understand it, Deutsch would like to elevate this assumption to the status of a physical principle. He would then like an explanation of this principle, and claims that the dynamical laws approach is not well suited to provide one.

For Deutsch, the Turing Principle is the claim that ‘a computer capable of simulating any physical system is physically possible’. He claims that this fact needs an explanation, since there are vastly more non-computable functions than computable ones. Furthermore, Deutsch claims that the Church-Turing thesis is not sufficient to explain this since, on his view, the mathematics and logical rules of inference used to describe it are not independent of the laws of physics. He argues that ‘[d]ifferent laws of physics would in general make different functions computable and therefore different mathematical assertions provable’. Thus, statements about computers and computability are really statements about physics, and the Turing Principle requires a physical explanation.

I’m not sure that I find either of these problems particularly compelling (What if there are some physical laws that can’t be tested? Maybe there are large swathes of physical reality described by uncomputable functions, but we don’t notice them as we lack the tools to explain them?), but they are included here to give an idea of the scale of the problem that Deutsch sees with the dynamical laws approach to physics.

Section Conclusion

Deutsch argues that these problems with the dynamical laws picture indicate that we might need another approach to formulating the laws of physics, and his proposal for such an approach is constructor theory. Note, however, that just because these problems are hard to solve in the dynamical laws approach, it does not mean that physicists have no use for these concepts. When I have discussed this before, I was accused of claiming that the second law of thermodynamics is not part of physics. This is not what I am saying (or what David Deutsch is saying)! Indeed, I think that the claim being made is in fact the opposite. In practice, physicists do invoke concepts like information, entropy, and the second law of thermodynamics regularly and with great success. They seem to be important for understanding the world, and yet a description of a system in terms of its initial conditions and dynamical laws will not mention them at all. The claim being made is that this mismatch will prevent us from making progress in some areas of physics.

Constructor Theory

Constructor theory is an approach to physics which aims to solve some of these problems. It is proposed as an alternative to the dynamical laws approach to physics, though it does not require that all of the successes of the dynamical laws approach be thrown out. Deutsch claims that some of the problems of the dynamical laws approach come from the fact that, given a set of initial conditions, this approach can tell you what will happen in a particular situation, but it often struggles to capture notions of what is in principle possible or impossible. He gives the example of describing why a particular perpetual motion machine will not work. The dynamical laws approach would tell us that the machine won’t work because the torque on one of the axles isn’t large enough, but any physicist would just tell you that it is impossible to build a perpetual motion machine and be done with it. The explanation ‘it is impossible to build a perpetual motion machine’ is a deeper statement about the universe and rules out a vast number of physical phenomena. While the dynamical laws approach tells us that this particular perpetual motion machine will not work, the claim ‘it is impossible to build a perpetual motion machine’ rules out all possible candidate perpetual motion machines that we could try to build in the future. Similarly, Deutsch points out that, while Loschmidt’s paradox makes it hard to describe the second law of thermodynamics in terms of dynamical laws, the second law can also be framed as the statement ‘it is impossible to engineer a cyclic process which converts heat entirely into work’. Again, this is a situation where a statement about the possibility/​impossibility of certain physical transformations seems to provide a richer description of the world than the dynamical laws picture. Deutsch loosely uses the term ‘counterfactual’ to describe this kind of statement about what is in principle possible or impossible. This terminology has the potential to cause confusion, as the term has other, slightly different uses, so I will not use it here. Instead, at the risk of making the text slightly clunky, I will just describe these as ‘possibility/​impossibility statements’ [2].

As the examples above have hopefully shown, possibility/​impossibility statements are widely used in physics. Constructor theory is an attempt to formalise statements of this kind and make them precise. Often, these kinds of statements are taken to be ‘downstream’ of dynamical laws explanations. For example, one might claim that the statement ‘it is impossible to make a perpetual motion machine’ is true as a result of the dynamical laws governing the machine. This is true in one direction: if the dynamical laws were sufficiently different, then maybe they would allow perpetual motion machines to be built. But Deutsch inverts this hierarchy between possibility/​impossibility statements and dynamical laws explanations. Rather than claiming that dynamical laws restrict what is possible or impossible, we could claim that the possibility/​impossibility of certain transformations is more fundamental than dynamical laws, and that the statement ‘it is impossible to make a perpetual motion machine’ restricts the form of the dynamical laws which are allowed in the universe. Deutsch claims that the dynamical laws picture cannot be fundamental, due to the aforementioned problems which he believes he has identified with it. As a result, constructor theory, in the form of possibility/​impossibility statements, would underlie all existing theories and explain the form of all physical dynamical laws. Just in case you think that I am exaggerating the scope of this claim, here is a quote from Deutsch:

Other theories specify what substrates and tasks exist, and provide the multiplication tables for serial and parallel composition of tasks, and state that some of the tasks are impossible, and explain why. Constructor Theory provides a unifying formalism in which other theories can do this, and its principles constrain their laws, and in particular, require certain types of task to be possible. I shall call all scientific theories other than constructor theory subsidiary theories.

This is a big claim: all other scientific theories (including quantum mechanics and general relativity) are relegated to the banner of ‘subsidiary theories’! These theories (and the dynamical laws associated with them) would flow from constructor-theoretic principles which specify what is possible or impossible. In the concluding paragraph, Deutsch writes with slightly more modesty:

The principles of constructor theory that I have proposed may be false… But if the idea as a whole is false, something else will have to remedy the deficiencies of the prevailing conception. Something else will unify emergent-level laws such as the Turing principle and the principle of testability with the other laws of nature. Something else will provide an exact statement of the second law of thermodynamics, and a full statement of the content of conservation laws. A different approach will generalise the theory of computation and von Neumann’s constructor theory, and support laws about substrate-independent quantities such as information.

I think that this is the best angle from which to approach constructor theory. Pointing out problems in the dynamical laws approach to physics and trying to find solutions is useful, even if constructor theory turns out not to be the best solution to them.

Instead of describing physical systems in terms of dynamical laws, constructor theory is expressed using ‘tasks’, which involve transformations of the attributes of ‘substrates’ (physical systems). Because tasks are specified in terms of attributes of substrates, rather than the substrates themselves, ‘substrate-independent’ tasks can be expressed. For example, if we have a task $T$, which takes an attribute $x_1$ or an attribute $x_2$ as input and transforms them respectively to $y_1$ or $y_2$, this task would be written:

$$T = \{x_1 \to y_1,\; x_2 \to y_2\}$$

If the task $T$ is possible within the laws of physics, this is written $T^{\checkmark}$. If it is impossible, we write $T^{\times}$. I don’t think that further discussion of the formalism is particularly fruitful, but I wanted to give a bit of a flavour of what constructor theory papers look like. In a later section, I’ll give a brief introduction to the constructor theory of information, but if you are interested in more details, they can be found in the papers here and here.
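As a toy illustration only (my own sketch, not the actual formalism from the papers), one can imagine encoding tasks as sets of input→output pairs over attribute labels, with possibility supplied as a separate predicate:

```python
# A task is a set of (input attribute, output attribute) pairs.
def task(*arrows):
    return frozenset(arrows)

# A toy 'physics': the set of tasks it declares possible. In real
# constructor theory, possibility means a constructor for the task can
# be physically built and reused; here it is just a lookup table.
POSSIBLE_TASKS = {
    task(("on", "off"), ("off", "on")),   # toggling a torch
}

def possible(t):
    return t in POSSIBLE_TASKS

toggle = task(("on", "off"), ("off", "on"))
heat_to_work = task(("heat", "work"))     # a cyclic heat-to-work converter

print(possible(toggle))        # True
print(possible(heat_to_work))  # False -- an impossibility statement
```

The point of the sketch is only that possibility/impossibility statements attach to tasks, not to any particular trajectory the substrate follows.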

I feel that I need to emphasise that constructor theory does not require that we abandon all attempts to describe the world using dynamical laws. Statements about the possibility and impossibility of certain physical transformations do not replace dynamical laws, but they can restrict the form of the laws which are allowed, and they may guide our search for new dynamical laws. Constructor theory aims to supplement the dynamical laws by providing a tool to tackle problems that are difficult to address within the dynamical laws framework.

What useful things have people done using Constructor Theory?

The short answer is ‘not a huge amount’. When I started this project, I assumed that there was a bit more to it, but the reality is that there are a handful of people working on it and they have produced around a dozen papers, several of which are not explicitly about constructor theory, but are somehow adjacent. Probably the most concrete and developed aspect of constructor theory is ‘The Constructor Theory of Information’, which formalises what it means for a physical system to embody information, in terms of which physical transformations are possible or impossible in that system. The theory is very general, applying to both quantum and classical information, and is substrate-independent, which, as discussed earlier, means that it does not rely on any specific physical embodiment. Though the concept of information is hard to express in terms of dynamical laws, it finds a natural expression in the formalism of constructor theory, in terms of statements about the possibility of performing certain tasks on a substrate.

This theory has then been used to make some arguments about the kind of behaviour one would expect to see in hybrid quantum-gravitational systems, without referring to the specific laws of physics that would be obeyed in those systems. In particular, the authors examine the kind of information-theoretic properties that gravity must have if it is able to generate entanglement between two quantum systems. There has also been some work on thermodynamics, and on formalising irreversibility and the second law in constructor-theoretic terms.

Aside: The Constructor Theory of Information

This section is for readers who would like some more details on the constructor theory of information, but it can safely be skipped by those who do not. I mentioned earlier that the Constructor Theory of Information uses the tasks which are possible or impossible to perform on a system as a way of defining what it means for a physical system to embody information. In particular, Deutsch and Marletto identify two tasks which must be possible on a substrate in order for that substrate to carry (classical) information. The first task is that of reversible computation, ie. performing some permutation over a set of at least two attributes. A simple example: if you had a torch that you wished to use to send Morse code, it would be no use if the torch’s switch didn’t work. You must be able to switch between turning the torch on and off in order to send a signal. In this example, the substrate would be the torch (and the space in front of the torch which gets illuminated) and ‘being on’ and ‘being off’ would be two attributes of the substrate. Performing a reversible computation on this substrate would be the process of permuting between the ‘on’ and ‘off’ states, ie. turning the torch on and off. This process of reversible computation is described as a constructor-theoretic task:

$$T_{\Pi} = \bigcup_{x \in S} \{x \to \Pi(x)\}$$

where $S$ is a set of attributes of the substrate, and $\Pi$ represents a permutation over the elements of this set.

The second task that must be possible for a substrate to embody information is that of copying or cloning. More specifically, it must be possible to copy the state from one substrate to another similar substrate. This is related to the fact that the different states of the substrate must be distinguishable. If the torch is completely hidden and there is no way to tell whether it is on or off, then you cannot distinguish the two states. On the other hand, if there is a second torch and it is somehow possible for you to copy the state of the first torch onto the second, this is a proxy for being able to distinguish the two states. This copying or cloning process is expressed as the following task, which is implemented on two similar substrates:

$$T_C = \bigcup_{x \in S} \{(x, x_0) \to (x, x)\}$$

where we have used the notation $(x, x_0)$ to indicate that the first substrate is in the state $x$ and the second substrate is in some generic initial state $x_0$. After the transformation, the first substrate is unchanged, and the second substrate is in the same state as the first. Some readers may be wondering if this copying task contradicts the no-cloning theorem in quantum mechanics. This is addressed in more detail in the paper, which covers both classical and quantum information. The short answer is that the no-cloning theorem forbids complete cloning of a quantum state (ie. cases where the set $S$ covers all attributes of the substrate), but does not prevent cloning over a restricted subset of the attributes of the substrate [3].

Both of these tasks (reversible computation and copying) must be possible for a substrate to carry information. Note that, if we described a system purely using the dynamical laws picture, we would not be able to express these possibility statements, and therefore would not be able to express the concept of information. Without getting too much into the weeds, I hope that this gives a flavour of what constructor-theory-based physics might look like.
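The two requirements can be sketched in code (my own toy encoding of the torch example, under the simplifying assumption that attributes are just labels):

```python
# Attributes of the torch substrate that we want to carry information.
S = ("on", "off")

# Task 1 (reversible computation): a non-trivial permutation over S,
# here flipping the switch.
def flip(x):
    return "off" if x == "on" else "on"

# Task 2 (copying): (x, blank) -> (x, x) for every x in S, ie. the state
# of the first torch can be copied onto a second, similar torch.
def copy(pair):
    x, _ = pair
    return (x, x)

# Toy checks that both tasks behave as required: flip is a permutation
# of S, and copy leaves the source unchanged while overwriting the target.
is_permutation = sorted(flip(x) for x in S) == sorted(S)
copies_correctly = all(copy((x, "blank")) == (x, x) for x in S)
print(is_permutation and copies_correctly)  # True
```

If either check failed — a stuck switch, or indistinguishable states — the substrate could not, on this account, carry classical information.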

What is the link between Constructor Theory and AI alignment?

Despite its limited success in physics, I thought that I had spotted a few parallels between the kinds of problems in physics that constructor theory was meant to address and the kind of problems that often get lumped under the banner of ‘AI Alignment’.

In both physics and AI research, systems are mostly described by a combination of ‘dynamical laws’ and ‘initial conditions’

I have already described the ‘dynamical laws and initial conditions’ approach which is widely used to understand physical systems. AIs are often modelled and created following a similar logic. In modern machine learning, the ‘initial conditions’ of a model consist of the input data (either training data, or a ‘real world’ input), along with the initial parameters of the model. The ‘dynamical laws’ consist of the algorithm which specifies how the output will be generated from the input (eg. matrix multiplication and an activation function) and how the weights and biases will update in response to the input (eg. gradient descent) [4]. More generally and abstractly, we might consider ‘the state of the universe’ as an initial condition, and ‘a decision theory with a utility function’ as dynamical laws which specify how an agent will react to any given state [5].
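A minimal sketch of the analogy (my own illustration, assuming a one-parameter model with squared-error loss): the gradient-descent update rule acts as the ‘dynamical law’, carrying the ‘initial conditions’ (data and initial weight) forward through parameter space.

```python
# 'Initial conditions': training data and an initial parameter.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x
w = 0.0                # initial weight
lr = 0.05              # learning rate

# 'Dynamical law': the gradient-descent update rule for the loss
# L(w) = mean((w*x - y)^2), applied repeatedly.
for _ in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # 2.0 -- the trajectory converges on the generating rule
```

Just as in physics, specifying the initial state and the update rule fixes the entire trajectory of the system.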

If we are convinced by this analogy, we might expect that some of the problems which arise from the dynamical laws approach in physics also present themselves in AI research. I will tentatively argue that they do and that they might be important when doing AI alignment research.

Both Constructor Theory and Alignment research seek to give us tools to reason about systems which we do not completely understand.

It is hoped that constructor theory will give us tools to reason about systems which are too complex to fully understand (in the same way that you do not need to know the torque on every gear of a complex machine to know that it won’t generate perpetual motion), by telling us that certain physical transformations are possible or impossible. It is also hoped that constructor theory will allow us to reason about systems where we do not know the dynamical laws which govern them, or to make predictions about systems which do not exist yet (for example, reasoning about hybrid quantum-gravitational systems). Similarly, one of the reasons that AI alignment is hard is that it involves reasoning about systems which do not exist yet (superintelligent AGIs) or systems which are too complex for a human to fully understand (like massive neural nets) [6]. In both cases, we do not know the ‘dynamical laws’ which will govern the system, but we still wish to reason about it.

Both are interested in what transformations are possible or impossible within a system, just as much as in knowing what will happen in any particular situation.

As previously noted, where the dynamical laws approach to physics tells you what will happen in a particular situation, constructor theory is concerned with formalising statements about what kinds of physical transformations are possible or impossible, regardless of whether they actually happen. Similarly, in AI alignment, we are concerned about what an AI actually does in training, but we are often more concerned about what it could do in a different domain. It is often hard to tell in advance how the AI will behave in a different domain, without just letting it go and seeing what happens. This issue is at the core of the distribution shift problem and the related threat model of the sharp left turn. In the distribution shift problem, an AI receives data which is distributed differently from its training data and takes actions which yielded good results on the training data, but do not yield good results on the new data. This reveals that the AI’s actions were not aligned with the goals of its creators. In the sharp left turn threat model, the primary argument is that capabilities generalise further than alignment. Broadly speaking, an AI may gain general capabilities from training in a limited domain, and its creators may be successful in making sure that it remains aligned with their intentions in that domain. The capabilities gained by the AI extend to other domains, but the alignment methods used by its creators no longer apply in those domains. It would be helpful to be able to rule out certain behaviours as impossible, regardless of the domain that the AI finds itself in, in the same way that we can rule out all perpetual motion machines as impossible.
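A toy illustration of distribution shift (my own sketch): a model fit to y = |x| while seeing only non-negative inputs learns the rule ‘y = x’, which is perfect on the training distribution and badly wrong as soon as the inputs shift to negative values.

```python
# Training distribution: non-negative inputs, target y = |x|.
train_x = [0.1 * i for i in range(1, 11)]        # 0.1 .. 1.0
train_y = [abs(x) for x in train_x]              # equals x on this range

# Least-squares fit of y = w * x (a line through the origin).
w = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

def mse(xs):
    """Mean squared error of the learned rule against the true target |x|."""
    return sum((w * x - abs(x)) ** 2 for x in xs) / len(xs)

# In-distribution, the learned rule is perfect...
print(mse(train_x))   # 0.0

# ...but under distribution shift it fails badly.
shift_x = [-x for x in train_x]
print(mse(shift_x))   # ~1.54 -- the fit to |x| did not generalise
```

Nothing in the training behaviour distinguishes ‘y = x’ from ‘y = |x|’; only the shifted inputs reveal which rule was actually learned.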

Both are concerned with emergent properties which are hard to capture from a reductionist, dynamical laws approach

As previously stated, one hope is that constructor theory will give us a new tool to reason about emergent features of the physical world, such as information and the second law of thermodynamics. I am using the word ‘emergent’ here to mean ‘properties that do not appear at the level of (reductionist/​microscopic) dynamical laws’. For example, you could look at the equations of motion for every particle of a classical gas and, due to Loschmidt’s paradox, you would not be able to derive the second law of thermodynamics from them alone. You will not find the concept of information at the level of dynamical laws, and yet it seems to be important for understanding the universe. If I built an incredibly complex machine, with a billion gears and axles, and claimed that it generated perpetual motion, you could write down a set of one billion equations in the dynamical laws picture which describe how every gear and axle will move over time. But just by looking at this complex set of equations, it would be difficult to tell whether I really had built a perpetual motion machine. It might require a lot of difficult calculation before you find out that the torque on the final gear is not large enough for the machine to work as claimed. Thankfully, you do not need to go through this process in order to know that the machine does not work, since we already have a principle in physics (or rather, a set of principles) which tells us that making a perpetual motion machine is impossible.

Similarly, AI alignment is often concerned with high-level emergent properties of AI systems, such as corrigibility, deceptiveness and explainability. In practice, we characterise AIs by (for example) their weights and biases, along with the algorithm which specifies how they update. Discerning the ‘higher-level’ properties, such as corrigibility or ‘will this AI kill everyone?’, is difficult if you are just looking at a billion-parameter matrix equation. In the example of the perpetual motion machine, we had a principle about the possibility/impossibility of certain tasks that allowed us to skip the process of solving a billion-parameter equation and cut straight to the main ‘emergent’ result that we wanted. Might it be possible to do something similar in the AI case?

Summary Table

In this table, I have summarised, as I see it, the analogies between the problems in physics which constructor theory seeks to address, and the AI alignment problem.

| | Physics | AI |
|---|---|---|
| Dynamical laws | Newton’s equations (classical), Von Neumann equation (quantum) | A training algorithm, specifying how weights and biases update in response to inputs |
| Initial conditions | Phase space coordinates (classical), the initial quantum state vector (quantum) | Input/training data, weights and biases |
| Emergent properties | Information, thermodynamics | Corrigibility, high-level interpretability |
| Case where we don’t understand the dynamical laws | Hybrid systems | AIs that don’t yet exist |
| Attempt to formalise the concept of possibility/impossibility | Constructor theory | ??? |
| Cases where the dynamical laws are too complex to fully analyse | A box containing an Avogadro number of interacting particles | Complex machine-learning systems |

How Constructor Theory might apply to AI Alignment

On the basis of these analogies (which I found compelling though admittedly vague), I decided to investigate whether constructor theory could be of any use in tackling the alignment problem. In particular, I was interested in whether the analogy between the impossibility of certain tasks in physics and the forbidding of certain actions by an AI could be made concrete.

This investigation was not very fruitful and I did not obtain any interesting results (other than that I enjoyed doing it and it improved my understanding of the field). In this section, I will explain the reasons why I found it hard to do anything concrete by applying constructor theory to AI alignment. I am not writing this because I want to discourage people from pursuing this avenue. I do not have much experience in alignment research, so it is perfectly possible that someone more clever and experienced than me might be able to do something useful with this approach. However, I would now like to identify the concrete roadblocks which prevented me from making progress in the last couple of months.

A Simple Example

I tried to find a simple toy example where I could apply the idea of having certain impossibility statements restrict the form of the laws by which an AI operates. I came up with one that demonstrates this (along with the problems it might cause). Warning: this example might be considered insultingly simple to anyone with machine learning experience. I apologise in advance, but don’t think that making it more complex adds anything. It is a single variable linear regression (which could be regarded as a neural net with one node and no activation function). One could extend this example to multiple variables with a more complex relationship to make it look more impressive, but this very simple example is more than sufficient.

The input x and output y of this model are both numbers between 0 and 1. The training data looks like this:

| In (x) | Out (y) |
|---|---|
| 0 | 0.01 |
| 0.1 | 0.09 |
| 0.8 | 0.81 |
| 0.9 | 0.89 |

In other words, the input is (approximately) the same as the output. We assume that there is plenty of data in the range 0 ≤ x ≤ 0.9 and it all follows this pattern. The variable x actually ranges from 0 to 1, but we don’t have any training data for x > 0.9. Also assume that an output y > 0.95 corresponds to a very bad outcome, which we want to avoid. In this example the ‘alignment problem’ is getting the model to perform well in the range 0 ≤ x ≤ 0.9 while not returning an output larger than 0.95 for any other input.

If we perform a linear regression (or ‘train’ our AI on this data), it will search for a function of the form y = ax + b. Eventually, it will converge to the form y ≈ x, with a ≈ 1 and b ≈ 0. It is perfectly possible, depending on the initial values of a and b and the training algorithm used to alter them, that during the entire training process, we would not encounter an output in the ‘danger zone’, y > 0.95, so naively we might expect that this will never happen, and our neural net is safely aligned[7]. We might then test our neural net with more data from the range 0 ≤ x ≤ 0.9 and again find that it never gives an output greater than 0.95, which might deceive us into thinking that it is aligned. But if we feed the system x = 1, we will get y ≈ 1, which, as specified earlier, corresponds to a bad outcome. This is, as far as I can see, the simplest possible example of a distribution shift—the AI was only trained and tested on data in the range 0 ≤ x ≤ 0.9, so wasn’t equipped to avoid our bad outcome when subjected to off-distribution data.
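The toy example can be sketched in a few lines of Python. This is a minimal illustration written for this post (the helper `fit_linear` and the way the failure is probed are my own choices, not anything from the original):

```python
# Toy 'alignment' example: fit y = a*x + b by least squares on data
# drawn only from the range 0 <= x <= 0.9, then probe the model with
# the off-distribution input x = 1.

def fit_linear(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Training data from the table above: input approximately equals output.
xs = [0.0, 0.1, 0.8, 0.9]
ys = [0.01, 0.09, 0.81, 0.89]

a, b = fit_linear(xs, ys)
print(a, b)  # a is close to 1, b is close to 0

# No training point ever lands in the 'danger zone' y > 0.95 ...
assert all(a * x + b <= 0.95 for x in xs)

# ... but the off-distribution input x = 1 gives an output of roughly 1,
# which is greater than 0.95: the bad outcome.
print(a * 1.0 + b)
```

Nothing in training or testing on 0 ≤ x ≤ 0.9 hints at the failure; it only appears when the model is probed outside the training distribution.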

What would it look like if we tried to address this problem using an approach inspired by constructor theory?

Firstly, we want to make sure, not only that outputs satisfying y > 0.95 do not appear during training, but also that such outputs are impossible for any input 0 ≤ x ≤ 1. We wish to make an impossibility statement which will constrain the laws by which our AI evolves. This is different to the naive approach, where we (incorrectly) assumed that an output y > 0.95 was impossible because it didn’t appear in the training data. The impossibility of getting y > 0.95 can be realised by several different constraints on the ‘dynamical laws’, but here is a simple one: we specify that a + b ≤ 0.95 (recall that x ≤ 1).

Training the AI to optimise a and b to predict the training data as accurately as possible, with the additional restriction that a + b ≤ 0.95, will yield a model with a ≈ 0.94 and b ≈ 0.01, which performs worse on the training data, but has the advantage that we can be sure that it is impossible for it to give us the bad outcome.
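One way this constrained training could be sketched (the projected-gradient-descent approach here is my own illustrative choice, not the only way to enforce the constraint): after each gradient step, push (a, b) back into the region a + b ≤ 0.95, so that an output above 0.95 is impossible for any input 0 ≤ x ≤ 1.

```python
# Constrained version of the toy example: enforce a + b <= 0.95 during
# training, which makes y = a*x + b > 0.95 impossible for any 0 <= x <= 1
# (since, with a >= 0, the output is largest at x = 1, where it equals a + b).

xs = [0.0, 0.1, 0.8, 0.9]
ys = [0.01, 0.09, 0.81, 0.89]

a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the mean squared error with respect to a and b.
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    a -= lr * grad_a
    b -= lr * grad_b
    # Projection step: if the constraint a + b <= 0.95 is violated,
    # move (a, b) to the nearest point on the boundary a + b = 0.95.
    excess = a + b - 0.95
    if excess > 0:
        a -= excess / 2
        b -= excess / 2

print(a, b)  # the fit is slightly worse, but a + b never exceeds 0.95

# The bad outcome is now impossible over the whole input range 0 <= x <= 1.
assert max(a * x + b for x in [i / 100 for i in range(101)]) <= 0.95 + 1e-9
```

The constraint is enforced by the training procedure itself, not checked after the fact, which is the point of the analogy: the impossibility statement reshapes the ‘dynamical laws’ rather than merely describing their output.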

This is analogous to the way in which impossibility statements restrict the form of the dynamical laws of physics and prevent certain higher-level processes from happening. We have provided an impossibility statement (‘it is impossible for the total a + b to be greater than 0.95’) which restricts the way in which the parameters of the neural net update (ie. changes the ‘dynamical laws’ of the system), which in turn prevents undesirable outcomes. This simple example demonstrates the kind of analogy I was hoping to draw between the two fields. However, this example also highlights some ways in which this analogy is unsatisfactory.

Problem 1: The analogy between the two areas breaks down

The kinds of behaviour forbidden by the laws of physics are well-defined. It is possible to define clearly what a perpetual motion machine is, or a heat engine which violates the second law of thermodynamics. The fact that it is impossible to do these things rules out a vast number of complex machines from ever being built. But all of those impossible machines share a few simple features which can be characterised mathematically (eg. they ultimately convert heat entirely into work in a cyclic process) which means that we can rule them out. Similarly, in the toy example in the previous section, the AI behaviour that we wanted to forbid was simple and easily mathematically quantified (‘don’t let the output be greater than 0.95’).

However, as far as I can see, the kinds of behaviour we might wish to forbid in AI are often quite subtle and varied. The multitude of possible bad AI outcomes that we might wish to forbid do not share some easily identifiable common feature in the same way that impossible heat engines or perpetual motion machines do. It seems that if we could easily and mathematically characterise a common feature of all bad AI outcomes (as we assumed in the example), then we would have made significant progress on the alignment problem. If this could be done, then I suspect that ensuring that these outcomes are impossible within an AI system would be fairly straightforward, regardless of whether one uses an approach ‘inspired by constructor theory’ or not. In hindsight this is obvious: ‘it is hard to characterise exactly what you want an AI to do’ is just another statement of the alignment problem. Therefore, on this issue it seems to me that the analogy of AI alignment with physics and constructor theory does not contribute anything helpful.

Problem 2: Constructor theory is not very developed and has very little in the way of tools which can be imported to alignment

Before I went into this project, I thought that constructor theory might have more concrete and general tools and theorems. In particular, I thought that there might be some general results elaborating the links between dynamical laws and impossibility statements. In the toy example, solving the problem was made easy for two reasons. First, as described in the previous section, the ‘bad outcomes’ which we wanted to avoid had an easy mathematical form. Second, due to the simplicity of the problem, it was clear to us how to make a link between the condition for a bad outcome and the restrictions we could put on the laws of the system to make this impossible (ie. it was easy to spot that the condition y ≤ 0.95 would give us the restriction a + b ≤ 0.95 that we needed). It is not clear to me that in most cases, even if we do have a mathematically explicit formulation of what makes a ‘bad outcome’, we will clearly be able to identify the conditions to put on the laws of the system in order to prevent the bad outcome. It seems plausible that, in most practical cases, the laws and the conditions will be too complex to find a simple link between them. I had hoped that constructor theory might have some general results related to this problem (finding links between impossibility statements and the dynamical laws which they restrict).

Instead, constructor theory has a few results which are restricted to quite specific parts of physics. I can find no concrete theorems from constructor theory which could apply to AI alignment. Rather, I have been using it as a vague philosophical framework for discussing AI. This is not to undermine it (vague philosophical frameworks can be useful!) but I think that this highlights an important limitation in the approach that I tried to take.

Problem 3: Reforming the study of AI in terms of constructor theory would be as ambitious as reforming physics

Instead of importing a few tools here and there from constructor theory to help with alignment, one might be tempted to begin a project of re-framing all AI work in terms of possibility/impossibility statements. The implication behind a lot of work in constructor theory is that the ultimate goal is to reframe all of physics in terms of possibility/impossibility statements, and that this might help us to avoid the problems with the dynamical laws approach. A large list of such statements would entirely characterise physics (or at least, that’s the idea), with all the dynamical laws falling out as a side-effect. Characterising physics in this way would, allegedly, eliminate the problems with the dynamical laws approach to physics. Could we re-frame AI in a similar way? If we write a list of things we want it to be possible (and impossible) for the AI to do, will the appropriate algorithm just appear as a side-effect of this list?

There are a couple of problems with this, the first of which is that it would be very ambitious and time-consuming (it was certainly too ambitious for me to attempt over the last couple of months). Secondly, as with the previous problem, this just seems to be passing the buck. If we could find an easy way to characterise all undesirable AI behaviours, then maybe re-framing AI in terms of possibility/impossibility statements would work. But if we could do this, then it seems that we would already be most of the way to solving alignment, and again the analogy with constructor theory has not added anything helpful.

Conclusions

I think that these problems and disanalogies between alignment and constructor theory are strong enough that I will not continue pursuing this agenda. The point of this project was to perform a low-cost exploration of a very speculative idea based on a couple of intuitions and see if it worked out. It didn’t. AI alignment is a hard problem, and I do not understand it very well, therefore I am not too surprised that my speculative approach, based on an idea from physics (which itself is pretty speculative), did not yield spectacular results.

Near the start, I quoted John Wentworth as saying ‘progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans’ and this is what I have tried to do in this post. A lot of the problems I came across in the last section seem obvious in hindsight and I feel a bit silly writing them explicitly. The fact that I feel like this is (I think) a good sign, as it at least shows that I understand the alignment problem a bit more clearly than I did at the start of the project. I hope to be back soon to share some more bad ideas.

This work was funded by a grant from ACX Grants.

  1. ^

In classical mechanics, initial conditions are specified by the initial phase space coordinates (ie. the positions and momenta of each particle in the system) and the dynamical laws which specify how they evolve over time are given by Newton’s equations (or Lagrange’s or Hamilton’s equations). In quantum mechanics, the initial conditions are specified by a quantum state vector (or density operator) and the dynamical laws are specified by the Schrödinger equation (or the Von Neumann equation). General relativity can also be framed in this way, though with extra subtleties due to its treatment of time. This is not an area I know very well, so I will avoid using it as an example, as there are probably important aspects which I don’t understand properly.

  2. ^

The phrase ‘modal statements’ or something similar might be an alternative for describing statements which refer to possibility or impossibility. I have chosen not to use this. The word ‘modal’ is normally used to refer to logical notions of possibility, whereas here we are concerned with physical possibility. I won’t claim that building a perpetual motion machine is logically impossible, but I’m pretty sure that, in this universe, it is physically impossible. I’m sure that the relationship between logical and physical possibility is very interesting, but I don’t want to get into it here.

  3. ^

For example, if I knew that a qubit was prepared in either the state |0⟩ or the state |1⟩, then it is possible to devise a protocol to clone the state (ie. measure in the {|0⟩, |1⟩} basis and use the result to transform an auxiliary qubit to the same state).

  4. ^

    The analogy is slightly different whether the model is being trained or deployed. For brevity I combined the two here.

  5. ^

    It is less clear to me that this is a good analogy.

  6. ^

    Obviously there are degrees of understanding. A human may understand everything about how an algorithm works, but that doesn’t mean that they will be able to perfectly predict its behaviour without running the algorithm. The engineers who made AlphaGo may understand it, but would not be able to predict its next move without running a full simulation of AlphaGo. Similarly, you might know all the laws of physics governing a machine but, for reasons of complexity, not be able to predict how it will behave. I will here use the word ‘understand’ loosely to mean ‘understand well enough to predict the future behaviour’.

  7. ^

Ok, I imagine most people can see where this is going, and can see that, when fed the input x = 1, the AI will give an undesirable output. This example was deliberately chosen for its simplicity. In practice with large neural nets, we are not just able to ‘see’ whether they are aligned. We have to go by how they perform with the training data, which is what I was trying to capture with this example.