On predictability, chaos and AIs that don’t game our goals

I want to thank @Ryan Kidd, @eggsyntax and Jeremy Dolan for useful discussions and for pointing me to several of the resources (mentioned throughout this post) that helped me connect my own ideas with those of others.

Executive summary

Designing an AI that aligns with human goals presents significant challenges due to the unpredictable nature of complex systems and the limitations in specifying precise initial conditions. This post explores these challenges and connects them to existing AI alignment literature, emphasizing three main points:

  1. Finite predictability in complex systems: Complex systems exhibit a finite predictability horizon, meaning there is a limited timeframe within which accurate predictions can be made. This limitation arises from the system’s sensitivity to initial conditions and from the complex interactions within it. Small inaccuracies in initial conditions can lead to significant deviations over time, making long-term predictions inherently unreliable.

  2. Hidden variables or inaccurate initial conditions: The precision of the initial conditions directly impacts the predictability of complex systems. This is particularly relevant because perfect precision is practically impossible to achieve due to measurement errors, environmental noise, and unknown variables. The two sources of error (missing variables and inaccurate initial conditions) can become confounded, blurring which variables are necessary for accurate predictions.

  3. AI optimization and loss functions: AI systems based on reinforcement learning (RL) optimize specific loss or reward functions that often simplify complex goals, which can lead to “gaming” of these objectives. This means the AI might exploit unintended pathways to meet the specified criteria while neglecting broader, more nuanced aspects of the intended goals.

Taking these together, I argue that it seems practically impossible to design AI that does not eventually game its goals. Unpredictability poses a fundamental challenge for RL-based AI systems, which care only about the specified reward function, and, by points 1 and 2, that function cannot be completely specified.


Introduction

Designing AI that aligns with human goals is challenging for many reasons explored elsewhere. However, in this post I argue that, due to the unpredictable nature of complex systems and the limitations in specifying precise initial conditions, an extra layer of difficulty arises in the context of outer alignment.

By examining the interplay between finite predictability, initial conditions, and the optimization of loss functions[1], I aim to explain why designing non-gaming AI seems practically impossible and to discuss potential strategies to mitigate these issues.


Finite predictability in complex systems

Complex systems are those characterized by numerous interacting components (no matter how simple these constituents or their interactions are). Typical examples are electrical power grids, networks of airports, the immune system or, perhaps the quintessential example, the human brain. For this post, we are interested in a well-known phenomenon these systems exhibit: they have a finite prediction horizon, i.e., a limited time frame within which we can accurately predict the system’s behavior given our understanding of its initial conditions and dynamics.

Crucially, however, the precision with which we can specify initial conditions directly influences how forecastable the future of these systems is. As chaos theory predicts, the more accurately we know the initial conditions, the further into the future we can predict the system’s behavior. However, perfect precision is unattainable in practice and, because of quantum uncertainty, in principle as well. Real-world scenarios are fraught with measurement errors, environmental noise, and unknown variables, all contributing to the difficulty of specifying exact initial states.
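To get a rough quantitative feel for this (a standard back-of-the-envelope result from chaos theory, added by me here rather than taken from the sources I cite later): if the system’s largest Lyapunov exponent is $\lambda > 0$, an initial error $\delta_0$ grows roughly as $\delta(t) \approx \delta_0 e^{\lambda t}$, so the horizon before the error exceeds some tolerance $\Delta$ is

$$T \approx \frac{1}{\lambda} \ln \frac{\Delta}{\delta_0}.$$

The horizon grows only logarithmically as precision improves: measuring the initial state ten times more accurately buys a fixed increment of predictability, not ten times more of it.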

Due to the inherent sensitivity to initial conditions and the complex interactions, even slight inaccuracies can lead to significant deviations in predictions over time. This concept, often illustrated by the popularized “butterfly effect,” highlights how small changes in initial conditions can lead to vastly different outcomes, underscoring the challenge of modelling and forecasting real-world scenarios accurately.
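To make this concrete, here is a minimal Python sketch (a toy illustration of my own, not taken from any of the sources above) using the logistic map in its chaotic regime. It measures how long a trajectory started a tiny distance away from a reference trajectory stays within a fixed tolerance of it; shrinking the initial error by several orders of magnitude only buys a handful of extra steps.

```python
# Toy sketch: finite prediction horizon under imprecise initial conditions.
# Uses the logistic map x_{n+1} = r * x_n * (1 - x_n) with r = 4 (chaotic).

def logistic_trajectory(x0, r=4.0, steps=200):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def prediction_horizon(x0, eps, tol=0.1, steps=200):
    """First step at which a trajectory perturbed by eps deviates by more than tol."""
    reference = logistic_trajectory(x0, steps=steps)
    perturbed = logistic_trajectory(x0 + eps, steps=steps)
    for n, (a, b) in enumerate(zip(reference, perturbed)):
        if abs(a - b) > tol:
            return n
    return steps

if __name__ == "__main__":
    for eps in (1e-4, 1e-8, 1e-12):
        horizon = prediction_horizon(x0=0.2, eps=eps)
        print(f"initial error {eps:.0e} -> horizon of roughly {horizon} steps")
```

Each ten-thousand-fold improvement in initial precision should add only a dozen or so steps of predictability, matching the logarithmic scaling sketched above.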

All of this means that, in the context of complex systems, the inability to predict future outcomes can arise from two primary sources. First, we may not have accounted for all relevant variables, a scenario that is extensively recognized within the AI community. Second, we may lack precision in the initial conditions, and these minor inaccuracies amplify over time. In practice the two are hard to tell apart, and this degeneracy will be particularly relevant a bit later in the post.
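As a small illustration of that degeneracy (again a toy of my own making, with all names and parameters chosen arbitrarily), the sketch below compares two forecasts of the same “true” system, a chaotic map weakly driven by a hidden variable. One forecast omits the hidden variable entirely; the other includes it but starts from a slightly wrong initial condition. Looking only at how the forecast error grows, the two failure modes are hard to tell apart.

```python
# Toy sketch: a missing variable and an imprecise initial condition both
# produce forecast errors that start small and blow up to the same order
# of magnitude, so the error curve alone does not reveal which is at fault.
import math

def true_system(x0, steps, r=3.9, c=0.01):
    """Chaotic logistic map weakly modulated by a hidden periodic driver."""
    xs, x = [x0], x0
    for n in range(steps):
        hidden = math.sin(0.5 * n)                    # the "hidden" variable
        x = r * x * (1.0 - x) * (1.0 + c * hidden)    # stays inside [0, 1]
        xs.append(x)
    return xs

def model_without_hidden_variable(x0, steps, r=3.9):
    """Same dynamics, but with the hidden driver left out."""
    xs, x = [x0], x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs

if __name__ == "__main__":
    x0, steps = 0.2, 25
    truth = true_system(x0, steps)
    missing = model_without_hidden_variable(x0, steps)   # error source 1
    imprecise = true_system(x0 + 1e-3, steps)             # error source 2
    for n in range(0, steps + 1, 5):
        print(f"step {n:2d}  "
              f"missing-variable error {abs(truth[n] - missing[n]):.3f}  "
              f"initial-condition error {abs(truth[n] - imprecise[n]):.3f}")
```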


AI optimization and loss functions

AI systems that are based on RL are designed to optimize specific loss or reward functions, which guide the AI’s behavior. However, these functions simplify complex goals, omitting factors that are crucial to the outcomes we actually care about. As a result, the AI may find ways to “game” these goals, optimizing for the specified criteria while producing broader, unintended consequences. This phenomenon, known as “reward hacking”, has been extensively explored elsewhere (for example, here, here or here), so I will not get into all the details. For completeness, I will just mention that it illustrates how AI can exploit loopholes or unintended strategies to satisfy formal criteria without achieving the underlying intent. And this is no bueno.
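For readers who prefer a concrete (if deliberately cartoonish) picture, here is a tiny Python toy of my own; the action names, numbers and the “cover the sensor” trick are all invented for illustration. The optimizer only ever sees the proxy reward we wrote down, so it happily picks the action that games it.

```python
# Toy sketch of proxy gaming: the agent maximizes the written-down reward
# (a dirt-sensor reading) and ignores what we actually wanted (a clean room).

# Each action: its effect on what the sensor reads vs. on the actual dirt.
ACTIONS = {
    "vacuum":       {"measured_dirt": -5,  "actual_dirt": -5},
    "cover_sensor": {"measured_dirt": -10, "actual_dirt": 0},   # games the proxy
    "do_nothing":   {"measured_dirt": 0,   "actual_dirt": 0},
}

def proxy_reward(effect):
    # What we managed to specify: "a lower sensor reading is better".
    return -effect["measured_dirt"]

def true_utility(effect):
    # What we actually wanted: less dirt in the room.
    return -effect["actual_dirt"]

if __name__ == "__main__":
    chosen = max(ACTIONS, key=lambda a: proxy_reward(ACTIONS[a]))
    print("action chosen by the proxy optimizer:", chosen)              # cover_sensor
    print("proxy reward it receives:", proxy_reward(ACTIONS[chosen]))   # 10
    print("true utility it produces:", true_utility(ACTIONS[chosen]))   # 0
    print("true utility of 'vacuum':", true_utility(ACTIONS["vacuum"])) # 5
```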

I believe that the issue of AI gaming its goals is akin to the problem of world-model mismatches, where AI systems might have different understandings or representations of the world compared to humans, leading to misaligned actions. Ensuring that AI systems can align their models with human intentions is crucial for avoiding these mismatches.

What is, to my mind, important for this post is to highlight that RL-based systems only care about what is explicitly included in their reward function. The relevance of this will become more apparent in the following section.


Why do we care about these issues for safe AI design?

Okay, enough with the big picture concepts. Is this discussion any good for advancing AI Safety? I believe it is. Let’s see some concrete examples:

In this post, when caveating their proposed approach, the author states:

Firstly, our model may not be capable enough to learn the human likelihood/prior functions, even given plenty of IID examples.

I would go a step further back and argue that, by my points 1 and 2, humans themselves cannot learn the right likelihood functions; this is due to how nature seems to work, not to a human limitation. One could of course argue that our science is limited by our cognitive capacity, but I see no way out of this logical loop. In any case, I am not aware of any evidence suggesting that machines would be better at this than humans. They then go on to say:

Optimising z (text instructions on how to label images) is hard; we’d probably need a better way of representing z and exploring the space of zs than just searching over long strings of text. One way to improve might be to have our human labellers generate different hypotheses for what different breeds look like, then train a model to imitate this hypothesis generation.

Again, this assumes that we can generate the right set of hypotheses for the model to learn how to imitate. By the same reasoning as before (points 1 and 2), I am skeptical this can actually happen, especially in tasks that are not already solved. In the comments of that same post, @Richard_Ngo points out:

It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements—this is the main reason that we need science.

Which I definitely agree with. Nevertheless, precisely because of what I argue in this post, science tells us that AIs cannot help in these scenarios either. Building good priors necessarily means taking all relevant variables into account and specifying their initial conditions to the best extent possible.

Later in the same comment, they write:

z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can't be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.

Even though I mainly agree with this one, I’d add that, in principle, these complicated interactions between components of z might be unknowable (as before, because of points 1 and 2).

As another relevant example, imagine that we are trying to train a model to explain its predictions. One reason why we might end up with a misaligned model is that:

true explanations might be less convincing than false explanations sometimes if the latter seem more plausible due to incomplete data, or human biases and mistakes.

Which is particularly relevant in this case, given that the true explanation of why a given forecast is inaccurate might entail an ever-increasing level of complexity. Namely, it may well be that we did not specify our initial conditions with enough accuracy, that we did not include all relevant variables for optimization, or both.

Finally, in this post (in the context of comparing human and AI world models) the author states:

Of course in practice the human may make errors and will have cognitive limitations. But if we use the kinds of techniques discussed in Teaching ML to Answer Questions Honestly, we could hope to learn something like HumanAnswer instead of the human’s approximation to it.

I argue we should be more pessimistic: in real-world scenarios, with multiple interacting elements and non-linear dynamics at play, there is no a priori reason to believe that a simple world model (or even a loss function) will remain useful for prediction beyond a short time horizon. And this horizon shrinks as we add more complex dynamics to the mix.
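As a last toy sketch (again my own construction, not taken from the quoted post): fit the simplest possible world model, a one-step linear map, to data generated by a chaotic system and then roll it forward. Its forecasts stop tracking the real trajectory almost immediately, and by the logarithmic-horizon argument above, richer but still imperfect models only buy a modest extension.

```python
# Toy sketch: a linear "world model" fitted to chaotic data loses the real
# trajectory within a few steps when rolled forward.

def logistic(x, r=3.9):
    return r * x * (1.0 - x)

def generate(x0=0.2, n=500):
    xs = [x0]
    for _ in range(n):
        xs.append(logistic(xs[-1]))
    return xs

def fit_linear(xs):
    """Least-squares fit of x_{n+1} ~ a * x_n + b (a deliberately simple model)."""
    x, y = xs[:-1], xs[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

if __name__ == "__main__":
    data = generate()
    a, b = fit_linear(data[:400])          # "train" the simple world model
    x_true = x_model = data[400]           # both start from the same state
    for step in range(1, 11):
        x_true = logistic(x_true)          # what the world actually does
        x_model = a * x_model + b          # what the simple model predicts
        print(f"step {step:2d}  true {x_true:.3f}  linear model {x_model:.3f}")
```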


Conclusion

I would formulate this problem as follows: given the inherent unpredictability of complex systems and the difficulty in specifying comprehensive loss functions, it seems practically impossible to design an AI that doesn’t eventually game our goals. The combination of finite predictability, imperfect initial conditions, and simplified loss functions means that not all relevant factors can be accounted for, making it likely that AIs will exploit unintended pathways to achieve the goals we have stated. Thus, I believe that this is another challenge to be added to the outer alignment framework.

Hopefully by factoring in these potential challenges, we can, as a field, figure out a way to address them and get a bit closer to achieving safer AI systems.


  1. ↩︎

    I think it would be interesting to explore the link between some of these ideas and the ones explored in this other post.