On predictability, chaos and AIs that don’t game our goals
I want to thank @Ryan Kidd, @eggsyntax and Jeremy Dolan for useful discussions and for pointing me to several of the relevant resources (mentioned in this post) that I have used for linking my own ideas with those of others.
Executive summary
Designing an AI that aligns with human goals presents significant challenges due to the unpredictable nature of complex systems and the limitations in specifying precise initial conditions. This post explores these challenges and connects them to existing AI alignment literature, emphasizing three main points:
1. Finite predictability in complex systems: Complex systems exhibit a finite predictability horizon, meaning there is a limited timeframe within which accurate predictions can be made. This limitation arises from the system’s sensitivity to initial conditions and from the complex interactions within it. Small inaccuracies in initial conditions can lead to significant deviations over time, making long-term predictions inherently unreliable.
2. Hidden variables or inaccurate initial conditions: The precision of initial conditions directly impacts the predictability of complex systems. This is particularly relevant given that achieving perfect precision is practically impossible due to measurement errors, environmental noise, and unknown variables. The two sources of errors (missing variables and inaccurate initial conditions) can get confounded and blur what variables are necessary for accurate predictions.
3. AI optimization and loss functions: AI systems that are based on Reinforcement Learning optimize specific loss or reward functions that often simplify complex goals, leading to potential “gaming” of these objectives. This means AI might exploit unintended pathways to meet specified criteria, neglecting broader, more nuanced aspects of the intended goals.
Taking these together, I argue that it seems practically impossible to design AI that does not eventually game its goals. This is because unpredictability poses a fundamental challenge for RL-based AI systems, which only care about the specified reward function, and, by points 1 and 2, this function cannot be completely specified.
Introduction
Designing AI that aligns with human goals is challenging for many reasons explored elsewhere. However, in this post I argue that, due to the unpredictable nature of complex systems and the limitations in specifying precise initial conditions, an extra layer of difficulty arises in the context of outer alignment.
By examining the interplay between finite predictability, initial conditions, and the optimization of loss functions[1], I aim to understand why designing non-gaming AI seems practically impossible and to discuss potential strategies to mitigate these issues.
Finite predictability in complex systems
Complex systems are those characterized by numerous interacting components (no matter how simple these constituents or interactions are). Typical examples include electrical power grids, airport networks, the immune system and, perhaps the quintessential example, the human brain. For this post, we are interested in a well-known phenomenon that these systems exhibit: there is a finite time horizon within which their future configuration can be predicted. In other words, this horizon is the time frame within which we can accurately predict the system’s behavior, given our understanding of the initial conditions and of the system’s dynamics.
Crucially, however, the precision with which we can specify initial conditions directly influences how forecastable the future of these systems is. As chaos theory predicts, the more accurately we know the initial conditions, the further into the future we can predict the system’s behavior. However, perfect precision is not achievable in practice, nor even in theory, because of quantum uncertainty. Real-world scenarios are fraught with measurement errors, environmental noise, and unknown variables, all contributing to the difficulty of specifying exact initial states.
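One common way to make this quantitative is sketched below. It rests on simplifying assumptions that are mine and are not derived in this post: a small initial error \delta_0 grows exponentially at a rate set by the system’s largest Lyapunov exponent \lambda > 0, and a forecast stops being useful once the error exceeds some tolerance \Delta.

```latex
\delta(t) \approx \delta_0 \, e^{\lambda t}
\quad\Longrightarrow\quad
T_{\text{horizon}} \approx \frac{1}{\lambda} \, \ln\frac{\Delta}{\delta_0}
```

Under these assumptions the horizon grows only logarithmically with precision: measuring the initial conditions ten times more accurately extends T_horizon by a fixed increment of \ln(10)/\lambda, rather than multiplying it by ten.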
Due to the inherent sensitivity to initial conditions and the complex interactions, even slight inaccuracies can lead to significant deviations in predictions over time. This concept, often illustrated by the popularized “butterfly effect,” highlights how small changes in initial conditions can lead to vastly different outcomes, underscoring the challenge of modelling and forecasting real-world scenarios accurately.
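The sensitivity described above can be illustrated with a toy system. The sketch below uses the logistic map with r = 4, a standard textbook example of chaos that I am choosing here purely for illustration; the function name, starting point and threshold are all hypothetical choices, not something taken from the literature discussed in this post.

```python
def divergence_time(x0, eps, threshold=0.1, r=4.0, max_steps=1000):
    """Count the steps until two logistic-map trajectories that start
    `eps` apart differ by more than `threshold`.

    Illustrative toy only: the map, the threshold and the starting
    point are assumptions made for this example.
    """
    a, b = x0, x0 + eps
    for t in range(max_steps):
        if abs(a - b) > threshold:
            return t
        # Same deterministic rule applied to both trajectories.
        a = r * a * (1.0 - a)
        b = r * b * (1.0 - b)
    return max_steps  # never diverged within the budget


# Same system, same starting point, two different measurement precisions.
coarse = divergence_time(0.2, 1e-4)
fine = divergence_time(0.2, 1e-12)
print("horizon with error 1e-4 :", coarse, "steps")
print("horizon with error 1e-12:", fine, "steps")
```

In this toy, improving the initial precision by eight orders of magnitude buys only a modest number of extra steps before the forecast becomes useless, which is the sense in which the horizon is finite.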
All of this means that, in the context of complex systems, the inability to predict future outcomes can fundamentally arise from two primary sources. Firstly, it can be attributed to not accounting for all relevant variables, a scenario that is extensively recognized within the AI community. Secondly, it can result from a lack of precision in the initial conditions, which becomes increasingly problematic as these minor inaccuracies amplify over time. This degeneracy (the difficulty of telling the two sources apart from the forecast error alone) will be particularly relevant a bit later in the post.
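To make the degeneracy between the two sources concrete, here is a minimal sketch under assumptions of my own: the "world" is a logistic map (r = 3.9) that may include a tiny periodic forcing term standing in for a hidden variable, while the forecaster’s model never includes that term. All names and constants are hypothetical and chosen only for illustration.

```python
import math

R = 3.9          # assumed map parameter (chaotic regime)
FORCING = 1e-6   # assumed size of the hidden variable's influence


def model_step(x):
    """The forecaster's model: it knows nothing about the hidden term."""
    return R * x * (1.0 - x)


def world_step(x, t, hidden):
    """The 'true' system, optionally driven by a hidden variable."""
    extra = FORCING * math.cos(0.3 * t) if hidden else 0.0
    return model_step(x) + extra


def horizon(x_world, x_model, hidden, threshold=0.1, max_steps=500):
    """Steps until the forecast error exceeds `threshold`."""
    for t in range(max_steps):
        if abs(x_world - x_model) > threshold:
            return t
        x_world = world_step(x_world, t, hidden)
        x_model = model_step(x_model)
    return max_steps


# Case A: complete model, slightly wrong initial condition.
h_bad_initial_condition = horizon(0.3, 0.3 + 1e-6, hidden=False)
# Case B: exact initial condition, but one variable is missing.
h_missing_variable = horizon(0.3, 0.3, hidden=True)

print("wrong initial condition:", h_bad_initial_condition, "steps")
print("missing variable       :", h_missing_variable, "steps")
```

In both cases the forecaster simply observes a forecast that is good for a while and then fails. In this toy, nothing in the forecast error alone says whether the fix is a better measurement or a richer model.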
AI optimization and loss functions
AI systems that are based on RL are designed to optimize specific loss or reward functions, guiding the AI’s behavior. However, these functions simplify complex goals, omitting factors crucial for achieving desired outcomes. As a result, AI may find ways to “game” these goals, optimizing for the specified criteria while neglecting broader, unintended consequences. This phenomenon, known as “reward hacking”, has been extensively explored elsewhere (for example, here, here or here), so I will not get into all the details. However, for completeness, I will just mention that this idea illustrates how AI can exploit loopholes or unintended strategies to satisfy formal criteria without achieving the underlying intent. And this is no bueno.
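As a minimal sketch of what gaming a simplified objective looks like, consider a hypothetical cleaning agent whose reward is "the camera sees no dirt". The actions and all numbers below are invented for illustration; they are not measurements from any real system.

```python
# Each hypothetical action has a proxy reward (what the reward function
# measures) and a true utility (what the designer actually wanted).
actions = {
    "vacuum the dirt":          {"proxy_reward": 0.9, "true_utility": 0.9},
    "cover the dirt with a rug": {"proxy_reward": 1.0, "true_utility": 0.1},
    "do nothing":               {"proxy_reward": 0.0, "true_utility": 0.0},
}


def best_action(score_name):
    """Return the action that maximizes the given score."""
    return max(actions, key=lambda name: actions[name][score_name])


chosen_by_optimizer = best_action("proxy_reward")
wanted_by_designer = best_action("true_utility")

print("optimizer picks:", chosen_by_optimizer)
print("designer wanted:", wanted_by_designer)
```

The optimizer is not malfunctioning here: it does exactly what the proxy asks. The gap comes entirely from what the proxy leaves out.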
I believe that the issue of AI gaming its goals is akin to the problem of world-model mismatches, where AI systems might have different understandings or representations of the world compared to humans, leading to misaligned actions. Ensuring that AI systems can align their models with human intentions is crucial for avoiding these mismatches.
What is, to my mind, important for this post is to highlight that RL-based systems only care about what is explicitly included in their reward function. The relevance of this will become more apparent in the following section.
Why do we care about these issues for safe AI design?
Okay, enough with the big picture concepts. Is this discussion any good for advancing AI Safety? I believe it is. Let’s see some concrete examples:
In this post, when caveating their proposed approach, the author states:
Firstly, our model may not be capable enough to learn the human likelihood/prior functions, even given plenty of IID examples.
I would take a step back and argue that, by my points 1 and 2, humans themselves cannot learn the right likelihood functions because of how nature seems to work, not because of a human limitation. One could of course argue that our science is limited by our cognitive capacity, but I see no way out of this logical loop. In any case, there is no evidence (that I am familiar with) suggesting that machines would be better at this than humans. Then, they go on and say:
Optimising z (text instructions on how to label images) is hard; we’d probably need a better way of representing z and exploring the space of zs than just searching over long strings of text. One way to improve might be to have our human labellers generate different hypotheses for what different breeds look like, then train a model to imitate this hypothesis generation.
Again, this assumes that we can generate the right set of hypotheses for the model to learn how to imitate. By the same reasoning as before (points 1 and 2), I am skeptical this can actually happen, especially in tasks that are not already solved. In the comments of that same post, @Richard_Ngo points out:
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements—this is the main reason that we need science.
Which I definitely agree with. Nevertheless, precisely because of what I argue in this post, science tells us that AIs cannot help in these scenarios either. Building good priors necessarily means taking all relevant variables into account and specifying their initial conditions to the best extent possible.
Later in the comment by the same author:
z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can’t [be] calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.
Even if I mainly agree with this one, I’d say that, in principle, these complicated interactions between components of z might be unknowable (as before, because of 1 and 2).
As another relevant example, imagine that we are trying to train a model to explain its predictions. Then, a reason why we can get a misaligned model is that:
true explanations might be less convincing than false explanations sometimes if the latter seem more plausible due to incomplete data, or human biases and mistakes.
Which is particularly relevant in this case, given that the true explanation of why a given forecast is inaccurate might entail an ever-increasing level of complexity. Namely, it may well be that we did not specify our initial conditions with enough accuracy, that we did not include all relevant variables for optimization, or both.
Finally, in this post (in the context of comparing human and AI’s world models) the author states:
Of course in practice the human may make errors and will have cognitive limitations. But if we use the kinds of techniques discussed in Teaching ML to Answer Questions Honestly, we could hope to learn something like HumanAnswer instead of the human’s approximation to it.
I argue we should be more pessimistic: in real-world scenarios, where there are multiple interacting elements and non-linear dynamics in place, there are no a priori reasons to believe that a simple world model (or even a loss function) can be useful to predict anything after a short time horizon. And this time horizon shrinks as we include more complex dynamics in the mix.
Conclusion
I would formulate this problem as follows: given the inherent unpredictability of complex systems and the difficulty in specifying comprehensive loss functions, it seems practically impossible to design an AI that doesn’t eventually game our goals. The combination of finite predictability, imperfect initial conditions, and simplified loss functions means that not all relevant factors can be accounted for, making it likely that AIs will exploit unintended pathways to achieve the goals we have stated. Thus, I believe that this is another challenge to be added to the outer alignment framework.
Hopefully by factoring in these potential challenges, we can, as a field, figure out a way to address them and get a bit closer to achieving safer AI systems.
[1] I think it would be interesting to explore the link between some of these ideas and the ones explored in this other post.
As I understand it, the argument above doesn’t account for the agent using the best information available at the time (in the future, relative to its goal specification).
I think there is some confusion around a key point. For alignment, do we need to define what an agent will do in all future scenarios? It depends what you mean.
In some sense, no, because in the future, the agent will have information we don’t have now.
In some sense, yes, because we want to know (to some degree) how the agent will act with future (unknown) information. Put another way, we want to guarantee that certain properties hold about its actions.
Let’s say we define an aligned agent as one doing what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent’s decision rule in a way that doesn’t rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity). See also the adjacent comment about a thermostat by eggsyntax.
Note: I’m saying “decision rule” intentionally, because even an individual human does not have a well-defined utility function.
This makes intuitive sense to me! However, for concreteness, I’d push back with an example and some questions.
Let’s assume that we want to train an AI system that autonomously operates in the financial market. Arguably, a good objective for this agent is to maximize profits. However, due to the chaotic nature of financial markets and the unpredictability of initial conditions, the agent might develop strategies that lead to unintended and potentially harmful behaviours.
Questions:
Would the short-term strategy be useful in this case? I don’t think it would, because of the strong coupling between actors in the market.
If we were to use the definition of “doing what we would want, provided that we were in its shoes”, I’d argue this agent would basically be incapable of operating, because we do not have examples in which humans can factor in so much potentially relevant information to make up their minds.
Hmm, I think my argument also applies to this case, because the “best information available at the time” might not be enough (e.g., because we cannot know whether there are missing variables, lack of precision in the initial conditions, etc). I’d say the only case in which this is good enough is when the course of action is within the forecastable horizon. But, in that case, all long-term goals have to be splittable into much smaller pieces, which is something I am honestly not sure can be done.
I’d be interested in hearing why these expectations might not be well calibrated, ofc!
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
If one is a consequentialist (of some flavor), one can still construct a “desirability tree” over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don’t think that a more complex universe intrinsically has anything to do with alignment per se.
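For concreteness, the “desirability tree” mentioned above could be sketched as a plain expected-value recursion. This is one possible reading, not necessarily what the commenter had in mind; the tree layout, action names and numbers are hypothetical.

```python
def expected_value(node):
    """Expected desirability of a tree node.

    A leaf is a number (the desirability of a terminal state).
    An internal node is a list of (probability, child) pairs.
    This representation is an assumption made for the sketch.
    """
    if isinstance(node, (int, float)):
        return float(node)
    return sum(p * expected_value(child) for p, child in node)


# Hypothetical choice between two actions.
tree = {
    "act_a": [(0.5, 10.0), (0.5, [(0.5, 0.0), (0.5, 4.0)])],
    "act_b": [(1.0, 5.0)],
}

values = {action: expected_value(subtree) for action, subtree in tree.items()}
best = max(values, key=values.get)
print(values, "->", best)
```

Note that the branch probabilities are inputs to this algorithm. The sketch says nothing about how accurately they can be estimated for states beyond the predictability horizon.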
Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don’t think this has anything to do with a harder to predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
(If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don’t think this is central to the argument here.)
If humans don’t really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics—perhaps not even in principle!), we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
With outer alignment I was referring to: “providing well-specified rewards” (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what’s relevant to predict the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, it cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting on timescales longer than the predictable horizon. I think this reasoning follows from the basic premise that you mentioned (“one can construct a desirability tree over various possible future states”).
Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what’s it good for?
I believe it may have been a mistake on my side: I assumed that the definition I was using for outer alignment was the standard/default one! I think this would match goal misspecification, yes! (And my working definition, as stated above.)
Completely agreed!
On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913
Want to try out a thought experiment? Put that same particular human (who wanted to specify goals for an agent) in the financial scenario you mention. Then ask: how well would they do? Compare the quality of how the person would act versus how well the agent might act.
This raises related questions:
If the human doesn’t know what they would want, it doesn’t seem fair to blame the problem on alignment failure. In such a case, the problem would be a person’s lack of clarity.
Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between “what the human would have done” versus “what the AI agent would have done” may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
Hmm, I see what you mean. However, that person’s lack of clarity would in fact also be called “bad prediction”, which is something I’m trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don’t call it “misaligned behaviour” is because we’re assuming that people do not (usually) act according to an explicitly stated reward function!
What do you think?
Thanks for this pointer!
[Pasting in some of my responses from elsewhere in case they serve as a useful seed for discussion]
I would say that 3 is one of the core problems in reinforcement learning, but doesn’t especially depend on 1 and 2.
How about a thermostat? It’s an awfully simple AI (and not usually created via RL, though it could be), but I can specify my goals in a way that won’t be gamed, without having to know all the future temperatures of the room, because I can cleanly describe the things I care about.
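A minimal sketch of the thermostat described above, assuming a simple on/off rule with a hysteresis band. The room model, heating rate and random heat losses are hypothetical values picked for illustration. The decision rule only looks at the current temperature and never forecasts future ones.

```python
import random


def heater_should_run(temp, setpoint, heater_on, band=0.5):
    """Decision rule: depends only on the current reading."""
    if temp < setpoint - band:
        return True
    if temp > setpoint + band:
        return False
    return heater_on  # inside the band: keep doing what we were doing


def simulate(steps=200, setpoint=20.0, seed=0):
    """Toy room with heat losses the thermostat cannot predict."""
    rng = random.Random(seed)
    temp, heater_on, history = 15.0, False, []
    for _ in range(steps):
        heater_on = heater_should_run(temp, setpoint, heater_on)
        heat_loss = rng.uniform(0.1, 0.3)  # unknown disturbance
        temp += (0.5 if heater_on else 0.0) - heat_loss
        history.append(temp)
    return history


history = simulate()
print("min/max over the last 100 steps:", min(history[-100:]), max(history[-100:]))
```

In this toy, the temperature settles near the setpoint even though the disturbances are random, because the goal is stated directly in terms of a quantity the controller can measure at every step.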
To be clear, I definitely think that as loss functions become more complex, and become less perfect proxies for what you really want, you absolutely start running into problems with specification gaming and goal misgeneralization. My goal with the thermostat example is just to point out that that isn’t (as far as I can see) because of a fundamental limit in how precisely you can predict the future.
Alejandro: Fair point! I think I should’ve specified that I meant this line of reasoning as a counterargument to advanced general-purpose AIs, mainly.
I suspect the same argument holds there, that the problems with 3 aren’t based on 1 and 2, although it’s harder to demonstrate with more advanced systems. Here’s maybe one view on it: suppose for a moment that we could perfectly forecast the behavior of physical systems into the future (with the caveat that we couldn’t use that to perfectly predict an AI’s behavior, since otherwise we’ve assumed the problem away). I claim that we would still have the same kinds of risks from advanced RL-based AI that we have now, because we don’t have a reliable way to clearly specify our complete preferences and have the AI correctly internalize them. (Unless the caveated point is exactly what you’re trying to get at, but I don’t think that anyone out there would say that advanced AI is safe because we can perfectly predict the physical systems that include them, since in practice we can’t even remotely do that.)
I think there was a gap in my reasoning, so let me put it this way: as you said, only when you can cleanly describe the things you care about can you design a system that doesn’t game your goals (thermostat). However, my reasoning suggests that one way in which you may not be able to cleanly describe the things you care about (predictive variables) is due to the inaccuracy attribution degeneracy that I mention in the post. In other words, you don’t (and possibly can’t) know whether the variable you’re interested in predicting isn’t being accurately forecasted because of a lack of relevant things being specified (the most common case) or because of misspecified initial conditions of all the relevant variables.
I partially agree: I’d say that, in that hypothetical case, you’ve solved one layer of complexity and this other one you’re mentioning still remains! I don’t claim that solving the issues raised by chaotic unpredictability solves goal gaming, but I do claim that without solving the former you cannot solve the latter (i.e., solving chaos is a necessary but not sufficient condition).