Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Multi-Armed Bandits Considered Harmful

People frequently analyze the process of artificial agents gathering knowledge in the framework of explore/exploit strategies for multi-armed bandits. However, a multi-armed bandit is a simplistic black-box abstraction – the possible rewards from pulling each arm have no underlying logic: by definition they’re unknown and unknowable other than by repeatedly sampling them. Treating a learning experience like a multi-armed bandit on which the best you can do is explore/exploit is an extremely simple strategy, implementable by even a basic reinforcement learning agent — but it’s an extremely bad strategy in the presence of possible outcomes that either permanently end your opportunity to learn or produce very large negative rewards. That’s why people who train current reinforcement learning systems on tasks like Atari games, or anything else that involves instant-death situations like lava pits, normally give them unlimited lives. Any real-world agent with only one life that approaches a novel environment containing potential death traps or fatal mistakes using an explore/exploit strategy is soon going to wind up destroyed.
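
To make this concrete, here is a minimal toy simulation (entirely my own construction, with made-up parameters) of an epsilon-greedy explore/exploit agent facing a bandit where one arm is fatal and the agent has only one life. It typically dies within its first dozen pulls, which is the point of the paragraph above:

```python
# Toy sketch: an epsilon-greedy agent with a single life in an environment where one
# "arm" (action) is fatal. All numbers are illustrative placeholders.
import random

def lifetime_of_epsilon_greedy(n_arms=10, lethal_arm=0, epsilon=0.1, max_steps=10_000):
    """Return how many steps the agent survives before pulling the lethal arm."""
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for step in range(1, max_steps + 1):
        if step <= n_arms or random.random() < epsilon:
            arm = random.randrange(n_arms)                         # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        if arm == lethal_arm:
            return step                                            # one life: game over
        reward = random.gauss(1.0, 0.5)                            # harmless arms pay roughly +1
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running average
    return max_steps

lifetimes = sorted(lifetime_of_epsilon_greedy() for _ in range(1000))
print("median steps survived:", lifetimes[len(lifetimes) // 2])
```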

The real world is (almost always) not as cruel and unknowable as a multi-armed bandit. For example, in the real world, lava pits radiate large amounts of heat, glow red, and produce odd smells. Even an agent that has never encountered lava before and knows nothing about it, but has encountered, say, campfires, or candles, can observe “That seems similar to a much more intense version of a fire. I once burnt one of my manipulators in a fire—it hurt, and I needed to have it repaired. I don’t know what this is, but it seems like a plausible hypothesis that it might be something like a large, intense fire. If so, it could well be very dangerous. So I should stay well away from it. If for some reason it would be worth testing this hypothesis, I should find a safe way to test it. For example, I could go find a twig, bring it here, throw it into this thing from a hopefully-safe distance, and see what happens to the twig — if this thing is like a fire, it will likely set the twig on fire.” This is clearly a far better strategy for handling novel lava pits than a naive explore/exploit strategy of “I shall jump into it several times and see what happens, enough times to obtain a decent statistical sample of how lethal it is”.

General thinking on AI risk is that the most obvious danger is from agents that use goal-oriented consequentialism, do some finite-resource approximation of Solomonoff induction, have capabilities at least comparable to human in most respects and superior in at least some, and have experience in an environment that bears some resemblance to the real world. Anything that has those four properties is obviously not going to be still using a naive explore/exploit strategy and treating the real world like a multi-armed bandit (and if it were, my money would be on the humans in any ensuing conflict). So using analysis based on multi-armed bandits or naive explore/exploit when considering how a GAI might act is invalid — it’s about as helpful as using models of insect behavior to try to predict what a human scientist would do.

The Optimizer’s Curse, and How to Break It

For an AI design that does an approximation of Solomonoff induction, like AIXI or a value learner, it’s necessary for the agent to attempt to optimize its actions without (yet) having a function that accurately maps actions to their utility. For AIXI, because it cannot entirely predict what state of the universe an action will produce, it has something like a current weighted ensemble of hypotheses that each predict a probability distribution over resulting states; for a value learner, this is further composited with a current weighted ensemble of hypotheses about the value function that it’s trying to learn, which maps states of the universe to utilities. Either way, the net result is that each action has a currently hypothesized probability distribution of utility values.
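
As a concrete (and entirely toy) illustration of what “composited” means here, the following sketch builds the utility distribution for a single action from a weighted ensemble of world-model hypotheses and a weighted ensemble of value-function hypotheses. All hypothesis names, weights and numbers are invented placeholders:

```python
# Toy sketch: for one action, combine (weight, P(state | action)) world-model hypotheses
# with (weight, utility-of-state) value hypotheses into a distribution over utilities.
from itertools import product

world_hypotheses = [          # (weight, {state: probability of reaching it})
    (0.7, {"stateA": 0.9, "stateB": 0.1}),
    (0.3, {"stateA": 0.2, "stateB": 0.8}),
]
value_hypotheses = [          # (weight, {state: utility under this value hypothesis})
    (0.6, {"stateA": 1.0, "stateB": -5.0}),
    (0.4, {"stateA": 2.0, "stateB": 10.0}),
]

def utility_distribution(world_hyps, value_hyps):
    """Return {utility: probability} for the single action described above."""
    dist = {}
    for (w_world, state_probs), (w_value, utils) in product(world_hyps, value_hyps):
        for state, p in state_probs.items():
            u = utils[state]
            dist[u] = dist.get(u, 0.0) + w_world * w_value * p
    return dist

print(utility_distribution(world_hypotheses, value_hypotheses))
# The action is described by a whole distribution over utilities, not a single number.
```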

An algorithm I have frequently seen proposed, or at least assumed, is to treat this distribution of utility values as if it were simply its average, giving you a current estimated average utility function, averaged over both current hypothesis uncertainty and any stochasticity inherent to individual hypotheses, and optimize that. This algorithm is a poor strategy, and suffers from what is often called the “optimizer’s curse”. If you have a large number of possible actions (or possible goals for a planned sequence of actions) and optimize using this strategy, you will usually find that the action/goal you select is not the true optimum, but is instead very likely to be a somewhat worse action whose true utility you had overestimated due to hypothesis errors. This is the result of what statisticians call the “look elsewhere effect” — if you draw a large number of samples from random distributions, you will sooner or later encounter some really extreme samples. (In real life, this is exacerbated by the fact that even distributions that are generally well modeled by a normal distribution may have rare extremal instances that are drawn from some other, often fatter-tailed, distribution: very occasionally, your experiment gets hit by a cosmic ray, or a truck, or is stolen, or otherwise misbehaves.) If at least portions of your current ensemble of hypotheses about how the world works (and for a value learner, also your current ensemble of hypotheses about the true utility function on it) bear any resemblance to reality, this behavior is often fairly predictable, if you look at the full distribution rather than just its average. The distribution contains more information than just its average, and some of this information is useful for anticipating likely mistakes.
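
The optimizer’s curse is easy to see in simulation. The following sketch (mine, with arbitrary numbers) draws many candidate actions, adds independent “hypothesis error” to each true utility to get the agent’s estimates, and then checks how disappointed the agent is in the action its mean-estimates picked as best:

```python
# Toy demonstration of the optimizer's curse / look-elsewhere effect.
import random

random.seed(0)
n_actions, trials = 10_000, 200
disappointments = []
for _ in range(trials):
    true_utils = [random.gauss(0.0, 1.0) for _ in range(n_actions)]
    estimates = [u + random.gauss(0.0, 1.0) for u in true_utils]   # hypothesis error
    best = max(range(n_actions), key=lambda i: estimates[i])       # optimize the estimate
    disappointments.append(estimates[best] - true_utils[best])
print("average overestimate of the 'winning' action:",
      sum(disappointments) / trials)   # reliably positive: the winner was oversold
```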

For example, consider two actions, A and B. I am almost 100% certain that A has a utility of +1 — I believe I have a firm conceptual understanding of why, the concepts are very well tested, I’ve done things I believe to be comparable many times before, and their payoff was always very predictable. Whereas for B I estimate an ~1% probability of truth for a hypothesis that states that it’s lethal, with a utility of −990,000,000, and I’m ~99% sure of a hypothesis that states that it’ll be marvelous, with a utility of +10,000,002, so its average utility is ~1.98. Should I do A, or B? If I just look at the average utility and optimize that, a score of ~2 for B beats ~1 for A, so I do B. But any agent that handles exploring the universe in a way smarter than treating it as a multi-armed bandit is going to tell you that B is a huge gamble, with enormous variance on its utility, and that its model of the likelihood of hypotheses would only have to be off by a couple of parts in 10^9 to flip the sign on whether or not it’s a good idea. So a far, far better strategy is to do A, take the safe reliable payoff, note that B had a potentially very large positive or negative payoff, and consider whether you can find some way to get (potentially very valuable) better information about the uncertainty between the hypotheses that model the actual payoff of B, by some means that involves taking a much smaller risk than the explore/exploit option of just trying B and seeing what happens as if the universe were just a multi-armed bandit. Note that if you were instead 100% certain of a hypothesis that says that action B has a 1% probability of producing a state with utility −990,000,000 and a 99% chance of producing a state with utility +10,000,002, and you are certain both of the exactness of these numbers and probabilities and that there is no way to alter them even slightly — i.e. the stochasticity is true quantum randomness, not an approximate way of describing ignorance about prior conditions — then no further experimentation is suggested: the gamble can’t be improved, all you can do is take it, or not.
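
A quick check of the arithmetic in this example (using the same numbers as above) shows why naive mean-optimization picks B, and how tiny a probability error it takes to flip the answer:

```python
# The A-vs-B example in numbers.
p_lethal = 0.01
u_lethal, u_marvelous = -990_000_000, 10_000_002
mean_B = p_lethal * u_lethal + (1 - p_lethal) * u_marvelous
print("mean utility of B:", round(mean_B, 2))            # ~ +1.98, which beats A's +1
# How much would the ~1% lethality estimate have to be off to make B's mean negative?
print("probability shift that flips the sign:", mean_B / (u_marvelous - u_lethal))  # ~ 2e-9
```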

So, the distribution of utilities for an action contains a lot more information than just the mean, and there are better optimization strategies available than just optimizing its mean. If you make the assumption that the agent’s estimates for the distribution of utilities are at least approximately correct (if not, the agent is deeply confused, and finding even a survival strategy is going to be mostly a matter of being lucky enough to survive until the approximately Bayesian updates of the agent’s approximately Solomonoff induction make the assumption true again), then optimizing the hypothesized mean is basically a strategy of sequentially finding all the cases where the agent had in fact overestimated the true utility, trying them one after another, and being repeatedly disappointed until its hypotheses get corrected enough to fix all the problems. I.e. optimizing the mean of the utility distribution is just an all-explore strategy for the hypothesis space, and is just as dumb for an intelligent agent as an all-explore strategy would be in any other context. That’s the optimizer’s curse, and any GAI agent should be using a better strategy (assuming one exists).

So, what would a better strategy look like? Not necessarily a perfect one, but at least one that didn’t generally suffer from the optimizer’s curse? Consider the following straw-man proposal:

  1. Look at the set of actions you are optimizing over, or if you’re picking a goal for a series of actions (and are not going to have useful opportunities to learn more and reassess as you move along the path to that goal), the set of possible goals.

  2. Look at the weighted ensemble of hypotheses, or whatever, that you are using to produce a distribution of possible relative utilities for that set of actions/goals, and analyze the underlying hypothesis uncertainties in it — in the sense of ‘I’m uncertain whether to use hypothesis X or Y’ but NOT in the sense of ‘hypothesis Y suggests that the stochastic behavior of the universe will produce this distribution of possible world states/utilities, due to true randomness rather than to lazily modeling potentially-remediable ignorance about the world as stochastic behavior’ (unless hypothesis Y is actually a set of hypotheses with some free parameters whose values you have uncertainty over, in which case the uncertainty over the values of the free parameters counts as hypothesis-uncertainty, but within-hypothesis-with-fixed-parameters true randomness does not). I think the actual criterion for whether uncertainty counts is “Do I believe (taking my capacity limitations into account) that I could ever do a Bayesian update on my hypothesis-set that would reduce this uncertainty and make it become more predictable in future situations like this?” Approximately decompose this relative-utility hypothesis-uncertainty into a space of some number of independent variables.

  3. Some of these variables will be more important (in the sense of ‘varying the result of the optimization’) than others. Approximately estimate the effective number of important independent uncertainty variables (I believe the statistics for this will look like the usual formula for computing the uncertainty of the sum of many separate approximately-independent uncertainties, i.e. in the normal-distribution approximation variances add, but please consult a real statistician.)

  4. Find a workable heuristic to throw in some extra uncertainty for “I’m using a computationally-limited approximation to Solomonoff induction, and as a result the truth might be stranger or more complex than any hypothesis that I’m currently considering, in a way that matters” – an informed heuristic for this might include things like how long it’s been since your approximation to Solomonoff induction last made a sequence of surprising Bayesian updates that looked like it was recovering from some bad behavior caused by this computationally-limited approximation (one might call this “time since last paradigm shift”), in any area with any resemblance or relevance to the matters involved in this decision.

  5. Call this number C, for caution factor.

  6. For each state/​goal that you are optimizing across, find the 1-in-(C+2) pessimistic lower bound of its utility (e.g. if you have absolutely no uncertainty, so C is 0, that’s the 1-in-2 lower bound, i.e. the median). [I strongly suspect this should in fact be the mean not the median, but I’m unsure how to interpolate in a parametrizable way between the mean and a lower confidence bound — this straw-man proposal can almost certainly be improved.]

  7. Optimize that function. (This has the net effect of penalizing options with higher hypothesis uncertainty, with a strength proportional to the hypothesis uncertainty for the set of options you are optimizing over, specifically the downside uncertainty around a plausible distance out into the tail. A toy code sketch of steps 3–7 follows this list.)
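
Here is a minimal code sketch of steps 3–7 above (all function names, numbers, and the Monte Carlo representation of hypothesis-uncertainty are my own placeholders, not a worked-out proposal):

```python
# Toy sketch of the straw-man procedure: represent each candidate goal's utility under
# hypothesis-uncertainty by samples, pick a caution factor C, optimize a pessimistic quantile.
import random

def pessimistic_quantile(samples, C):
    """The 1-in-(C+2) lower bound of the sampled utility distribution (step 6)."""
    q = 1.0 / (C + 2)              # C = 0 gives the median; large C goes far into the tail
    s = sorted(samples)
    return s[max(0, int(q * len(s)) - 1)]

def choose_goal(goal_samples, effective_uncertainties, extra_caution=1.0):
    """Steps 3-7: C is the effective number of independent hypothesis-uncertainties,
    padded by a heuristic for 'the truth may be stranger than my current hypotheses'."""
    C = effective_uncertainties * extra_caution
    scores = {goal: pessimistic_quantile(s, C) for goal, s in goal_samples.items()}
    return max(scores, key=scores.get), scores

# Toy usage: 'safe' is well understood; 'gamble' has huge hypothesis-variance.
random.seed(1)
goal_samples = {
    "safe":   [random.gauss(1.0, 0.01) for _ in range(100_000)],
    "gamble": [random.gauss(2.0, 50.0) for _ in range(100_000)],
}
print(choose_goal(goal_samples, effective_uncertainties=100))   # picks "safe"
```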

So, for example, if you are optimizing across 1,000,000 different possible goals (not counting ones so clearly bad that all of your plausible hypotheses agree they have no chance of winning), and there is no more inherent structure to your uncertainty about their relative utility than them all being independent variables with different means and standard deviations (this is a rather extreme situation, almost as implausible as a multi-armed bandit, at least until you get well outside the set of previously-observed states of the universe into regions where previously-unobserved effects could plausibly kick in at any point — generally you’d expect a significantly smaller hypothesis-uncertainty structure), then that gives ~1,000,000 at step 2. Depending on how widely their means and standard deviations vary, some of these may dominate, reducing the number at step 3, but if they don’t (and if step 4 doesn’t throw in much more uncertainty — we already have a great deal), then by step 5 you still have a number ~1,000,000, so in step 6 you would be optimizing the 99.9999% lower confidence bound of your estimate of their utility.

Returning to our two actions: A, with a reliable payoff of +1, and B, with an ~1% probability of −990,000,000 and an ~99% probability of +10,000,002, where the variation is due to uncertainty between hypotheses, not something I believe to be true randomness. The 99% (or better) lower confidence bound on the utility of B is −990,000,000, so the optimizer’s-curse-avoiding strategy above will (very definitively) pick A over B, even if C were only 100.
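
Checking that claim directly with the 1-in-(C+2) rule (again, a toy calculation with the same numbers):

```python
# With C = 100, the agent evaluates the 1-in-102 lower bound of each option's utility.
C = 100
q = 1.0 / (C + 2)                                      # ~ 0.0098
bound_A = 1.0                                          # A is (nearly) certain: every quantile is +1
bound_B = -990_000_000 if q <= 0.01 else 10_000_002    # B's lethal branch has ~1% probability
print("A:", bound_A, " B:", bound_B)                   # the cautious optimizer picks A
```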

That proposed strategy was a bit hand-wavy in places — I’m quite certain a professional statistician could flesh it out into a better one (and any statistician reading this, I’d love to hear you do so). My point here is not to attempt to exactly define the optimal strategy for beating the optimizer’s curse, but to point out that it’s possible to do significantly better than just optimizing the mean of the distribution, and that a better strategy is going to involve focusing on the (likely fat-tailed) downsides of the distributions of possible utilities of states caused by hypothesis-uncertainty. Or, to put it in more human-recognizable terms: when doing a complex planning task choosing between many options, where you don’t have leeway for changing your mind part-way through the plan, avoiding disappointments caused by overconfidence in the accuracy of your beliefs requires cautiously paying a lot of attention to even small possibilities that you could be being overoptimistic about something relevant to the planning process. To human ears, that sounds like an obvious truism about caution and intellectual humility — and it should to any rational GAI too, especially one capable enough to be able to produce a risk of an AI apocalypse.

The proposed strategy, or any better version a statistician may propose, is, obviously, significantly more computationally expensive than just optimizing the mean of the utility distribution — just as thinking like a scientist is much more cognitively expensive than blindly acting like an explore/exploit learner. It also puts more stress on your resource-limited approximation to Solomonoff induction — if you’re frequently interested in even 99.99% lower bounds on confidence across currently-plausible hypotheses, you should probably be keeping an ensemble of currently-most-plausible hypotheses about how the universe works down to ones with a 0.01% current plausibility (or maybe fewer hypotheses, each with some number of free parameters that you also have plausible distributions over, for example). Most humans fail to do this – but then, we’re an evolved intelligence that’s only a quarter-million years past the “cultural knowledge transmission plus capability of generalization far out of prior distribution” threshold that marks the beginning of homo sapiens — let’s call that ‘sapience’, for lack of a better term. At least on evolutionary timescales humans only just became sapient, so it’s unsurprising if we’re not very good at it, especially individually rather than in small groups. Humans generally solve the task of keeping track of many hypotheses in parallel by storing different ensembles of hypotheses in the brains of different humans, and using means like science and debate to figure out which ones are more accurate. Nevertheless, most humans, if faced with a choice between a reliable payoff of +1 for A, or a highly uncertain payoff for B that some people claim is really good and others say is catastrophic, would tend to try to find out more about B and why there’s debate about it before trying it, not just blindly gamble on who was right.

So, when people think about GAI risks, could they please consider the possibility (I would argue the high probability) that the GAI will be smart enough to know about the optimizer’s curse, and to have some reasonably effective heuristics or strategies for trying to avoid it? Maybe not perfect ones, but at least ones better than ‘optimize the mean of the distribution’, because that strategy is dumb — it’s just ‘always explore the utility-upside mistakes in your hypothesis ensemble confidences’. Strategies to resist the optimizer’s curse exist, and even rational humans have them, so any plausible GAI should too. It might even have better-than-human strategies — humans don’t generally handle either small abstract risks or being mistaken very well.

The Overton-Window Analogy

The strategy described above has the net effect that, when optimizing, the agent penalizes goals unless it has a high confidence that they are good. The goals it considers aren’t necessarily restricted to states the agent has actually seen before (what one might call “staying inside the training distribution”), but it does require that the agent has not been able to think of any hypothesis (that isn’t much higher in Kolmogorov complexity than any of its currently favored hypotheses) even marginally consistent with its previous observations that would cause its predictions about the possible goal to be an overestimate. So they’re states close enough to the previously-observed distribution that the agent has a high conceptual confidence that it can correctly extrapolate to them. This is a high bar, at least when dealing with systems whose behavior is too complex to be simulated very accurately (say, anything much less predictable than Newtonian mechanics in a vacuum) — it’s very often possible to come up with a plausible hypothesis for why a new state any significant distance out-of-previous-distribution could enter a new regime where a previously-unobserved effect starts to become significant (what one might call “here-be-dragons” hypotheses). This is even more true for a value learner, since even if the mapping between its actions and the resulting physical state is something very predictable, the mapping from a physical state to a utility is always going to involve humans, who are very complex systems.

The result of this is analogous to the political phenomenon called “the Overton Window” — in politics, at any point in time there is usually only a certain range of options that are widely considered ‘reasonable’ and get any serious discussion, but this range can change as society learns more or the argument progresses. Similarly, any optimizer that has solved the optimizer’s curse is going to have only a certain range of states or goals that it seriously considers optimizing over, ignoring (or discounting heavily enough that they almost always lose) all the ones that it’s not yet sufficiently certain it can predict accurately. However, this window can shift — generally it will expand, as the agent gains more knowledge about hypotheses about the world by approximately Bayesian-updating its approximation to Solomonoff induction. In particular, the agent can identify possible goals outside the current “Overton window”-like optimization region that might be better, identify which uncertainties between hypotheses would need to be reduced to make the calculation accurate enough to expand its Overton-window-analog to cover that possible goal, think about cheap & safe ways to test that difference in hypotheses, and reason about whether the cost/risk of doing the experiment is worth the potential gain if the more optimistic hypothesis wins, multiplied by its current best guess of the chance of that happening, discounted appropriately for the fact that the decision to even think about performing this experiment was made in a way that does suffer from the optimizer’s curse — it could be caused by your having overestimated the likelihood of some hypothesis that is optimistic about this goal.
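
One way (my own guess at a formalization, not something from the original argument) to sketch that last value-of-information calculation in code:

```python
# Toy value-of-information check for an experiment that might expand the Overton-window-analog.
def worth_testing(p_optimistic, gain_if_optimistic, experiment_cost, curse_discount=0.5):
    """p_optimistic: current credence in the hypothesis under which the excluded goal is great.
    gain_if_optimistic: extra utility unlocked if that hypothesis survives the test.
    experiment_cost: cost/risk of the test itself (which must already be non-catastrophic).
    curse_discount: haircut because the decision to consider this experiment was itself
    selected by an optimizer's-curse-prone process."""
    expected_gain = curse_discount * p_optimistic * gain_if_optimistic
    return expected_gain > experiment_cost, expected_gain - experiment_cost

# e.g. a 10% credence in a +1,000 payoff, with a test costing 20:
print(worth_testing(0.10, 1_000, 20))   # (True, 30.0)
```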

GAIs and Existential Risks

Any GAI we build is likely capable of causing an existential risk, so it absolutely needs to be capable of thinking well about existential risks: possibilities with utilities so astronomically negative that if they happen even once, it’s all over. For a value learner, the existential risk is the human race going extinct (and the species not being resurrectable). Or, from the point of view of a paperclip maximizer, its own destruction is an existential risk — and so is inadvertently building a smarter successor that turns out to want to maximize thumbtacks instead. This argument is not subtle or inobvious: any GAI that didn’t already have the ability to handle existential risks well would be capable of deducing that it was unsafe as soon as the possibility of it encountering an existential risk occurred to it — so if we accidentally built a paperclip maximizer that wasn’t capable of thinking well about things that it regards as existential risks, it would almost certainly quickly self-modify to remedy this failing.

By definition, an existential risk is one whose utility is so astronomically negative that avoiding even a minute chance of causing it is going to dominate any calculation involving more normal amounts of utility. So if a GAI is attempting to avoid the optimizer’s curse, it needs to consider a wide range of hypotheses about the world, even ones it currently thinks are quite unlikely, and its optimization of a lower confidence bound on utility will then avoid any action that any of those hypotheses (even the really unlikely ones) predicts has even a really small probability of causing/exacerbating an existential risk of an astronomically huge negative utility. Putting that in human terms, it needs to be intellectually humble, consider many alternatives including the possibility that it could be mistaken about how things work, and be duly cautious about anything that could possibly cause or exacerbate any existential risk. That is going to put any such actions outside the current Overton-window-analog optimization region — if they otherwise look very promising, then safely investigating and eliminating that uncertainty without taking any existential risk would become a high priority.

Why Humans are so Bad at This

It is rather noticeable that human society is NOT currently acting in the way that I am suggesting any rational GAI value learner would act in the face of an existential risk. Quite a few famous, very smart, technically capable people (e.g. Stephen Hawking, Elon Musk) and some experts on existential risks (e.g. Nick Bostrom) consider an AI apocalypse to be a significant existential risk, which should give anyone at least a small probability for the hypothesis that they’re right — and an AI apocalypse would be far, far worse than the cost of not having GAI, which would merely delay us in, say, curing cancer, cleaning up the environment, accelerating our economic growth rate, or becoming a multi-planetary species. Yet we are still not halting or restricting all AI research other than research on solving the alignment problem, or even putting that much effort into solving the alignment problem. So, are humans (members of a species barely evolved past the sapience threshold) not very rational about existential risks (something we’ve only really been facing for about 70 years, since the development of thermonuclear weapons)?

Humans are famously bad at handling small risks of large dangers, especially abstract ones — we tend to either consistently underestimate them, or, if something about them triggers our irrational fears, consistently overestimate them. We act as if we’re unable to do the mental math to estimate a small risk and a large negative utility and then multiply the two together — sometimes rather as if we weren’t even doing a multiplication in whatever process we’re using. Why is that? It doesn’t sound like a very good design for something that evolved to maximize genetic fitness by surviving the African savanna — an environment with plenty of situations carrying at least small probabilities of “you and the rest of your hunting party have been eaten by lions”-sized disasters.

The basic problem here is that human rationality is an add-on that evolution discovered while bolting it on top of a monkey. In their natural environment, monkeys face a very similar set of risks to the ones humans do in theirs: being eaten by predators, bitten by snakes or spiders, killed by a fall, dying of illness, starving or dying of thirst, being killed in a fight with a rival, and so forth. Monkeys already have a whole range of individually-tailored instinctual heuristics encoding pretty-good survival strategies for assessing and avoiding each one of these risks, and even weighing them against each other. So the rational abilities of the human brain were not evolved to do risk analysis from scratch: they were evolved as an add-on, to add tweaks on top of a set of instincts that were already doing a pretty good job of basic risk analysis. So the human brain’s rationality is good at stuff like “I’m afraid of lions (that aren’t obviously still digesting their last meal), even a solitary one — but, scared though I am, that solitary hungry-looking lion visibly has cataracts, all six of us are downwind of it, and we’re all carrying flint-tipped spears, so I will be brave and overrule my fear”. It’s not good with inobvious novel abstract dangers that we don’t have instincts for handling — our ancestors already had instincts for all the significant risks that existed in the environment they evolved in.

In evolutionary terms, the great majority of the risks humans have faced since they became smart apes haven’t been significantly more difficult than the ones monkeys faced. There is one exception to that, which is violence from other humans — in most contemporary hunter-gatherer societies that anthropologists have studied, that’s astonishingly common: a few percent of all adult deaths, i.e. orders of magnitude higher than the murder rate in any modern society. Assuming that that was true of most hunter-gatherer societies during most of our recent evolution (and isn’t an artifact of the few remaining hunter-gatherer societies that anthropologists can study mostly being under strong resource competition from surrounding agriculturalists), then we might expect that human rationality has been under fairly strong evolutionary pressure over the last quarter-million years to get better at dealing with rivals as smart as a human, so a good deal smarter than any rival a monkey had to deal with, i.e. ones now capable of doing things like using poison or pit traps or sabotage or spreading slander. So it wouldn’t be very surprising if humans (even in a more peaceful contemporary environment) had a tendency to think a lot about conspiracy theories, to read murder mysteries and watch action movies, and if one of their common failure modes during mental illness was paranoia — a thought mode that looks a lot less irrational in an environment where you have a several-percent chance of dying by violence at the hands of a rival at least as smart and devious as you.

As someone who takes the AI apocalypse seriously, and is capable of mentally multiplying an uncertain but non-negligible risk of it by a rough order-of-magnitude estimate of its astronomically negative utility, I wonder if we can get the rest of humanity to switch from underestimating this risk to overestimating it (or, ideally, maybe even thinking rationally about it). That would likely require switching their view of it to start to trigger one or more of their instinctual fears — the one about murderous intelligent enemies seems the most relevant. The idea that robots might turn murderous and try to take over dates back centuries (though of course in the stories they always fail), and the idea that some areas of technology are too risky and uncertain to be allowed without strict controls and oversight is not exactly outside the Overton window (in the original political sense of the phrase).

Consequences for Value Learning

So far, this observation is just as applicable to an AIXI-style paperclip maximizer as to any other GAI that is a goal-oriented consequentialist doing an approximation of Solomonoff induction. But it also has some predictable consequences for value learners — anything that is attempting to simultaneously learn and optimize a human utility function while also learning about the universe.

If we assume the widely-believed propositions that “human utility is complex — it has a high Kolmogorov complexity”, and “it is also fragile — using a hypothesis about its structure that is even a little bit mistaken can cause large errors in your utility calculation in at least some portions of the space of all possible states of the universe”, then that puts very high stakes on the optimizer’s curse. Some of those large errors are likely to be overestimates, so if a value learner is using a decision process that suffers from the optimizer’s curse, it’s going to keep making those mistakes and steering for goals that it mistakenly thinks are great, and even if, once it reaches each one, it realizes that it was wrong and Bayesian-updates its hypotheses, it’s just going to keep doing this until it has explored every mistake. For a target with a high Kolmogorov complexity, that will be a lot of disappointments.

The solution to this is that any value learner shouldn’t be using an optimization system that suffers from the optimizer’s curse — particularly for handling any situation with existential-risk-sized potential downsides. So, it should be considering many different hypotheses about what the human utility function is (i.e. what humans meant when they said “do what we collectively want, and take that instruction in the sense that we collectively want it taken, without us being able to precisely define what that means”), even ones it considers really unlikely. That means that it can almost never trust its utility estimates to go outside the Overton-window-analog, without first doing very careful investigations and research to extend the Overton-window-analog before actually trying this plan on any scale.

That means that, at least every time the agent wants to expand the Overton-window-analog (the region in which it has enough experience to establish, to a very high confidence level, that its hypotheses about human utility give very accurate answers), it is going to have to be corrigible: it’s going to have to ask humans what they think. Even if it’s a lot smarter than humans, humans are the best available source on their likes and dislikes, and there is at least a small chance that talking to them will give it some useful insight. Also, it needs to be aware that humans don’t have as much computational power as it does, so they are unlikely to be able to predict all the consequences of something, or how they would actually feel about those consequences if they had to live with them.

In order to be safely corrigible, you need a theory of human mistakes — a hypothesis for how to judge how much weight to put on any particular complaint, correction, agreement, disagreement or press of your shutdown button that you get from a human, depending on the entire context of what they knew, their physical and emotional state, and so forth. In fact, you need a weighted ensemble of these hypotheses, on which you’re performing a resource-limited approximation to Solomonoff induction. Each hypothesis about the theory of mistakes will produce a different way of weighting corrigibility input, which will make that corrigibility input cause different Bayesian updates on your Solomonoff induction of hypotheses about human values. So each individual hypothesis in the weighted ensemble of theories of human mistakes will soon produce a different weighted ensemble of hypotheses about human values. This may also to some extent be affected by hypotheses about how the real world responds to the agent’s actions — both it and humans are embedded agents in the same real world, so these hypothesis-spaces are not always going to be cleanly independent.

In addition, when humans originally told the agent “solve the alignment problem for us while acting as aligned as you can” or “do what we want, in a do-what-we-mean way”, or however they first stated the problem, it likely wasn’t well stated or expressed: there were ambiguities, perhaps including what we would currently consider to be philosophical/ethical ones. Likely when the agent asked for clarification, it was told “we’re not sure — we’re not smart enough to pose the problem well; please figure out how we should have posed it, or perhaps how we would have posed it if we’d been smarter and more knowledgeable but otherwise unchanged, or how we would wish it to have been posed once we knew the future results of different ways of doing it”, or it got inconsistent answers from different humans. So the agent also has to run Solomonoff induction on a weighted set of hypotheses about the actual definition of what value function it should be learning. Different hypotheses in this space (different definitions of the question) are clearly going to produce different weighted ensembles of hypotheses about the true world-state-to-utility answer to that question. Updates to this from corrections from humans obviously also depend on the theory of mistakes, and also vice-versa (the more philosophically subtle a theory of what humans really meant is, the harder it may be to get accurate human feedback on parts of it).

I believe the correct way to do approximate Solomonoff induction on this mutually-interdependent set of hypothesis-spaces, each of which affects how the others are updated, is to simply maintain a weighted ensemble of hypothesis-4-tuples of the form:

[hypothesis about how my actions lead to physical states,
 hypothesis about human mistakes,
 hypothesis about what the humans really meant when they instructed me to “do what we want done, in a do-what-we-mean way”,
 hypothesis about the human utility function for physical states, with utility defined according to the previous item]

(i.e. a complete and self-consistent world-view) and then to do approximately Bayesian updates on the set of candidate hypothesis-4-tuples in the normal way, using the various hypotheses in the hypothesis-4-tuples to determine how observed evidence such as corrigibility feedback is interpreted during these approximately Bayesian updates. So the weighting of a hypothesis about values paired with a hypothesis about human mistakes that says that humans are often confused or lie is going to be less affected by a human saying “that’s nonsense” than the same hypothesis paired with a hypothesis about human mistakes that says that all humans are always correct about human values.
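
A drastically simplified sketch of that interaction (reducing the 4-tuples to pairs of a mistake-model and a value hypothesis, with all weights and numbers invented for illustration):

```python
# Toy Bayesian update over (mistake-model, value-hypothesis) worldviews, where the
# mistake-model controls how informative a piece of human feedback is taken to be.
def bayes_update(worldviews, human_says_bad):
    """worldviews: list of (weight, p_feedback_reliable, p_state_is_bad_under_values)."""
    updated = []
    for weight, reliable, p_bad in worldviews:
        # Likelihood of hearing "that's bad": either the feedback is reliable and the state
        # really is bad under this value hypothesis, or the feedback is uninformative noise.
        likelihood = reliable * (p_bad if human_says_bad else 1 - p_bad) + (1 - reliable) * 0.5
        updated.append((weight * likelihood, reliable, p_bad))
    total = sum(w for w, _, _ in updated)
    return [(w / total, r, p) for w, r, p in updated]

worldviews = [
    (0.25, 0.9, 0.8),   # humans mostly reliable here, and this state is probably bad
    (0.25, 0.9, 0.1),   # humans mostly reliable here, and this state is probably fine
    (0.25, 0.2, 0.8),   # humans often confused here, state probably bad
    (0.25, 0.2, 0.1),   # humans often confused here, state probably fine
]
for w, r, p in bayes_update(worldviews, human_says_bad=True):
    print(f"weight {w:.2f}  p_reliable={r}  p_bad={p}")
# The "humans are reliable" worldviews shift much more in response to the feedback
# than the "humans are often confused" ones, as described above.
```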

If (under at least some hypotheses of the third type) there are further important relevant questions I haven’t thought of that the agent needs hypotheses for to be able to do Bayesian updates, then this might become a 5-tuple or more.

Obviously, when avoiding the optimizer’s curse, what matters is the uncertainty in the relative utility of the possible goals being optimized over, which normally will only be directly affected by uncertainties in the first and fourth hypotheses (unless the fourth hypothesis says that it matters how much deference you pay to recent specific instructions depending on your assessment of how likely they were to be mistaken, which seems rather plausible, and would then induce a direct dependence on the second hypothesis as well). However, the Solomonoff induction process will rapidly produce correlations in the weighted ensemble between all four types of hypothesis, so effectively they are all involved. Plans optimized over large spaces then need to have their utility evaluated on suitably pessimistic lower confidence bounds, in a way that depends directly or indirectly on the entire ensemble of hypothesis-tuple uncertainty. So if, for example, not-yet-resolved fundamental definitional interpretations, or different theories about human mistakes, have left the agent with a range of hypotheses about the human utility of a particular physical state, at least at say the 99.9% confidence level, then the lower confidence bound on any optimization with that many effective uncertainties has the effect that you are pretty much allowing any of those viewpoints to veto things it thinks are bad: you end up sticking to plans on which you have consensus.
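
A tiny sketch of that “consensus / veto” effect (again a toy of my own, with placeholder numbers):

```python
# Taking a low-enough lower confidence bound across a weighted ensemble of surviving
# worldviews means any non-negligible worldview that hates a plan effectively vetoes it.
def lower_bound_across_hypotheses(weighted_utilities, confidence=0.999):
    """weighted_utilities: list of (weight, utility of this plan under this worldview)."""
    tail = 1.0 - confidence
    cumulative = 0.0
    for weight, utility in sorted(weighted_utilities, key=lambda wu: wu[1]):
        cumulative += weight
        if cumulative >= tail:      # first utility whose cumulative weight reaches the tail
            return utility
    return max(u for _, u in weighted_utilities)

plan = [(0.95, 100.0), (0.04, 50.0), (0.01, -1e9)]     # one 1%-weight worldview hates it
print(lower_bound_across_hypotheses(plan))             # -1e9: that worldview vetoes the plan
```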

All this sounds like desirable behavior in a value learner: if we’re not yet certain whether the agent should be using coherent extrapolated volition or not, or exactly which definition of it if so, the agent should act cautiously and conservatively (in the sense of “stay near the distribution of previously-observed states that you have high confidence in the utilities of”), and avoid any actions that any of these not-as-yet-resolved philosophical alternatives suggest are bad until it is very sure this ambiguity has been resolved.

To make it possible for your approximate Solomonoff induction process to converge, it’s probably important to include in it a hypothesis-validity rule like the one used by the scientific method: don’t consider unfalsifiable hypotheses. If there are two different worldviews for which no conceivable observations/experiments/input from humans can change their relative weights in the approximate Solomonoff induction process, then if they make the same prediction about the human values of all physical states they’re isomorphic and should only be listed in the ensemble once, and if they make different predictions then we need to find some usefully principled way to select just one — say, minimum Kolmogorov complexity, or maximum corrigibility by humans, or some form of regularizer, or some combination of these. Note that for this problem to occur, both such theories must have zero corrigibility about the correct value for the area of disagreement, otherwise human input could have an effect on their relative weights; so a simple requirement that the degree of corrigibility always be non-zero (e.g. that it might asymptote to zero as the model becomes more sure of the truth, or more certain that in some areas humans cannot provide useful input, but can never actually reach zero) might be enough to solve it.

Also note that Kolmogorov complexity is not a fully-defined concept — how much the description of a particular structure can be compressed depends both on the computational abilities of the device searching for a shortest description and doing the decompression, and on the specific Turing-complete language that the length is being measured in. The latter can only vary things by at most an additive constant, since any Turing-complete language can be defined from any other one in a finite amount of space, but the additive constants involved can be fairly large. Minimizing Kolmogorov complexity for any specific choice of computation device and Turing-complete language imposes biases compared to any other choice, and these could to some extent be selected by a suitable choice of Turing-complete language. So we might, for example, attempt to use a Turing-complete language optimized for listing situations in which humans make mistakes, and less suitable for listing situations in which they can be trusted, so that high-corrigibility theories of human mistakes tend to have shorter Kolmogorov complexity than low-corrigibility ones. We could even make the null (zero-length) hypothesis be something simple but plausible like “humans’ tendency to make mistakes under different circumstances about questions of human values matches their tendency under the same circumstances to make mistakes about equally long questions involving logic puzzles” (or some similar subject where the truth or falsehood of answers can be determined objectively). So rather than defining a theory of human mistakes from scratch, define a theory of how human mistakes about values differ from their mistakes about some objective subject. That would roughly cover a lot of cases where you shouldn’t listen to a human (such as them being dead or drunk, or responding to leading questions); the most obvious significant effect it wouldn’t cover is intentional falsehoods for gain.


A postscript:

After writing this post, and then doing some more reading, I discovered “The optimizer’s curse and how to beat it” — which more briefly proposes much the same way to beat the optimizer’s curse (giving me a lot more confidence that the proposal is workable).

(However, that post doesn’t go on to explore the consequences for caution, agents finding low risk ways of resolving uncertainly, staying near previously observed states, or value learning.)