# Occam’s Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann

*[Epistemic Status: My inside view feels confident, but I’ve only discussed this with one other person so far, so I won’t be surprised if it turns out to be confused.]*

Armstrong and Mindermann (A&M) argue “that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations.”

I explain why I think their argument is faulty, concluding that *maybe* Occam’s Razor is sufficient to do the job after all.

In what follows I assume the reader is familiar with the paper already or at least with the concepts within it.

## Brief summary of A&M’s argument:

(This is merely a brief sketch of A&M’s argument; I’ll engage with it in more detail below. For the full story, read their paper.)

Take a human policy pi = P(R) that we are trying to represent in the planner-reward formalism. R is the human’s reward function, which encodes their desires/preferences/values/goals. P() is the human’s planner function, which encodes how they take their experiences as input and try to choose outputs that achieve their reward. Pi, then, encodes the overall behavior of the human in question.

**Step 1:** In any reasonable language, for any plausible policy, you can construct “degenerate” planner-reward pairs that are *almost* as simple as the simplest possible way to generate the policy, yet yield high regret (i.e. have a reward component which is very different from the “true”/“intended” one).

Example: The planner deontologically follows the policy, despite a Buddha-like empty utility function.

Example: The planner greedily maximizes the reward function “obedience-to-the-policy.”

Example: Double-negated version of example 2.

It’s easy to see that these examples, being constructed from the policy, are *at most* slightly more complex than the simplest possible way to generate the policy, since they could make use of that way.
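To make Step 1 concrete, here is a minimal Python sketch of the three degenerate pairs on a toy problem. Everything in it (the two-state policy, the action names, and the helper functions) is a hypothetical illustration rather than A&M’s formal construction. The only point is that each pair reproduces the policy while reusing the policy itself as a subroutine, which is why each is at most slightly longer than the policy’s shortest description.

```python
# Toy illustration with hypothetical names; not A&M's formalism.
ACTIONS = ["rest", "work"]
STATES = ["morning", "evening"]


def policy(state):
    """The observed human policy we want to decompose."""
    return "work" if state == "morning" else "rest"


# Pair 1: a planner that just follows the policy, with an empty reward.
def empty_reward(state, action):
    return 0.0


def indifferent_planner(reward):
    # Ignores the reward entirely and returns the policy unchanged.
    return policy


# Pair 2: a greedy planner maximizing "obedience-to-the-policy".
def obedience_reward(state, action):
    return 1.0 if action == policy(state) else 0.0


def greedy_planner(reward):
    return lambda state: max(ACTIONS, key=lambda a: reward(state, a))


# Pair 3: the double negation of pair 2 (negated reward, minimizing planner).
def negated_obedience_reward(state, action):
    return -obedience_reward(state, action)


def anti_greedy_planner(reward):
    return lambda state: min(ACTIONS, key=lambda a: reward(state, a))


DEGENERATE_PAIRS = {
    "indifferent": (indifferent_planner, empty_reward),
    "greedy": (greedy_planner, obedience_reward),
    "anti-greedy": (anti_greedy_planner, negated_obedience_reward),
}


def reproduces_policy(planner, reward):
    """Check that planner(reward) agrees with the policy in every state."""
    induced = planner(reward)
    return all(induced(s) == policy(s) for s in STATES)


results = {
    name: reproduces_policy(planner, reward)
    for name, (planner, reward) in DEGENERATE_PAIRS.items()
}
```

All three entries of `results` come out true, even though the three reward functions disagree completely about what the human “wants”.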

**Step 2:** The “intended” planner-reward pair—the one that humans would judge to be a reasonable decomposition of the human policy in question—is likely to be significantly more complex than the simplest possible planner-reward pair.

Argument: It’s really complicated.

Argument: The pair contains more information than the policy, so it should be more complicated.

Argument: Philosophers and economists have been trying for years and haven’t succeeded yet.

**Conclusion:** If we use Occam’s Razor alone to find planner-reward pairs that fit a particular human’s behavior, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are building an AI to maximize the reward.

## Methinks the argument proves too much:

My first point is that A&M’s argument probably works just as well for other uses of Occam’s Razor. In particular it works just as well for the canonical use: finding the Laws and Initial Conditions that describe our universe!

Take a sequence of events we are trying to predict/represent with the lawlike-universe formalism, which posits C (the initial conditions) and then L() the dynamical laws, a function that takes initial conditions and extrapolates everything else from them. L(C) = E, the sequence of events/conditions/world-states we are trying to predict/represent.

**Step 1:** In any reasonable language, for any plausible sequence of events, we can construct “degenerate” initial condition + laws pairs that are almost as simple as the simplest pair.

Example: The initial conditions are an empty void, but the laws say “And then the sequence of events that happens is E”

Example: The initial conditions are simply E, and L() doesn’t do anything.

It’s easy to see that these examples, being constructed from E, are *at most* slightly more complex than the simplest possible pair, since they could use the simplest pair to generate E.

**Step 2:** The “intended” initial condition+law pair is likely to be significantly more complex than the simplest pair.

Argument: It’s really complicated.

Argument: The pair contains more information than the sequence of events, so it should be more complicated.

Argument: Physicists have been trying for years and haven’t succeeded yet.

**Conclusion:** If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”

This conclusion may actually be true, but it’s a pretty controversial claim and I predict most philosophers of science wouldn’t be impressed by this argument for it—even the ones who agree with the conclusion.

## Objecting to the three arguments for Step 2

Consider the following hypothesis, which is basically equivalent to the claim A&M are trying to disprove:

**Occam Sufficiency Hypothesis:** *The “intended” pair happens to be the simplest way to generate the policy.*

Notice that everything in Step 1 is consistent with this hypothesis. The first degenerate pairs are constructed from the policy, so they are more complicated than the simplest way to generate it, so if that way is via the intended pair, they are more complicated (albeit only slightly) than the intended pair.
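Stated a bit more formally (this notation is my own gloss, not taken from A&M’s paper): write $K(\pi)$ for the length of the shortest way to generate the policy $\pi$, $\ell(\cdot)$ for the description length of a particular pair, $(P_d, R_d)$ for a degenerate pair, $(P^*, R^*)$ for the intended pair, and $c$ for the small overhead of wrapping the policy in a degenerate pair. Then, roughly:

$$
\underbrace{\ell(P_d, R_d) \le K(\pi) + c}_{\text{Step 1}}
\qquad\text{and}\qquad
\underbrace{\ell(P^*, R^*) = K(\pi)}_{\text{Occam Sufficiency Hypothesis}}
$$

Both can hold at once. Step 2, by contrast, amounts to something like $\ell(P^*, R^*) \gg K(\pi)$, and that is the claim that needs separate support.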

Next, notice that the three arguments in support of Step 2 don’t really hurt this hypothesis:

**Re: first argument:** The intended pair can be both very complex and the simplest way to generate the policy; no contradiction there. Indeed that’s not even surprising: since the policy is generated by a massive messy neural net in an extremely diverse environment, we should expect it to be complex. What matters for our purposes is *not* how complex the intended pair is, but rather how complex it is *relative to the simplest possible way to generate the policy.* A&M need to argue that the simplest possible way to generate the policy is simpler than the intended pair; arguing that the intended pair is complex is at best only half the argument.

Compare to the case of physics: Sure, the laws of physics are complex. They probably take at least a page of code to write up. And that’s aspirational; we haven’t even got to that point yet. But that doesn’t mean Occam’s Razor is insufficient to find the laws of physics.

**Re: second argument:** The inference from “This pair contains more information than the policy” to “this pair is more complex than the policy” is fallacious. Of course the intended pair contains more information than the policy! All ways of generating the policy contain more information than it. This is because there are many ways (e.g. planner-reward pairs) to get any given policy, and thus specifying any particular way is giving you strictly more information than simply specifying the policy.

Compare to the case of physics: Even once we’ve been given the complete history of the world (or a complete history of some arbitrarily large set of experiment-events) there will still be additional things left to specify about what the laws and initial conditions truly are. Do the laws contain a double negation in them, for example? Do they have some weird clause that creates infinite energy but only when a certain extremely rare interaction occurs that never in fact occurs? What language are the laws written in, anyway? And what about the initial conditions? Lots of things left to specify that aren’t determined by the complete history of the world. Yet this does not mean that the Laws + Initial Conditions are more complex than the complete history of the world, and it certainly doesn’t mean we’ll be led astray if we believe in the Laws+Conditions pair that is simplest.

**Re: third argument:** Yes, people have been trying to find planner-reward pairs to explain human behavior for many years, and yes, no one has managed to build a simple algorithm to do it yet. Instead we rely on all sorts of implicit and intuitive heuristics, and we still don’t succeed fully. *But all of this can be said about Physics too*. It’s not like physicists are literally following the Occam’s Razor algorithm—iterating through all possible Law+Condition pairs in order from simplest to most complex and checking each one to see if it outputs a universe consistent with all our observations. And moreover, physicists haven’t succeeded fully either. Nevertheless, many of us are still confident that Occam’s Razor is in principle sufficient: If we were to follow the algorithm exactly, with enough data and compute, we would eventually settle on a Law+Condition pair that accurately describes reality, and it would be the true pair. Again, maybe we are wrong about that, but the arguments A&M have given so far aren’t convincing.
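For readers who want to see what “following the algorithm exactly” would look like, here is a heavily simplified Python sketch. It assumes a made-up toy universe in which a “law” is a short sequence of primitive operations applied at every time step and the “initial condition” is a single integer; the primitives, the bounds, and the crude `description_length` measure are all hypothetical stand-ins. The real procedure would range over all programs in some universal language, which is not something one can actually run to completion.

```python
from itertools import product

# Hypothetical primitive operations from which toy "laws" are built.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}


def run(law, initial, steps):
    """Apply the law once per time step and record the history of states."""
    history, state = [initial], initial
    for _ in range(steps):
        for op in law:
            state = PRIMITIVES[op](state)
        history.append(state)
    return history


def description_length(law, initial):
    # Crude stand-in for complexity: number of ops plus size of the initial value.
    return len(law) + abs(initial)


def candidates(max_ops=3, max_initial=3):
    """Enumerate every (law, initial condition) pair within small bounds."""
    for n_ops in range(1, max_ops + 1):
        for law in product(PRIMITIVES, repeat=n_ops):
            for initial in range(-max_initial, max_initial + 1):
                yield law, initial


def occam_search(observations):
    """Return the simplest pair whose history matches the observations, if any."""
    steps = len(observations) - 1
    ordered = sorted(candidates(), key=lambda pair: description_length(*pair))
    for law, initial in ordered:
        if run(law, initial, steps) == observations:
            return law, initial
    return None


print(occam_search([1, 4, 10, 22]))  # (('inc', 'dbl'), 1)
```

In this toy setting the search recovers the law “add one, then double” with initial condition 1. Nothing here settles whether the analogous search over planner-reward pairs would land on the intended pair; it only pins down what the procedure under discussion is.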

## Conclusion

Perhaps Occam’s Razor is insufficient after all. (Indeed I suspect as much, for reasons I’ll sketch in the appendix.) But as far as I can tell, A&M’s arguments are at best very weak evidence against the sufficiency of Occam’s Razor for inferring human preferences, and moreover they work pretty much just as well against the canonical use of Occam’s Razor too.

This is a bold claim, so I won’t be surprised if it turns out I was confused. I look forward to hearing people’s feedback. Thanks in advance! And thanks especially to Armstrong and Mindermann if they take the time to reply.

*Many thanks to Ramana Kumar for hearing me out about this a while ago when we read the paper together.*

## Appendix: So, is Occam’s Razor sufficient or not?

--A priori, we should expect something more like a speed prior to be appropriate for identifying the mechanisms of a finite mind, rather than a pure complexity prior.

--Sure enough, we can think of scenarios in which e.g. a deterministic universe with somewhat simple laws develops consequentialists who run massive simulations including of our universe and then write down Daniel’s policy in flaming letters somewhere, such that the algorithm “Run this deterministic universe until you find big flaming letters, then read out that policy” becomes a very simple way to generate Daniel’s policy. (This is basically just the “Universal Prior is Malign” idea applied in a new way.)

--So yeah, pure complexity prior is probably not good. But maybe a speed prior would work, or something like it. Or maybe not. I don’t know.

--One case that seems useful to me: Suppose we are considering two explanations of someone’s behavior: (A) They desire the well-being of the poor, but [insert epicycles here to explain why they aren’t donating much, are donating conspicuously, are donating ineffectively] and (B) They desire their peers (and themselves) to *believe* that they desire the well-being of the poor. Thanks to the epicycles in (A), both theories fit the data equally well. But theory B is much simpler. Do we conclude that this person really does desire the well-being of the poor, or not? If we think that even though (A) is more complex it is also more accurate, then yeah it seems like Occam’s Razor is insufficient to infer human preferences. But if we instead think “Yeah, this person just really doesn’t care, and the proof is how much simpler B is than A” then it seems we really are using something like Occam’s Razor to infer human preferences. Of course, this is just one case, so the only way it could prove anything is as a counterexample. To me it doesn’t seem like a counterexample to Occam’s sufficiency, but I could perhaps be convinced to change my mind about that.

--Also, I’m pretty sure that once we have better theories of the brain and mind, we’ll have new concepts and theoretical posits to explain human behavior. (e.g. something something Karl Friston something something free energy?) Thus, the simplest generator of a given human’s behavior will probably not divide *automatically* into a planner and a reward; it’ll probably have many components and there will be debates about which components the AI should be faithful to (dub these components the reward) and which components the AI should seek to surpass (dub these components the planner.) These debates may be intractable, turning on subjective and/or philosophical considerations. So this is another sense in which I think yeah, definitely Occam’s Razor isn’t sufficient—for we will also need to have a philosophical debate about what rationality is.


Some objections:

The thing that you can’t do is *decompose* behavior into planner and reward. If you just want to predict behavior, you can totally do that. Similarly, you can predict future events with physics.

You do need to do the decomposition to run counterfactuals. And indeed I buy the claim that if you literally try to find some input I and some dynamics D such that D(I) is the world trajectory, selecting only by Kolmogorov complexity and accuracy at predicting data, you probably won’t be able to use the resulting D to run counterfactuals. Even ignoring the malign universal prior argument.

If it turns out you can run counterfactuals with D, I would strongly expect that to be because physics “actually” works by some simple D that is “invariant” to the input state. In contrast, I would be astonished if humans “actually” have some reward R in their head that they are trying to maximize, and that is what drives behavior.

I don’t feel much better about the speed prior than the regular Solomonoff prior.

Thanks! I’m not sure I follow you. Here’s what I think you are saying:

--Occam’s Razor will be sufficient for predicting human behavior of course; it just isn’t sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.

--You agree that this argument also works for Laws+Initial Conditions; Occam’s Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!

--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.

If this is what you are saying, then I agree with the second and third points but disagree with the first—or at least, I don’t see any argument for it in A&M’s paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks—that’s what my section “Objecting to the three arguments for Step 2” is about.

Edit to clarify: By “I agree with the second point” I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don’t think the argument works though. But I do think that Occam’s Razor is probably insufficient.

That’s an accurate summary of what I’m saying.

If you are picking randomly out of a set of N possibilities, the chance that you pick the “correct” one is 1/N. It seems like in any decomposition (whether planner/reward or initial conditions/dynamics), there will be N decompositions, with N >> 1, where I’d say “yeah, that probably has similar complexity as the correct one”. The chance that the correct one is also the simplest one out of all of these seems basically like 1/N, which is ~0.

You could make an argument that we aren’t actually choosing randomly, and correctness is basically identical to simplicity. I feel the pull of this argument in the limit of infinite data for laws of physics (but not for finite data), but it just seems flatly false for the reward/planner decomposition.

I feel like there’s a big difference between “similar complexity” and “the same complexity.” Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity… yet realistically an Occam’s-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I’m wrong about this then it seems like Occam’s Razor is broken in general; in any domain there are going to be ways to turn T’s into T*’s. But Occam’s Razor is not broken in general (I feel).

Maybe this is the argument you anticipate above with “...we aren’t actually choosing randomly.” Occam’s Razor isn’t random. Again, I might agree with you that intuitively Occam’s Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren’t arguments that appeared in the text of A&M’s paper.

I thought about this more and re-read the A&M paper, and I now have a different line of thinking compared to my previous comments.

I still think A&M’s No Free Lunch theorem goes through, but now I think A&M are proving the wrong theorem. A&M try to find the simplest (planner, reward) decomposition that is compatible with the human policy, but it seems like we instead additionally want compatibility with all the evidence we have observed, including sensory data of humans saying things like “if I was more rational, I would be exercising right now instead of watching TV” and “no really, my reward function is *not* empty”. The important point is that such sensory data gives us information not just about the human policy, but also about the decomposition. Forcing compatibility with this sensory data seems to rule out degenerate pairs. This makes me feel like Occam’s Razor would work for inferring preferences up to a certain point (i.e. as long as the situations are all “in-distribution”).

If we are trying to find the (planner, reward) decomposition of non-human minds: I think if we were randomly handed a mind from all of mind design space, then A&M’s No Free Lunch theorem would apply, because the simplest explanation really is that the mind has a degenerate decomposition. But if we were randomly handed an alien mind from our universe, then we would be able to use all the facts we have learned about our universe, including how the aliens likely evolved, any statements they seem to be making about what they value, and so on.

Does this line of thinking also apply to the case of science? I think not, because we wouldn’t be able to use our observations to get information about the decomposition. Unlike the case of values, the natural world isn’t making statements like “actually, the laws are empty and all the complexity is in the initial conditions”. I still don’t think the No Free Lunch theorem works for science either, because of my previous comments.

That is the whole point of my research agenda: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

The problem is that the non-subjective evidence does not map onto facts about the decomposition. A human claims X; well, that’s a speech act; are they telling the truth or not, and how do we know? Same for sensory data, which is mainly data about the brain correlated with facts about the outside world; to interpret that, we need to solve human symbol grounding.

All these ideas are in the research agenda (especially section 2). Just as you need something to bridge the is-ought gap, you need some assumptions to make evidence in the world (eg speech acts) correspond to preference-relevant facts.

This video may also illustrate the issues: https://www.youtube.com/watch?v=1M9CvESSeVc&t=1s

Hmm, I like that. I wonder what A&M would say in response. And I agree this is an important and relevant difference between the case of preferences and the case of science.

I still don’t think A&M show that the simplest explanation is a degenerate decomposition. They show that *if* it is, then Occam’s Razor won’t be sufficient, and moreover that there are some degenerate decompositions pretty close to maximally simple. But they don’t do much to rule out the possibility that the simplest explanation is the intended one.

Hey there!

Thanks for this critique; I have, obviously, a few comments ^_^

In no particular order:

First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I’m coming from: https://www.youtube.com/watch?v=1M9CvESSeVc

A useful point from that: *given* human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it’s hard for us to turn off our theory of mind, the decomposition will always feel simple to us. However, the human theory of mind suffers from Moravec’s paradox: though the theory of mind seems simple to us, it is very hard to specify, especially into code.

You’re entirely correct to decompose the argument into Step 1 and Step 2, and to point out that Step 1 has much stronger formal support than Step 2.

I’m not too worried about the degenerate pairs specifically; you can rule them all out with two bits of information. But, once you’ve done that, there will be other almost-as-degenerate pairs that fit with the new information. To rule them out, you need to add more information… but by the time you’ve added all of that, you’ve essentially defined the “proper” pair, by hand.

On speed priors: the standard argument applies for a speed prior, too (see Appendix A of our paper). It applies perfectly for the indifferent planner/zero reward, and applies, given an extra assumption, for the other two degenerate solutions.

Onto the physics analogy! First of all, I’m a bit puzzled by your claim that physicists don’t know how to do this division. Now, we don’t have a full theory of physics; however, all the physical theories I know of, have a very clear and known division between laws and initial conditions. So physicists do seem to know how to do this. And when we say that “it’s very complex”, this doesn’t seem to mean the division into laws and initial conditions is complex, just that the initial conditions are complex (and maybe that the laws are not yet known).

The indifference planner contains almost exactly the same amount of information as the policy. The “proper” pair, on the other hand, contains information such as whether the anchoring bias is a bias (it is) compared with whether paying more for better tasting chocolates is a bias (it isn’t). Basically, none of the degenerate pairs contain any bias information at all; so everything to do with human biases is extra information that comes along with the “proper” pair.

Even ignoring all that, the fact that (p,R) is of comparable complexity to (-p,-R) shows that Occam’s razor cannot distinguish the proper pair from its negative.

And thanks for the reply!

FWIW, I like the research agenda. I just don’t like the argument in the paper. :)

--Yes, without theory of mind the decomposition is complex. But is it more complex than the simplest way to construct the policy? Maybe, maybe not. For all you said in the paper, it could still be that the simplest way to construct the policy is via the intended pair, complex though it may be. (In my words: The Occam Sufficiency Hypothesis might still be true.)

--If the Occam Sufficiency Hypothesis is true, then not only do we not have to worry about the degenerate pairs, we don’t have to worry about anything more complex than them either.

--I agree that your argument, if it works, applies to the speed prior too. I just don’t think it works; I think Step 2 in particular might break for the speed prior, because the Speed!Occam Sufficiency Hypothesis might be true.

--If I ever said physicists don’t know how to distinguish between laws and initial conditions, I didn’t mean it. (Did I?) What I thought I said was that physicists haven’t yet found a law+IC pair that can account for the data we’ve observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren’t just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.

--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn’t mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it’s not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let’s assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.

I realize I may be confused here about how complexity or information works; please correct me if so!

But anyhow if I’m right about this then I am skeptical of conclusions drawn from information to complexity… I’d like to see the argument made more explicit and broken down more at least.

For example, the “proper” pair contains all this information about what’s a bias and what isn’t, because our definition of bias references the planner/reward distinction. But isn’t that unfair? Example: We can write 99999999999999999999999 or we can write “20-digits of 9′s.” The latter is shorter, but it contains more information if we cheat and say it tells us things like “how to spell the word that refers to the parts of a written number.”

Anyhow don’t the degenerate pairs also contain information about biases—for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?

--If it were true that Occam’s Razor can’t distinguish between P,R and -P,-R, then… isn’t that a pretty general argument against Occam’s Razor, not just in this domain but in other domains too?


Hey there!

Responding to a few points. But first, I want to make the point that treating an agent as a (p,R) pair is basically an intentional stance. We choose to treat the agent that way, either for ease of predicting its actions (Dennett’s approach) or for extracting its preferences, to satisfy them (my approach). The decomposition is not a natural fact about the world.

No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent to trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn’t, we’d still have to make decisions).

(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).

Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.

That argument shows that if you look into the algorithm, you can get other differences. But I’m not looking into the algorithm; I’m just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.

Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of −1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the “proper” pair.

The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, and the three degenerate pairs are almost as simple as the policy. However, the “proper” pair can generate a complicated object, the bias function (which has a non-trivial value in almost every possible state). So the proper pair contains at least enough information to specify a) the human policy, and b) the bias function. The Kolmogorov complexity of the proper pair is thus at least that of the simplest algorithm that can generate both those objects.

So one of two things is happening: either the human policy can generate the bias function directly, in some simple way [1], or the proper pair is more complicated than the policy. The first is not impossible, but notice that it has to be “simple”. So the fact that we have not yet found a way to generate the bias function from the policy is an argument that it can’t be done. Certainly there are no elementary mathematical manipulations of the policy that produce anything suitable.

No, because Occam’s razor works in other domains. This is a strong illustration that this domain is actually different.

[1] Let A be the simplest algorithm that generates the human policy, and B the simplest that generates the human policy and the bias function. If there are n different algorithms that generate the human policy and are of length |B| or shorter, then we need to add log2(n) bits of information to the human policy to generate B, and hence, the bias function. So if B is close in complexity to A, we don’t need to add much.

Thanks again! I still disagree, surprise surprise.

I think I agree with you that the (p,R) decomposition is not a natural fact about the world, but I’m not so sure. Anyhow I don’t think it matters for our purposes.

Physicists are trying to do many things. Yes, one thing they are trying to do is predict what is happening in the world. But another thing they are trying to do is figure out stuff about counterfactuals, and for that they need to have a Laws+IC decomposition to work with. So they take their data and they look for a simple Laws+IC decomposition that fits it. They would still do this even if they already knew the results of all the experiments ever, and had no more need to predict things. (Extending the symmetry, humans also typically use the intentional stance on incomplete data about a target human’s policy, for the purpose of predicting the rest of the policy. But this isn’t what you concern yourself with; you assume for the sake of argument that we already have the whole policy and point out that we’d still want to use the intentional stance to get a decomposition so that we could make judgments about rationality. I say yes, true, now apply the same reasoning to physics: assume for the sake of argument that we already know everything that will happen, all the events, and notice that we’d still want to have a Laws+IC decomposition, perhaps to figure out counterfactuals.)

I was assuming there were multiple Law+IC pairs that would generate E… well actually no, the example degenerate pairs I gave prove that there are, no need to assume it!

I don’t see the difference between what you are doing and what I did. You started with a policy and said “But what about bias-facts? The policy by itself doesn’t tell us these facts. So let’s look at the various decompositions of the policy into p,R pairs; they tell us the bias facts.” I start with a number and say “But what about how-to-spell-the-word-that-refers-to-the-parts-of-a-written-number facts? The number doesn’t tell us that. Let’s look at the various decompositions of the number into strings of symbols that represent it; they tell us those facts.”

Thanks for the clarification, that’s what I suspected. So then *every* p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy. By analogous reasoning, *every* algorithm for constructing the policy contains more information than the policy. So even the simplest algorithm for constructing the policy contains more information than the policy. So (by your reasoning) even the simplest algorithm for constructing the policy is more complex than the policy. But this isn’t so; the simplest algorithm for constructing the policy is length L and so has complexity L, and the policy has complexity L too… That’s my argument at least. Again, maybe I’m misunderstanding how complexity works. But now that I’ve laid it out step-by-step, which step do you disagree with?

Wait what? This is what I was objecting to in the original post. The “Occam Sufficiency Hypothesis” is that the human policy is not simpler than all pairs; in particular, it is precisely the simplicity of the intended pair, because the intended pair is the simplest way to construct the policy.

What are the reasonable assumptions that lead to the OSH being false?

My objection to your paper, in a nutshell, was that you didn’t discuss this part: you didn’t give any reason to think OSH was false. The three reasons you gave in Step 2 were reasons to think the intended pair is complex, not reasons to think it is *more* complex than the policy. Or so I argued.

My argument is that if you are right, Occam’s Razor would be generally useless, but it’s not, so you are wrong. In more detail: If Occam’s Razor can’t distinguish between P,R and -P,-R, then (by analogy) in an arbitrary domain it won’t be able to distinguish between theory X and theory b(X), where b() is some simple bizarro function that negates or inverts the parts of X in such a way as to make the changes cancel out.

I’m not sure the physics analogy is getting us very far; I feel there is a very natural way of decomposing physics into laws+initial conditions, while there is no such natural way of doing so for preferences and rationality. But if we have different intuitions on that, then discussing the analogy isn’t going to help us converge!

Agreed (though the extra information may be tiny—a few extra symbols).

That does not follow; the simplest algorithm for building a policy does not go via decomposing into two pieces and then recombining them. We are comparing algorithms that produce a planner-reward pair (two outputs) with algorithms that produce a policy (one output). (but your whole argument shows you may be slightly misunderstanding complexity in this context).

Now, though all pairs are slightly more complex than the policy itself, the bias argument shows that the “proper” pair is considerably more complex. To use an analogy: suppose file1 and file2 are both maximally zipped files. When you unzip file1, you produce image1 (and maybe a small, blank, image2). When you unzip file2, you also produce the same image1, and a large, complex, image2′. Then, as long as image1 and image2′ are at least slightly independent, file2 has to be larger than file1. The more complex image2′ is, and the more independent it is from image1, the larger file2 has to be.
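To make the zip analogy concrete, here is a rough sketch using Python’s `zlib`. The assumptions are loud ones: `zlib` is only a stand-in for a “maximal” compressor, the byte strings standing in for image1, the blank image2 and the complex image2′ are made up, and the “images” are just pseudo-random bytes generated by a hypothetical helper. It only illustrates the direction of the inequality, not anything about actual planner-reward pairs.

```python
import hashlib
import zlib


def pseudo_random_bytes(seed: bytes, n_blocks: int) -> bytes:
    """Deterministic, incompressible-looking data built from chained SHA-256 digests."""
    blocks, block = [], seed
    for _ in range(n_blocks):
        block = hashlib.sha256(block).digest()
        blocks.append(block)
    return b"".join(blocks)


# Stand-ins for the objects in the analogy (all hypothetical).
image1 = pseudo_random_bytes(b"image1", 64)                # the shared output
blank_image2 = bytes(len(image1))                          # small/blank second output
complex_image2 = pseudo_random_bytes(b"image2-prime", 64)  # complex, independent output

file_policy = zlib.compress(image1, 9)                     # image1 alone
file1 = zlib.compress(image1 + blank_image2, 9)            # image1 + blank image2
file2 = zlib.compress(image1 + complex_image2, 9)          # image1 + complex image2'

print("image1 alone:          ", len(file_policy))
print("image1 + blank image2: ", len(file1))
print("image1 + complex image2':", len(file2))
```

In this toy setup the blank second output costs little on top of image1, while the independent complex one costs roughly its own size. Whether the “intended” pair really behaves like file2 is exactly the point under dispute.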

Does that make sense?

I agree that the decomposition of physics into laws+IC is much simpler than the decomposition of a human policy into p,R. (Is that what you mean by “more natural?”) But this is not relevant to my argument, I think.

I feel that our conversation now has branched into too many branches, some of which have been abandoned. In the interest of re-focusing the conversation, I’m going to answer the questions you asked and then ask a few new ones of my own.

To your questions: For me to understand your argument better I’d like to know more about what the pieces represent. Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts? Then what is the “unzip” function? Pairs don’t unzip to anything. You can apply the function “apply the first element of the pair to the second” or you can apply the function “do that, and then apply the MAXIMIZE function to the second element of the pair and compute the difference.” Or there are infinitely many other things you can do with the pair. But the pair itself doesn’t tell you what to do with it, unlike a zipped file which is like an algorithm—it tells you “run me.”

I have two questions. 1. My central claim—which I still uphold as not-ruled-out-by-your-arguments (though of course I don’t actually believe it) is the Occam Sufficiency Hypothesis: “The ‘intended’ pair is the simplest way to generate the policy.” So, basically, what OSH says is that within each degenerate pair is a term, pi (the policy), and when you crack open that term and see what it is made of, you see p(R), the intended policy applied to the intended reward function! Thus, a simplicity-based search will stumble across <p,R> before it stumbles across any of the degenerate pairs, because it needs p and R to construct the degenerate pairs. What part of this do you object to?

2. Earlier you said “given reasonable assumptions, the human policy is simpler than all pairs” What are those assumptions?
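The construction behind question 1 can be sketched in a toy setting. Everything here is an assumption for illustration: the two-state world, the reward, and the choice of a plain maximizer as the “intended” planner (a real human planner would be biased, which is the whole difficulty). The sketch only shows that the degenerate pairs are literally built out of pi = p(R), and that all of them reproduce the same policy.

```python
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]


def intended_reward(state, action):
    """Hypothetical 'true' reward: likes action a in s0 and action b in s1."""
    return 1.0 if (state, action) in {("s0", "a"), ("s1", "b")} else 0.0


def maximizer(reward):
    """Planner that picks the reward-maximizing action in each state."""
    return {s: max(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}


def minimizer(reward):
    """Planner that picks the reward-minimizing action in each state."""
    return {s: min(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}


# The policy, generated the 'intended' way: pi = p(R).
pi = maximizer(intended_reward)


def zero_reward(state, action):
    """Buddha-like empty reward."""
    return 0.0


def constant_planner(reward):
    """Degenerate planner: ignores its input and just follows pi."""
    return pi


def negated_reward(state, action):
    """Degenerate reward: the intended reward with its sign flipped."""
    return -intended_reward(state, action)


pairs = {
    "intended": (maximizer, intended_reward),
    "deontological": (constant_planner, zero_reward),
    "double negation": (minimizer, negated_reward),
}

for name, (planner, reward) in pairs.items():
    print(name, planner(reward) == pi)
```

Note that `constant_planner` and `negated_reward` both refer back to `pi` or `intended_reward` in their definitions, which is the sense in which a search might have to pass through the intended pair first. This says nothing about whether that route is actually the shortest one.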

Once again, thanks for taking the time to engage with me on this! Sorry it took me so long to reply, I got busy with family stuff.

Yes.

The “shortest algorithm generating BLAH” is the maximally compressed way of expressing BLAH—the “zipped” version of BLAH.

Ignoring unzip, which isn’t very relevant, we know that the degenerate pairs are just above the policy in complexity.

So zip(degenerate pair) ≈ zip(policy), while zip(reasonable pair) > zip(policy+complex bias facts) (and zip(policy+complex bias facts) > zip(policy)).

Does that help?

It helps me to understand more clearly your argument. I still disagree with it though. I object to this:

I claim this begs the question against OSH. If OSH is true, then zip(reasonable pair) ≈ zip(policy).

Indeed. It *might* be possible to construct that complex bias function, from the policy, in a simple way. But that claim needs to be supported, and the fact that it hasn’t been found so far (I repeat that it has to be simple) is evidence against it.

Physics doesn’t work on Occam’s razor alone. You need an IC/law division to be able to figure out counterfactuals, but equally you can implement counterfactuals in the form of experiments, and use them to figure out the IC/law split.

How does that alternate method work? Implementing counterfactuals in the form of an experiment?

That would just be performing an experiment. All experiments answer a “what if” question.

I think that’s a bit controversial. Experiments tell us what happens in one timeline, the actual one… just like everything else we see and do. They don’t tell us what would have happened if such-and-such had occurred, because such-and-such didn’t in fact occur.

After the experiment has been performed, the counterfactual is now actual, but it was a counterfactual beforehand. Even if you take the view that everything is determined, experiments are still exploring logical counterfactuals. On the other hand, if you assume holism, then you can’t explore counterfactuals with experiments because you can’t construct a complete state of the universe.

I’m pretty sure that’s not how counterfactuals are normally thought to work. “Counterfactual” means contrary-to-the-facts. Something that is true is not contrary to the facts.

Argument: If you are right, then why is this only true for experiments? Isn’t it equally true for anything that happens—before it happens, it’s just a counterfactual, and then after it happens, it’s actual?

I’m not confident I’ve understood this post, but it seems to me that the difference between the values case and the empirical case is that in the values case, we want to do better than humans at achieving human values (this is the “ambitious” in “ambitious value learning”) whereas in the empirical case, we are fine with just predicting what the universe does (we aren’t trying to predict the universe even better than the universe itself). In the formalism, in π = P(R) we are after R (rather than π), but in E = L(C) we are after E (rather than L or C), so in the latter case it doesn’t matter if we get a degenerate pair (because it will still predict the future events well). Similarly, in the values case, if all we wanted was to imitate humans, then it seems like getting a degenerate pair would be fine (it would act just as human as the “intended” pair).

I don’t understand how this conclusion follows (unless it’s about the malign prior, which seems not relevant here). Could you give more details on why answering counterfactual questions like this would be dangerous?

Thanks! OK, so I agree that normally in doing science we are fine with just predicting what will happen, there’s no need to decompose into Laws and Conditions. Whereas with value learning we are trying to do more than just predict behavior; we are trying to decompose into Planner and Reward so we can maximize Reward.

However the science case can be made analogous in two ways. First, as Eigil says below, realistically we don’t have access to ALL behavior or ALL events, so we will have to accept that the predictor which predicted well so far might not predict well in the future. Thus if Occam’s Razor settles on weird degenerate predictors, it might also settle on one that predicts well up until time T but then predicts poorly after that.

Second, (this is the way I went, with counterfactuals) science isn’t all about prediction. Part of science is about answering counterfactual questions like “what would have happened if...” And typically the way to answer these questions is by decomposing into Laws + Conditions and then doing a surgical intervention on the conditions and then applying the same Laws to the new conditions.

So, for example, if we use Occam’s Razor to find Laws+Conditions for our universe, and somehow it settles on the degenerate pair “Conditions := null, Laws := sequence of events E happens” then all our counterfactual queries will give bogus answers—for example, “what would have happened if we had posted the nuclear launch codes on the Internet?” Answer: “Varying the Conditions but holding the Laws fixed… it looks like E would have happened. So yeah, posting launch codes on the Internet would have been fine, wouldn’t have changed anything.”
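Here is a toy version of that launch-codes example, as a sketch only. The “laws”, the single condition flag, and the outcomes are all invented for illustration; the point is just that both decompositions reproduce the actual events E, but only one of them gives a sensible answer when you intervene on the conditions.

```python
ACTUAL_CONDITIONS = {"launch_codes_posted": False}


def intended_laws(conditions):
    """Hypothetical non-degenerate laws: the outcome depends on the conditions."""
    if conditions.get("launch_codes_posted"):
        return "catastrophe"
    return "business as usual"


# The actual sequence of events, as generated by the intended decomposition.
E = intended_laws(ACTUAL_CONDITIONS)


def degenerate_laws(conditions):
    """Degenerate laws: 'E just happens', whatever the conditions are."""
    return E


def counterfactual(laws, conditions, intervention):
    """Surgically alter the conditions, then apply the same laws."""
    altered = {**conditions, **intervention}
    return laws(altered)


intervention = {"launch_codes_posted": True}

print("actual (intended):   ", intended_laws(ACTUAL_CONDITIONS))
print("actual (degenerate): ", degenerate_laws({}))
print("what if (intended):  ", counterfactual(intended_laws, ACTUAL_CONDITIONS, intervention))
print("what if (degenerate):", counterfactual(degenerate_laws, {}, intervention))
```

Both pairs agree on what actually happened, so no amount of observed data separates them; they only come apart on the counterfactual query.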

Thanks for the explanation, I think I understand this better now.

My response to your second point: I wasn’t sure how the sequence prediction approach to induction (like Solomonoff induction) deals with counterfactuals, so I looked it up, and it looks like we can convert the counterfactual question into a sequence prediction question by appending the counterfactual to all the data we have seen so far. So in the nuclear launch codes example, we would feed the sequence predictor with a video of the launch codes being posted to the internet, and then ask it to predict what sequence it expects to see next. (See the top of page 9 of this PDF and also example 5.2.2 in Li and Vitanyi for more details and further examples.) This doesn’t require a decomposition into laws and conditions; rather it seems to require that the events E be a function that can take in bits and print out more bits (or a probability distribution over bits). But this doesn’t seem like a problem, since in the values case the policy π is also a function. (Maybe my real point is that I don’t understand why you are assuming E has to be a sequence of events?) [ETA: actually, maybe E can be just a sequence of events, but if we’re talking about complexity, there would be some program that generates E, so I am suggesting we use that program instead of L and C for counterfactual reasoning.]

My response to your first point: I am far from an expert here, but my guess is that an Occam’s Razor advocate would bite the bullet and say this is fine, since either (1) the degenerate predictors will have high complexity so will be dominated by simpler predictors, or (2) we are just as likely to be living in a “degenerate” world as we are to be living in the kind of “predictable” world that we think we are living in.

Where we can predict, we do so by feeding a set of conditions into laws.

Methodologically, counterfactuals and predictions are almost the same thing. In the case of a prediction, you feed an actual condition into your laws; in the case of a counterfactual, you feed in a non-actual one.

A simple remark: we don’t have access to all of E, only E up until the current time. So we have to make sure that we don’t get a degenerate pair which diverges wildly from the actual universe at some point in the future.

Maybe this is similar to the fact that we don’t want AIs to diverge from human values once we go off-distribution? But you’re definitely right that there’s a difference: we *do* want AIs to diverge from human *behaviour* (even in common situations).

This is neat. It makes me realize that thinking in terms of simplicity and complexity priors was serving somewhat as a semantic stop sign for me whereas speed prior vs slow prior doesn’t.

When we decompose the sequence of events E into laws L and initial conditions C, the laws don’t just calculate E from C. Rather, L(C) = E1, L(E1) = E2, ... L is a function from events -> events, and the sequence E contains many input-output pairs of L.

By contrast, when we decompose a policy π into a planner P and a reward R, P is a function from rewards -> policy. With the setup of the problem as-is, we have data on many instances of (s;a) pairs (behaviour), so we can infer π with high accuracy. But we only get to see *one* policy, and we never get to explicitly see rewards. In such a case, indeed we will get the empty reward and ∀r[P(r)=π]. To correctly infer R and P, we would have to see our P applied to some other rewards, and the policies resulting from that.

Take the limit as we observe more and more behavior: it takes a million bits to specify E, for example, or a billion. Then the utility maximizer and utility minimizer are both much much simpler (can be specified in fewer bits) than the Buddha-like zero utility agent (assuming E is in fact consistent with a simple utility function). Likewise, in that same limit, the true laws of physics plus initial conditions are much much simpler than saying “L=0 and E just happens”. Right? Sorry if I’m misunderstanding, I haven’t read A&M.

The trick is that you can *use* the simplest method for constructing E in your statement “L=0 and E just happens.” So e.g. if you have some simple Laws l and Conditions c such that l(c) = E, your statement can be “L=0 and l(c) just happens.”

I think the physics analogy here is really cool: the idea of drawing a parallel between the pair “what a person wants and how they behave to get those things” and the pair “how the universe is set up and how it behaves as a result” is an interesting one.

However, arguably, I think many physicists have already settled on a degenerate model of physics: The idea behind the Copenhagen Interpretation is essentially that given an initial condition, some event (partially defined by those conditions) will just randomly happen. It’s not exactly one of the degenerate examples you give (because a lot of rules can be extracted from the initial conditions about how those random things happen) but, at the end of the day, lots of people already accept that the initial-conditions to laws-of-physics pairing is best described by saying “sometimes some things happen and sometimes other things happen.”

I don’t see what’s degenerate about it at all.

Every interpretation yields the same results. There’s no known way of rejigging the initial conditions-to-laws-of-evolution balance that does better.

Exactly. The fact that multiple conceptually distinct rule-sets yield the same results is what makes it degenerate. In the same way that a single policy can be described exactly by multiple degenerate reward functions of similar complexity, the evolution of the universe can be described exactly by multiple sets of physical laws of similar complexity. Sure, the randomness is the best we can do in terms of prediction, but the underlying way that randomness is produced is degenerate:

1. The next state of the universe is evolved from the current state by a combination of details about the current state and a random fluctuation that just happened

2. The next state of the universe is evolved from the current state by a combination of details about the current state and a set of events determined by the laws of the universe which only appear random to us

3. The next state of the universe is evolved from the current state by a combination of details about the current state and a set of observationally random events that were chosen to occur in sequence before the beginning of the universe

and so on...

I personally like quantum mechanics though. I’m just picking on it because, while many formulations of deterministic laws exist, people can always make the argument that their “different” interpretations are just different mathematical reformulations of a single concept. In contrast, it’s easy to pick conceptually different ways in which observationally random events are produced.

In science, the distinctions between 1, 2 and 3 don’t matter since they all predict the same things. But similar distinctions in terms of reward functions matter greatly because they, intuitively, imply different “subjective” experiences. The upshot is that I don’t believe the article’s claim that “physics being degenerate” is a controversial idea.

What does the singular “it” refer to? You could claim that QM is degenerate because multiple formulations lead to the same result, but you *seemed* to have a specific beef with Copenhagen.

Much more than that. There is a lot of moral concern about whether someone is doing something bad as a result of trying to do something good incompetently, or doing something bad intentionally.

I picked Copenhagen because it involves collapsing a wave-function to a random state for a specific universe (ie, the universe evolves in a way that is partially random). If you’re a many worlds theorist, you could plausibly claim that, since the probability distribution describes how frequently different kinds of worlds happen with respect to each other, the universe doesn’t evolve randomly at all; what we perceive as randomness describes a deterministic distribution of all possible worlds.

To me, it looks easy to rebut this argument—you just point out that there is still randomness in your subjective perspective of the world. But then someone else might question that because your “subjective perspective” becomes a matter of anthropics and then the whole conversation gets into some confusing weeds that would dramatically lengthen the amount of time I need to think about things. So I picked Copenhagen specifically as a short-cut.

So yeah, I was picking on Copenhagen because it’s easier to establish in the context of the point I was trying to make (quantum mechanics is degenerate). But I wasn’t picking on it because other interpretations of QM are less problematic than Copenhagen.

Also to clarify:

I don’t have a beef with Copenhagen or with QM. I just think it’s a degenerate world model and, with the definition I’m using, degenerate world models of the kind that QM is aren’t a bad thing.

Even more dramatically than that, we can reverse this to get another important implication! If you’re trying to figure out what’s good for a person based on the consequences they seem to be seeking out, you can’t tell whether that person actually wants the consequences of their behavior (ie the consequences are subjectively good) or whether they want something else but are going about it in an irrational and ineffective way (ie the consequences are subjectively indeterminate). This is really bad for AI alignment.

As a sidenote: One might try to solve this problem by just applying Occam’s Razor (doesn’t it seem more likely and more simple that someone is acting in ways reflective of their preferences rather than incompetence?). But it seems unlikely to me that this actually works, because

-The paper this article is trying to rebut indicates that Occam’s Razor will miss people’s actual preferences because most preferences are unlikely to be the most simple explanation

-This article tries to rebut by pointing out that the paper’s argument proves too much by implying that physics models are degenerate

-I think that physics models are pretty obviously degenerate and I’m okay with us having degenerate models of physics. I’m not okay in general with degenerate models of what people prefer

If you want to explain why the multiple interpretations of QM are degenerate, the minimum number of examples you need is 2, not 1.

Not actually clear. If I had a really long list of factorials (of length n), then perhaps it could be “compressed” in terms of *f* of 1 through n + a description of *f*. However, it’s not clear how large *n* would have to be for that description to be shorter. Thus: is actually simpler, until E is big enough.
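The factorial point can be checked crudely. This is only a sketch under strong assumptions: character count of Python source is used as a stand-in for description length (it is not Kolmogorov complexity), and the particular one-line program in `PROGRAM_TEMPLATE` is a hypothetical choice of “description of f”; a different language or encoding would move the crossover point.

```python
import math

# Hypothetical "f of 1 through n + a description of f": a one-line generating program.
PROGRAM_TEMPLATE = "import math;print([math.factorial(i) for i in range(1,{n}+1)])"


def literal_length(n):
    """Characters needed to write the first n factorials out explicitly."""
    return len(repr([math.factorial(i) for i in range(1, n + 1)]))


def program_length(n):
    """Characters needed for the generating program with n filled in."""
    return len(PROGRAM_TEMPLATE.format(n=n))


def crossover(max_n=1000):
    """Smallest n at which the generating program is shorter than the explicit list."""
    for n in range(1, max_n + 1):
        if program_length(n) < literal_length(n):
            return n
    return None


n_star = crossover()
print("program becomes shorter at n =", n_star)
print("lengths there:", program_length(n_star), "vs", literal_length(n_star))
```

Below the crossover, simply listing the values is the shorter description; above it, the generating program wins and keeps winning, since the explicit list grows much faster than the digits of n.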

I don’t follow?