I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but no such reference exists. Often my reaction is “if only there were time to write an LW post that I can then link to in the future”. So far I’ve just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I’m now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they’re even understandable.

There are two main ways to make predictions about the world:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model.

2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.

I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.

However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.

I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?

Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).

Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.

EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.

Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?

(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)

> Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?

Correct.

> My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.

I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.

> It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.

Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed”.

I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.

“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.

Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).

I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?

(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)

That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”

I think “burden of proof” isn’t quite the right frame, but there is something there that still seems important. I think the bad thing comes from conflating epistemics with Overton-norm-fighting, which are in fact separate.

> maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?

I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.

> Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to “claims that have been reasonably well argued.”

I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.

In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.

By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.

Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.

Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?

(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)

First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.

Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.

This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
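The shape described above can be sketched directly; everything here (the constants a and b, the exact insight-accumulation rule) is made up for illustration:

```python
import math

# Toy sketch of the model above; the constants a and b are hypothetical.
def p_true(time_spent, a=1.0, b=5.0):
    # Constant number of new insights per doubling of time spent...
    insights = math.log2(1 + time_spent)
    # ...and P(true) is logistic in the number of insights.
    return 1 / (1 + math.exp(-(a * insights - b)))

# For small time_spent you do barely better than random; the curve only
# leaves the floor after a soft minimum amount of time.
```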

A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.

An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:

In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.

It’s pretty pointless to tell individual people to “buck the incentives”, even if they are principled people who try to avoid doing bad things: if they take your advice, they will probably just get selected against.

Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.

It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).
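The shape of this update can be checked with made-up numbers: put a uniform prior on the fraction f of people who care, and use a toy observation model in which there were 1000 hypothetical “potential accidents”, each avoided iff the people involved cared:

```python
# Made-up observation model: there were N = 1000 "potential accidents", and
# each was avoided iff the people involved cared about safety (probability f,
# independently). We observe "at least one accident happened".
N = 1000

def likelihood(f):
    # P(at least one accident | fraction f of people care)
    return 1 - f**N

# The update against f = 0.9 (or even f = 0.99) is weak, since the likelihood
# is still ~1, but the update against f = 0.999999 is strong.
```

Since the prior is uniform, posterior odds between two values of f are just the ratio of these likelihoods, which is close to 1 for f = 0.9 vs f = 0.99 but very far from 1 against f = 0.999999.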

However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)

Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.

You’ve heard of crucial considerations, but have you heard of red herring considerations?

These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.

To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.

For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.

Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).

When you make an argument about a person or group of people, often a useful thought process is “can I apply this argument to myself or a group that includes me? If this isn’t a type error, but I disagree with the conclusion, what’s the difference between me and them that makes the argument apply to them but not me? How convinced am I that they actually differ from me on this axis?”

Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.

I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).

In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.

I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.

EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.

In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.

Maybe another way to say this is:

I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)

“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)

Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.

Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:

1. Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can always do this: first decompose not-P into a disjunction of R1 and R2, so that P is the conjunction of not-R1 and not-R2, then decompose R1 further, decompose R2 further, etc.)

2. Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.

3. At least one of these must be over 99% (if we have n = 11 and they were all at most 99%, then the probability of P would be at most 0.99^11 ≈ 89.5%, which contradicts the original 90%).

4. Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of that statement.
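The arithmetic behind the “at least one of these must be over 99%” step is quick to verify:

```python
# If P decomposes into 11 conjuncts and every conditional probability were
# at most 99%, then P could be at most:
upper_bound = 0.99 ** 11
# ...which is below Alice's stated 90%, so at least one conditional
# probability must exceed 99%.
```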

----

What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.

The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote.

A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)

However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.

I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|“intervening is positive”| + |“intervening is negative”|)?

Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.

(This is under the standard model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
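Under that standard model, the failure of the flawed response can be computed exactly, since each person’s opinion is a thresholded noisy observation and the count of negative opinions is binomial. All numbers below are arbitrary:

```python
import math

def normal_cdf(x):
    # CDF of the standard normal distribution
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_intervene(true_value, sigma, population, n):
    # Each person observes true_value + iid Gaussian noise (mean zero, sd sigma)
    # and deems the intervention negative iff their observation is below zero.
    p_neg = normal_cdf(-true_value / sigma)
    # The flawed rule: intervene iff we find fewer than n negative opinions.
    # The number of negative opinions is Binomial(population, p_neg).
    return sum(math.comb(population, k) * p_neg**k * (1 - p_neg)**(population - k)
               for k in range(n))

# With 101 people and noise sd 1 (made-up numbers): majority vote (n = 51)
# usually intervenes on a slightly-positive intervention, but n = 3 almost
# never does (sacrificing value), while n = 98 almost always intervenes even
# on a slightly-negative intervention (the new unilateralist curse).
```

For instance, p_intervene(0.1, 1, 101, 3) is essentially zero even though the intervention is net positive, while p_intervene(-0.1, 1, 101, 98) is essentially one even though it is net negative.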

Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).
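A quick simulation of this (the EVs and noise scale are made up): each action’s EV estimate is unbiased, we pick the argmax, and we track how optimistic the chosen estimate is. Applying any monotone correction to the estimates leaves the decision itself unchanged:

```python
import random
random.seed(0)

true_ev = [0.0, 0.5, 1.0]        # hypothetical true EVs of three actions
shrink = lambda x: 0.5 * x       # an illustrative monotone "curse correction"

trials, bias = 10_000, 0.0
for _ in range(trials):
    # Unbiased but noisy EV estimates of each action
    est = [ev + random.gauss(0, 1) for ev in true_ev]
    choice = max(range(3), key=lambda i: est[i])
    # Any monotone correction of the estimates leaves the decision unchanged:
    assert choice == max(range(3), key=lambda i: shrink(est[i]))
    bias += est[choice] - true_ev[choice]
bias /= trials

# bias is solidly positive: the naive EV estimate of the *chosen* action is
# too optimistic, even though each individual estimate was unbiased.
```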

Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the $R$ subscripts on $V_R$ and $Q_R$.

Consider some starting state $s_0$ and starting action $a_0$, and consider the optimal trajectory under $R$ that starts with that, which we’ll denote as $s_0 a_0 s_1 a_1 s_2 \ldots$. Define $s'_i = T(s_{i-1}, \varnothing)$ to be the one-step inaction states. Assume that $Q^*(s_0, a_0) > Q^*(s_0, \varnothing)$. Since all subsequent actions are optimal for $R$, we have $V^*(s_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) \geq \frac{1}{\gamma}(Q^*(s_{i-1}, \varnothing) - R(s_{i-1})) = V^*(s'_i)$, so the max in the equation above goes away.

Since we’re considering the optimal trajectory, we have $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = [R(s_{i-1}) + \gamma V^*(s_i)] - [R(s_{i-1}) + \gamma V^*(s'_i)] = \gamma (V^*(s_i) - V^*(s'_i))$.

Substituting this back in, we get that the total $R_{AUP}$ for the optimal trajectory is $R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{1}{\gamma}\right)$

which… uh… diverges to negative infinity, as long as $\gamma < 1$. (Technically I’ve assumed that $V^*(s_i) - V^*(s'_i)$ is nonzero, which is an assumption that there is always an action that is better than $\varnothing$.)

So, you must prefer the always-$\varnothing$ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn’t fall into a trap where $\varnothing$ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird—surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.

----

Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?

Let’s consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that $\varnothing$ is guaranteed to be a noop: for any state $s$, we have $T(s, \varnothing) = s$.

Now, for any trajectory $s_0 a_0 \ldots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \varnothing) = V^*(s_{i-1}) - [R(s_{i-1}) + \gamma V^*(s_{i-1})] = (1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})$.

As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) - V^*(s_{i-1}) = \frac{1}{\gamma}((1 - \gamma) V^*(s_{i-1}) - R(s_{i-1}))$.

Plugging this into the original equation recovers the divergence to negative infinity that we saw before.

But let’s assume that we just do a constant scaling to avoid this divergence:

$R_{AUP}(s, a) = R(s) - \lambda \max(V^*(T(s, a)) - V^*(T(s, \varnothing)),\ 0)$

Then for an arbitrary trajectory (assuming that the chosen actions are no worse than $\varnothing$), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda (V^*(s_{i+1}) - V^*(s_i)) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$.

The total reward across the trajectory is then $\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda \left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda \left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$

$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda V^*(s_0) - \lambda \sum_{i=1}^{\infty} \gamma^i (1 - \gamma) V^*(s_i)$

The $\lambda V^*(s_0)$ and $R(s_0)$ terms are constants and so don’t matter for selecting policies, so I’m going to throw them out:

$= \sum_{i=1}^{\infty} \gamma^i \left[R(s_i) - \lambda (1 - \gamma) V^*(s_i)\right]$

So in deterministic environments with state-based rewards where $\varnothing$ is a true noop (even the environment doesn’t evolve), AUP with constant scaling is equivalent to adding a penalty $\text{Penalty}(s) = k V^*(s)$ for some constant $k$; that is, we’re effectively penalizing the agent for reaching good states, in direct proportion to how good they are (according to $R$). Again, this seems much more like satisficing or quantilization than impact / power measurement.

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

Let’s consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical: $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$; that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.

(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these “clusters”; I’m introducing them as a simpler situation where we can understand what’s going on formally.)

In this model, a “sparse region of demonstration-space” is a cluster c with small cardinality |c|, whereas a dense one has large |c|.

Let’s first do some preprocessing. We can rewrite the Boltzmann model so that both models take the form

$p(\tau \mid \theta) = \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$

where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.

So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We’ll assume that LESS is the “correct” way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.

The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its “prior” over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn’t work—it only claims that $p_{\text{Boltzmann}}(\tau) < p_{\text{LESS}}(\tau)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let’s look at the reward learning update:

$\frac{p(\theta_1 \mid \tau)}{p(\theta_2 \mid \tau)} = \frac{p(\theta_1)\, p(\tau \mid \theta_1)}{p(\theta_2)\, p(\tau \mid \theta_2)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{\exp(R_{\theta_1}(c(\tau)))}{\exp(R_{\theta_2}(c(\tau)))} \cdot \frac{\sum_{c'} p(c') \exp(R_{\theta_2}(c'))}{\sum_{c'} p(c') \exp(R_{\theta_1}(c'))}$

In the last step, any terms in $p(\tau \mid \theta)$ that didn’t depend on $\theta$ cancelled out. In particular, the prior over the selected class cancelled out (though the prior did remain in the normalizer / denominator, where it can still affect things). But the simple argument of “the prior is lower, therefore it updates more strongly” doesn’t seem to be reflected here.

Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose—the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster $c(\tau)$ it is in). So from now on I’ll just talk about selecting clusters, and updating on them. I’ll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.

The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let’s consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first factor doesn’t differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$.

What happens when the chosen class $c$ is sparse? Without loss of generality, let’s say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS—which probably means that the Boltzmann $\vec{p}$ is better aligned with $\vec{ER}_{\theta_2}$, which also has a low value of $ER_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$ above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.

(Note it does make sense to analyze the effect on the θ that we update towards, because in reward learning we care primarily about the θ that we end up having higher probability on.)
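A concrete numeric check of this claim, using two hypothetical clusters (a sparse one with 1 trajectory and a dense one with 10) and made-up rewards under two hypotheses:

```python
import math

# Two hypothetical clusters: c0 is sparse (1 trajectory), c1 is dense (10).
sizes = [1, 10]
# Made-up exponentiated cluster rewards ER_theta(c) under two hypotheses:
# theta1 fits the sparse cluster better, theta2 fits the dense one.
ER1 = [math.e, 1.0]
ER2 = [1.0, math.e]

def likelihood(cluster, ER, p_c):
    # p(tau | theta) for a trajectory in `cluster`, under cluster prior p_c
    Z = sum(p * er for p, er in zip(p_c, ER))
    return p_c[cluster] * ER[cluster] / (Z * sizes[cluster])

def update_ratio(cluster, p_c):
    # Posterior odds of theta1 : theta2 after one demo from `cluster`
    # (assuming equal priors on the two hypotheses)
    return likelihood(cluster, ER1, p_c) / likelihood(cluster, ER2, p_c)

p_boltz = [s / sum(sizes) for s in sizes]  # Boltzmann: p(c) proportional to |c|
p_less = [0.5, 0.5]                        # LESS: p(c) uniform

# A demo from the sparse cluster updates both models towards theta1
# (ratio > 1), but the Boltzmann model updates substantially more strongly.
```

With these numbers the LESS odds ratio is exactly e, while the Boltzmann ratio is more than twice as large, consistent with Boltzmann over-learning in the sparse region.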

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability in both the original paper and the new one.

Then, for the new paper when using a baseline, the future task value $V^*_{\text{future}}(s, s')$ is:

$\mathbb{E}_{g,\ \tau \sim \pi_g,\ \tau' \sim \pi'_g}[\gamma^{\max(n, n')}]$

where $s'$ is the baseline state, $g$ is the future goal, and $n, n'$ are the lengths of $\tau, \tau'$.

In a deterministic environment, this can be rewritten as:

$V^*_{\text{future}}(s, s')$

$= \mathbb{E}_g[\gamma^{\max(n, n')}]$

$= \mathbb{E}_g[\min(R(s, g), R(s', g))]$

$= \mathbb{E}_g[R(s', g) - \max(R(s', g) - R(s, g),\ 0)]$

$= \mathbb{E}_g[R(s', g)] - \mathbb{E}_g[\max(R(s', g) - R(s, g),\ 0)]$

$= \mathbb{E}_g[R(s', g)] - d_{RR}(s, s')$

Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state $s'$. So for a fixed time step, the first term is a constant.
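The two rewrites in the derivation above rest on simple min/max identities, which can be sanity-checked numerically (values arbitrary):

```python
# Sanity checks for the two rewrites above, with arbitrary values.
gamma, n1, n2 = 0.9, 3, 7

# gamma^max(n, n') = min(gamma^n, gamma^n') whenever 0 < gamma < 1:
assert gamma ** max(n1, n2) == min(gamma ** n1, gamma ** n2)

# min(a, b) = b - max(b - a, 0), used to pull out the relative
# reachability penalty:
a, b = 0.25, 0.75
assert min(a, b) == b - max(b - a, 0)
```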

The optimal value function in the new paper (page 3; I’m using my notation of $V^*_{\text{future}}$ instead of their $V^*_i$) satisfies the regular Bellman equation, but with the following augmented reward (here $s'_t$ is the baseline state at time $t$):

Terminal states:

$r_{\text{new}}(s_t) = r(s_t) + \beta V^*_{\text{future}}(s_t, s'_t) = r(s_t) - \beta d_{RR}(s_t, s'_t) + \beta \mathbb{E}_g[R(s'_t, g)]$

Non-terminal states:

$r_{\text{new}}(s_t, a_t) = r(s_t, a_t) + (1 - \gamma) \beta V^*_{\text{future}}(s_t, s'_t) = r(s_t) - (1 - \gamma) \beta d_{RR}(s_t, s'_t) + (1 - \gamma) \beta \mathbb{E}_g[R(s'_t, g)]$

For comparison, the original relative reachability reward is:

$r_{RR}(s_t, a_t) = r(s_t) - \beta d_{RR}(s_t, s'_t)$

The first and third terms in $r_{\text{new}}$ are very similar to the two terms in $r_{RR}$. The second term in $r_{\text{new}}$ only depends on the baseline.

All of these rewards so far are for finite-horizon MDPs (at least, that’s what it sounds like from the paper, and if not, they could be anyway). Let’s convert them to infinite-horizon MDPs (which will make things simpler, though that’s not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1 - \gamma)$ for convenience. Then, we have:

Non-terminal states:

$r_{\text{new}}(s_t, a_t) = r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$

$r_{RR}(s_t, a_t) = r(s_t) - \beta d_{RR}(s_t, s'_t)$

What used to be terminal states that are now self-loop states:

$r_{\text{new}}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$

$r_{RR}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t)$
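A quick numeric check of the finite-to-infinite-horizon conversion used above (discount and reward values arbitrary): a terminal reward r received once is worth the same as (1 − γ)r received at every step of the self-loop:

```python
gamma, r = 0.9, 2.0  # arbitrary discount and terminal reward

# Finite-horizon MDP: reward r is received once, at the terminal state.
once = r

# Infinite-horizon MDP: the terminal state becomes a self-loop that gives
# (1 - gamma) * r at every subsequent step (discounted from arrival).
forever = sum((1 - gamma) * r * gamma**t for t in range(10_000))

# The two agree (up to truncation of the sum), so the conversion preserves
# optimal policies.
```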

Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term, $k\, \mathbb{E}_g[R(s'_t, g)]$.

This term depends only on the baseline $s'_t$. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn’t matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals g that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1 - \gamma)$. Really, since $\beta$ (and thus $k$) is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1 - \gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down sooner. (This is also clear from the original paper: since it’s a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: Relative to relative reachability, the new paper’s framing 1. removes the “extra” incentive to finish the task quickly, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it’s interesting that they somewhat converge here.)

The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$L(\theta_T) - L(\theta_0) = \sum_t \left[L(\theta_t) - L(\theta_{t-1})\right]$

And then to decompose each per-iteration change in training loss across specific parameters:

$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$

I’ve added vector arrows to emphasize that θ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}(\nabla L(\theta)).$$

(This is pretty standard, but I’ve included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters.

So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as

$$A^{(i)}_t = (\theta^{(i)}_t - \theta^{(i)}_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$

So based on this, I’m going to define my own version of LCA, called LCA-Naive. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). LCA-Naive uses the approximation $\mathrm{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving $A^{(i)}_{t,\mathrm{Naive}} = (\theta^{(i)}_t - \theta^{(i)}_{t-1}) G^{(i)}_{t-1}$. But the SGD update is given by $\theta^{(i)}_t = \theta^{(i)}_{t-1} - \alpha G^{(i)}_{t-1}$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\mathrm{Naive}} = (-\alpha G^{(i)}_{t-1}) G^{(i)}_{t-1} = -\alpha (G^{(i)}_{t-1})^2$, which is always negative, i.e. it predicts that every parameter always learns in every iteration. This isn’t surprising—we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
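As a sanity check, here is a minimal sketch of this claim (a toy quadratic loss with plain SGD; all names and numbers here are mine, not from the paper): every per-parameter contribution computed the LCA-Naive way comes out non-positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: L(theta) = 0.5 * ||A @ theta - b||^2.
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def grad(theta):
    return A.T @ (A @ theta - b)

theta = rng.normal(size=5)
alpha = 0.01  # learning rate

for t in range(100):
    g = grad(theta)                # G_{t-1}: gradient at the old parameters
    new_theta = theta - alpha * g  # SGD update
    # LCA-Naive contribution per parameter: (theta_t - theta_{t-1}) * G_{t-1},
    # which is exactly -alpha * g**2 elementwise, hence never positive.
    contrib = (new_theta - theta) * g
    assert np.all(contrib <= 0)
    theta = new_theta
```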

Yet, the experiments in the paper sometimes show positive LCAs. What’s up with that? There are a few differences between LCA-Naive and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. LCA-Naive approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.

3. LCA-Naive uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_{t-1}$ and $\theta_t$ to reduce the approximation error.

I think those are the only differences (though it’s always hard to tell if there’s some unmentioned detail that creates another difference), which means that whenever the paper says “these parameters had positive LCA”, that effect can be attributed to some combination of the above 3 factors.

----

Derivation of turning the path integral into a dot product with an average:
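The derivation presumably went along these lines: parametrize the linear path from $\theta_{t-1}$ to $\theta_t$ by $u \in [0,1]$, i.e. $\theta(u) = \theta_{t-1} + u(\theta_t - \theta_{t-1})$, so that $d\theta = (\theta_t - \theta_{t-1})\,du$. Then:

```latex
\int_{\theta_{t-1}}^{\theta_t} \nabla L(\theta) \cdot d\theta
  = \int_0^1 \nabla L(\theta(u)) \cdot (\theta_t - \theta_{t-1}) \, du
  = (\theta_t - \theta_{t-1}) \cdot \int_0^1 \nabla L(\theta(u)) \, du
  = (\theta_t - \theta_{t-1}) \cdot \mathrm{Average}_{t-1}^{t}(\nabla L(\theta))
```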

This fits into the broader story being told in other papers that what’s happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn’t generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training “fixes” the weights to memorize noise in a different way that generalizes better. While I can’t rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn’t “come into effect” after the interpolation threshold.)

One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.

I don’t buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are N classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is −ln(1/N) = ln(N) ≈ 2.3 when N=10 (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we’d expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
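Concretely, the loss of the “predict uniform” classifier is a two-line check (numbers mine):

```python
import math

# Cross-entropy of always predicting 1/N on N equally likely classes:
# every data point incurs loss -ln(1/N) = ln(N).
N = 10  # e.g. CIFAR-10
uniform_loss = -math.log(1.0 / N)
print(round(uniform_loss, 3))  # 2.303
```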

There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy “overwhelms” the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can’t be true. When training on just L2 regularization, the gradient descent update is:

$$w \leftarrow w - \lambda w = (1 - \lambda) w = cw \text{ for some constant } c.$$

For MLPs and CNNs with relu activations, if you multiply all the weights by a constant, the logits also get multiplied by a constant, no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can’t see a double descent on test error in this setting. (This doesn’t eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can’t happen in the “first train to zero error with cross-entropy and then regularize” setting.)
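A quick numerical check of the scaling argument (a hypothetical 3-layer bias-free relu MLP with random weights; since relu is positively homogeneous, scaling every weight matrix by c > 0 scales the logits by c³ and leaves the argmax, and hence the error, unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_logits(x, weights):
    # Simple MLP with relu activations and no biases (hypothetical architecture).
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)  # relu(c * z) = c * relu(z) for c > 0
    return weights[-1] @ h

weights = [rng.normal(size=(16, 8)),
           rng.normal(size=(16, 16)),
           rng.normal(size=(10, 16))]
x = rng.normal(size=8)

c = 0.9  # e.g. the (1 - lambda) factor from one L2-only update
scaled = [c * W for W in weights]

orig = mlp_logits(x, weights)
after = mlp_logits(x, scaled)
# Logits scale by c^(number of layers); the predicted class is unchanged.
assert np.allclose(after, (c ** 3) * orig)
assert np.argmax(after) == np.argmax(orig)
```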

The paper tests with CNNs, but doesn’t mention what activation they use. Still, I’d find it very surprising if double descent only happened for a particular activation function.

I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn’t exist any such thing. Often my reaction is “if only there was time to write an LW post that I can then link to in the future”. So far I’ve just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I’m now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they’re even understandable.

Consider two methods of thinking:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by “rolling out” that model

2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.

I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.

However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.

I think many people on LW tend to use option 1 almost always and my “deference” to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?

Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).

Options 1 & 2 sound to me a lot like inside view and outside view. Fair?

Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven’t explained them here.

EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won’t be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.

Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?

(My own take is the cop-out-like, “it depends”. I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you’ve put into it, etc.)

Correct.

I didn’t say you should defer to experts, just that if you try to build gears-y models you’ll be wrong. It’s totally possible that there’s no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.

Good point!

I recently interviewed someone who has a lot of experience predicting systems, and they had 4 steps similar to your two above.

1. Observe the world and see if it’s sufficiently similar to other systems to predict based on intuitive analogies.

2. If there’s not a good analogy, understand the first principles, then try to reason about the equilibria of that.

3. If that doesn’t work, assume the world will stay in a stable state, and try to reason from that.

4. If that doesn’t work, figure out the worst case scenario and plan from there.

I think 1 and 2 are what you do with expertise, and 3 and 4 are what you do without expertise.

Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily “for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed”.

I’m also noticing I mean something slightly different by “expertise” than is typically meant. My intended meaning of “expertise” is more like “you have lots of data and observations about the system”, e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.

I like this experiment! Keep ’em coming.

“Burden of proof” is a bad framing for epistemics. It is not incumbent on others to provide exactly the sequence of arguments to make you believe their claim; your job is to figure out whether the claim is true or not. Whether the other person has given good arguments for the claim does not usually have much bearing on whether the claim is true or not.

Similarly, don’t say “I haven’t seen this justified, so I don’t believe it”; say “I don’t believe it, and I haven’t seen it justified” (unless you are specifically relying on absence of evidence being evidence of absence, which you usually should not be, in the contexts that I see people doing this).

I’m not 100% sure this needs to be much longer. It might actually be good to just make this a top-level post so you can link to it when you want, and maybe specifically note that if people have specific confusions/complaints/arguments that they don’t think the post addresses, you’ll update the post to address those as they come up?

(Maybe caveating the whole post under “this is not currently well argued, but I wanted to get the ball rolling on having some kind of link”)

That said, my main counterargument is: “Sometimes people are trying to change the status quo of norms/laws/etc. It’s not necessarily possible to review every single claim anyone makes, and it is reasonable to filter your attention to ‘claims that have been reasonably well argued.’”

I think ‘burden of proof’ isn’t quite the right frame but there is something there that still seems important. I think the bad thing comes from distinguishing epistemics vs Overton-norm-fighting, which are in fact separate.

I don’t really want this responsibility, which is part of why I’m doing all of these on the shortform. I’m happy for you to copy it into a top-level post of your own if you want.

I agree this makes sense, but then say “I’m not looking into this because it hasn’t been well argued (and my time/attention is limited)”, rather than “I don’t believe this because it hasn’t been well argued”.

In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.

By “selection”, I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.

Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.

Let’s say you’re trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?

(I’m assuming here that you can’t defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)

First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.

Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.

This soft minimum on time is going to depend on a bunch of things—how “hard” or “complex” or “high-dimensional” the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
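As a toy formalization of the paragraphs above (every parameter value here is invented for illustration): P(true) as a logistic function of log(time spent), rising from the chance baseline toward certainty.

```python
import math

def p_true(hours, p_random=0.5, soft_min_hours=10.0, steepness=2.0):
    # Hypothetical model: insights accrue with log(time spent), and P(true)
    # is logistic in the number of insights. All parameter values are made up.
    x = steepness * (math.log(hours) - math.log(soft_min_hours))
    return p_random + (1.0 - p_random) / (1.0 + math.exp(-x))
```

Below the soft minimum you are barely better than chance (`p_true(1.0)` ≈ 0.505); well past it you approach certainty (`p_true(1000.0)` ≈ 0.99995).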

A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually “idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects”, without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be “I think you need to have thought about this for more time than you have before I expect you to do better than random”.

An incentive for property X (for humans) usually functions via selection, not via behavior change. A couple of consequences:

1. In small populations, even strong incentives for X may not get you much more of X, since there isn’t a large enough population for there to be much deviation on X to select on.

2. It’s pretty pointless to tell individual people to “buck the incentives”, even if they are principled people who try to avoid doing bad things: if they take your advice, they probably just get selected against.

Sometimes people say “look at these past accidents; in these cases there were giant bureaucracies that didn’t care about safety at all, therefore we should be pessimistic about AI safety”. I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.

This is not just one man’s modus ponens—the key issue is the selection effect.

It’s easiest to see with a Bayesian treatment. Let’s say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don’t care about safety? Almost all of them—even if 90% of people care about safety, there will still be some cases where people didn’t care and accidents happened; and of course we’d hear about them if so (and not hear about the cases where accidents didn’t happen). You can get a strong update against 99.9999% and higher, but by the time you’re at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don’t learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many “potential accidents” there could have been).

However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn’t have prevented, so I don’t have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and “all we have to do” is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)
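Here’s a quick numerical version of the update described above (the number of “potential accidents” K is made up; the uniform prior is as stated):

```python
import numpy as np

# Uniform prior over f, the fraction of decision-makers who care about safety.
f = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(f)

# Observation: among K potential accidents, at least one happened where nobody
# cared. P(obs | f) = 1 - f^K. K = 100 is invented for illustration.
K = 100
posterior = prior * (1.0 - f ** K)
posterior /= posterior.sum()

# The observation barely distinguishes f = 0.9 from f = 0.5 ...
print(posterior[900] / posterior[500])  # ≈ 1.0
# ... and the update against f = 0.99 is mild (a factor of ~1.6).
print(posterior[900] / posterior[990])
```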

Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.

You’ve heard of crucial considerations, but have you heard of red herring considerations?

These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn’t affect anything decision-relevant.

To solve a problem quickly, it’s important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.

For example, it might seem like “what is the right system of ethics” is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.

Here’s an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).

Alternate names: sham considerations? insignificant considerations?

When you make an argument about a person or group of people, often a useful thought process is “can I apply this argument to myself or a group that includes me? If this isn’t a type error, but I disagree with the conclusion, what’s the difference between me and them that makes the argument apply to them but not me? How convinced I am that they actually differ from me on this axis?”

Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made—rather than thinking of a comment thread as “this is trying to ascertain whether X is true”, they seem to instead read the comments, perform some sort of inference over what the author must believe if that comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.

I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).

An argument form that I like:

I think this should be convincing even if Y is false, unless you can explain why your argument for X does not work under assumption Y.

An example: any AI safety story (X) should also work if you assume that the AI does not have the ability to take over the world during training (Y).

Trying to follow this. Doesn’t the Y (AI not taking over the world during training) make it less likely that X (AI will take over the world at all)?

Which seems to contradict the argument structure. Perhaps you can give a few more examples to make more clear the structure?

In that example, X is “AI will not take over the world”, so Y makes X more likely. So if someone comes to me and says “If we use <technique>, then AI will be safe”, I might respond, “well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>”.

I don’t think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I’m explicitly trying for this to be a low-effort thing, so I’m not going to try to write more examples now.

EDIT: Actually, the double descent comment below has a similar structure, where X = “double descent occurs because we first fix bad errors and then regularize”, and Y = “we’re using an MLP / CNN with relu activations and vanilla gradient descent”.

In fact, the AUP power comment does this too, where X = “we can penalize power by penalizing the ability to gain reward”, and Y = “the environment is deterministic, has a true noop action, and has a state-based reward”.

Maybe another way to say this is:

I endorse applying the “X proves too much” argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an “X proves too much” argument to an impossible scenario.)

“Minimize AI risk” is not the same thing as “maximize the chance that we are maximally confident that the AI is safe”. (Somewhat related comment thread.)

Let’s say we’re talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.

Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here’s a fully general counterargument that Alice is wrong:

1. Decompose P into a series of conjunctions Q1, Q2, … Qn, with n > 10. (You can first decompose not-P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)

2. Ask Alice to estimate P(Qk | Q1, Q2, … Q{k-1}) for all k.

3. At least one of these must be over 99% (if we have n = 11 and they were all 99%, then the probability of P would be (0.99 ^ 11) = 89.5%, which contradicts the original 90%).

4. Argue that Alice can’t possibly have enough knowledge to place under 1% on the negation of that statement.
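The arithmetic in step 3, spelled out:

```python
# With n = 11 conjuncts each capped at 99%, the product falls short of 90%:
n, target = 11, 0.90
print(round(0.99 ** n, 4))          # 0.8953 < 0.9
# So with equal conditionals, each one would have to be at least:
print(round(target ** (1 / n), 4))  # 0.9905
```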

----

What’s the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.

The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then do the majority vote.

A particular flawed response is to look for N opinions that say “intervening is net negative” and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)

However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.

I didn’t get it… is the problem with the “look for N opinions” response that you aren’t computing the denominator (|”intervening is positive”| + |”intervening is negative”|)?

Yes, that’s the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.

(This is under the standard model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
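A Monte Carlo sketch of why the N-opinions rule goes wrong under that standard model (population size, noise scale, and the threshold N are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def intervention_rates(true_value, n_agents=20, threshold_n=15, trials=20000):
    # Each agent observes true_value + iid standard Gaussian noise.
    obs = true_value + rng.normal(size=(trials, n_agents))
    n_negative = (obs < 0).sum(axis=1)
    # Flawed rule: intervene iff we cannot find threshold_n negative opinions.
    flawed = (n_negative < threshold_n).mean()
    # Majority vote: intervene iff most agents judge the intervention positive.
    majority = (n_negative < n_agents / 2).mean()
    return flawed, majority

# A mildly net-negative intervention: with N >> population/2, the flawed rule
# intervenes far more often than the majority vote does.
flawed, majority = intervention_rates(true_value=-0.2)
assert flawed > majority
```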

Under the standard setting, the optimizer’s curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer’s curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don’t already account for it).

Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to R on VR and QR.

$$R_{AUP}(s,a) = R(s) - \lambda \, \frac{\max\left(V^*(T(s,a)) - V^*(T(s,\emptyset)),\, 0\right)}{V^*(s) - Q^*(s,\emptyset)}$$

Consider some starting state $s_0$, some starting action $a_0$, and consider the optimal trajectory under $R$ that starts with that, which we’ll denote as $s_0 a_0 s_1 a_1 s_2 \ldots$. Define $s'_i = T(s_{i-1}, \emptyset)$ to be the one-step inaction states. Assume that $Q^*(s_0, a_0) > Q^*(s_0, \emptyset)$. Since all other actions are optimal for $R$, we have $V^*(s_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) \geq \frac{1}{\gamma}(Q^*(s_{i-1}, \emptyset) - R(s_{i-1})) = V^*(s'_i)$, so the max in the equation above goes away, and the total $R_{AUP}$ obtained is:

$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{V^*(s_i) - V^*(s'_i)}{V^*(s_{i-1}) - Q^*(s_{i-1}, \emptyset)}\right)$$

Since we’re considering the optimal trajectory, we have $V^*(s_{i-1}) - Q^*(s_{i-1}, \emptyset) = [R(s_{i-1}) + \gamma V^*(s_i)] - [R(s_{i-1}) + \gamma V^*(s'_i)] = \gamma (V^*(s_i) - V^*(s'_i))$

Substituting this back in, we get that the total $R_{AUP}$ for the optimal trajectory is

$$R_{AUP}(s_0, a_0) + \left(\sum_{i=1}^{\infty} \gamma^i R(s_i, a_i)\right) - \lambda \left(\sum_{i=2}^{\infty} \frac{1}{\gamma}\right)$$

which… uh… diverges to negative infinity, as long as γ<1. (Technically I’ve assumed that V∗(si)−V∗(s′i) is nonzero, which is an assumption that there is always an action that is better than ∅.)

So, you must prefer the always-∅ trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn’t fall into a trap where ∅ is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird—surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.

----

Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?

Let’s consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that ∅ is guaranteed to be a noop: for any state s, we have T(s,∅)=s.

Now, for any trajectory $s_0 a_0 \ldots$ with $s'_i$ defined as before, we have $V^*(s'_i) = V^*(s_{i-1})$, so $V^*(s_{i-1}) - Q^*(s_{i-1}, \emptyset) = V^*(s_{i-1}) - [R(s_{i-1}) + \gamma V^*(s_{i-1})] = (1 - \gamma) V^*(s_{i-1}) - R(s_{i-1})$

As a check, in the case where $a_{i-1}$ is optimal, we have $V^*(s_i) - V^*(s'_i) = \frac{1}{\gamma}(V^*(s_{i-1}) - R(s_{i-1})) - V^*(s_{i-1}) = \frac{1}{\gamma}\left((1-\gamma) V^*(s_{i-1}) - R(s_{i-1})\right)$

Plugging this into the original equation recovers the divergence to negative infinity that we saw before.

But let’s assume that we just do a constant scaling to avoid this divergence:

$$R_{AUP}(s,a) = R(s) - \lambda \max\left(V^*(T(s,a)) - V^*(T(s,\emptyset)),\, 0\right)$$

Then for an arbitrary trajectory (assuming that the chosen actions are no worse than ∅), we get $R_{AUP}(s_i, a_i) = R(s_i) - \lambda(V^*(s_{i+1}) - V^*(s_i)) = R(s_i) - \lambda V^*(s_{i+1}) + \lambda V^*(s_i)$. The total reward across the trajectory is then

$$\left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) - \lambda \left(\sum_{i=1}^{\infty} \gamma^{i-1} V^*(s_i)\right) + \lambda \left(\sum_{i=0}^{\infty} \gamma^i V^*(s_i)\right)$$

$$= \left(\sum_{i=0}^{\infty} \gamma^i R(s_i)\right) + \lambda V^*(s_0) - \lambda \sum_{i=1}^{\infty} \gamma^{i-1} (1-\gamma) V^*(s_i)$$

The λV∗(s0) and R(s0) are constants and so don’t matter for selecting policies, so I’m going to throw them out:

$$= \sum_{i=1}^{\infty} \gamma^i \left[R(s_i) - \frac{\lambda(1-\gamma)}{\gamma} V^*(s_i)\right]$$

So in deterministic environments with state-based rewards where ∅ is a true noop (even the environment doesn’t evolve), AUP with constant scaling is equivalent to adding a penalty $\text{Penalty}(s) = k V^*(s)$ for some constant $k$; that is, we’re effectively penalizing the agent for reaching good states, in direct proportion to how good they are (according to R). Again, this seems much more like satisficing or quantilization than impact / power measurement.

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

Let’s consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical: $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.

(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these “clusters”; I’m introducing them as a simpler situation where we can understand what’s going on formally.)

In this model, a “sparse region of demonstration-space” is a cluster c with small cardinality |c|, whereas a dense one has large |c|.

Let’s first do some preprocessing. We can rewrite the Boltzmann model as follows:

$$p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))} = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c')) \cdot |c'|} = \frac{|c(\tau)| \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \cdot \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:

$$p(\tau \mid \theta) = \frac{p(c(\tau)) \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

Where for LESS p(c) is uniform i.e. p(c)∝1, whereas for Boltzmann p(c)∝|c|, i.e. a denser cluster is more likely to be sampled.
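In code, the cluster-selection step under the two models looks like this (toy clusters; all numbers are mine):

```python
import numpy as np

def cluster_probs(rewards, sizes, model):
    # P(cluster | theta): the prior over clusters is uniform for LESS and
    # proportional to cluster size |c| for Boltzmann.
    rewards = np.asarray(rewards, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    prior = np.ones_like(sizes) if model == "less" else sizes
    w = prior * np.exp(rewards)
    return w / w.sum()

# Two clusters with equal reward; cluster 0 is sparse, cluster 1 is dense.
rewards, sizes = [1.0, 1.0], [1, 99]
print(cluster_probs(rewards, sizes, "less"))       # [0.5  0.5 ]
print(cluster_probs(rewards, sizes, "boltzmann"))  # [0.01 0.99]
```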

So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We’ll assume that LESS is the “correct” way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.

The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its “prior” over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn’t work—it only claims that $p_{Boltzmann}(\tau) < p_{LESS}(\tau)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let’s look at the reward learning update:

$$p(\theta \mid \tau) = \frac{p(\theta) \cdot p(\tau \mid \theta)}{\sum_{\theta'} p(\theta') \cdot p(\tau \mid \theta')} = \frac{p(\theta) \cdot \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}}{\sum_{\theta'} p(\theta') \cdot \frac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}}.$$

In the last step, any linear terms in p(τ∣θ) that didn’t depend on θ cancelled out. In particular, the prior over the selected class canceled out (though the prior did remain in normalizer / denominator, where it can still affect things). But the simple argument of “the prior is lower, therefore it updates more strongly” doesn’t seem to be reflected here.

Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose—the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster $c(\tau)$ it is in). So from now on I’ll just talk about selecting clusters, and updating on them. I’ll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.

$$p(\theta \mid c) = \frac{p(\theta) \cdot \frac{ER_\theta(c)}{\sum_{c'} p(c') ER_\theta(c')}}{\sum_{\theta'} p(\theta') \cdot \frac{ER_{\theta'}(c)}{\sum_{c'} p(c') ER_{\theta'}(c')}}.$$

This is a horrifying mess of an equation. Let’s switch to odds:

$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{ER_{\theta_1}(c)}{ER_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c') ER_{\theta_2}(c')}{\sum_{c'} p(c') ER_{\theta_1}(c')}$$

The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let's consider just the last term. Denoting the vector of priors on all clusters as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes

$$\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)},$$

where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first factor doesn't differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\cos(\alpha_2) / \cos(\alpha_1)$.

What happens when the chosen cluster $c$ is sparse? Without loss of generality, say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS, which *probably* means that $\vec{p}$ is better aligned with $\vec{ER}_{\theta_2}$, which also has a low value $ER_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio $\cos(\alpha_2) / \cos(\alpha_1)$ above would be higher for Boltzmann than for LESS, and so Boltzmann would update more strongly towards $\theta_1$, supporting the claim that Boltzmann overlearns rather than underlearns when getting a demo from the sparse region. (Note that it does make sense to analyze the effect on the $\theta$ that we update towards, because in reward learning we care primarily about the $\theta$ that ends up with higher probability.)
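A quick numeric check of this conclusion, with made-up cluster sizes and exponentiated rewards: compute the likelihood-ratio factor in the odds update for a demonstration from the sparse cluster, under both priors.

```python
import numpy as np

# Hypothetical setup: 2 clusters, the second sparse; two reward hypotheses.
sizes = np.array([10.0, 1.0])
ER = {  # exponentiated rewards ER_theta(c) per cluster (made-up values)
    "theta1": np.array([1.0, 5.0]),   # theta1 fits the sparse cluster better
    "theta2": np.array([5.0, 1.0]),
}

def odds_factor(prior, c):
    """Likelihood ratio p(c | theta1) / p(c | theta2) that multiplies the
    prior odds in the Bayesian update."""
    p = prior / prior.sum()
    lik = {name: p[c] * er[c] / (p * er).sum() for name, er in ER.items()}
    return lik["theta1"] / lik["theta2"]

boltz = odds_factor(sizes, c=1)        # Boltzmann: p(c) proportional to |c|
less = odds_factor(np.ones(2), c=1)    # LESS: uniform p(c)
```

Here `boltz > less > 1`: both models update towards $\theta_1$, but Boltzmann updates more strongly, matching the overlearning claim.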

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}[\gamma^n]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability in both the original paper and the new one.

Then, for the new paper when using a baseline, the future task value V∗future(s,s′) is:

$$\mathbb{E}_{g,\, \tau \sim \pi_g,\, \tau' \sim \pi'_g}\left[\gamma^{\max(n, n')}\right]$$

where s′ is the baseline state and g is the future goal.

In a deterministic environment, this can be rewritten as:

$$\begin{aligned} V^*_{\text{future}}(s, s') &= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right] \\ &= \mathbb{E}_g[\min(R(s, g), R(s', g))] \\ &= \mathbb{E}_g[R(s', g) - \max(R(s', g) - R(s, g), 0)] \\ &= \mathbb{E}_g[R(s', g)] - \mathbb{E}_g[\max(R(s', g) - R(s, g), 0)] \\ &= \mathbb{E}_g[R(s', g)] - d_{RR}(s, s') \end{aligned}$$

Here, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term depends only on the baseline state $s'$, which in turn depends only on the number of timesteps. So for a fixed timestep, the first term is a constant.
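The key identities in the derivation, namely $\gamma^{\max(n, n')} = \min(\gamma^n, \gamma^{n'})$ in a deterministic environment and $\min(a, b) = b - \max(b - a, 0)$, are easy to sanity-check numerically (the path lengths and discount below are made up):

```python
gamma = 0.9
n, n_prime = 7, 4          # hypothetical shortest path lengths from s and s' to g
R_s, R_sprime = gamma**n, gamma**n_prime   # reachabilities gamma^n and gamma^n'

step1 = gamma ** max(n, n_prime)                 # the quantity inside E_g
step2 = min(R_s, R_sprime)                       # gamma^max = min of reachabilities
step3 = R_sprime - max(R_sprime - R_s, 0.0)      # min(a, b) = b - max(b - a, 0)
```

All three expressions agree, which is exactly the rewriting used to pull out the relative reachability penalty.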

The optimal value function in the new paper is (page 3, and using my notation of V∗future instead of their V∗i):

$$V^*(s_t) = \max_{a_t \in A}\left[r(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} p(s_{t+1} \mid s_t, a_t) V^*(s_{t+1}) + (1 - \gamma)\beta V^*_{\text{future}}\right].$$

This is the regular Bellman equation, but with the following augmented reward (here $s'_t$ is the baseline state at time $t$):

Terminal states:

$$\begin{aligned} r_{\text{new}}(s_t) &= r(s_t) + \beta V^*_{\text{future}}(s_t, s'_t) \\ &= r(s_t) - \beta d_{RR}(s_t, s'_t) + \beta \mathbb{E}_g[R(s'_t, g)] \end{aligned}$$

Non-terminal states:

$$\begin{aligned} r_{\text{new}}(s_t, a_t) &= r(s_t, a_t) + (1 - \gamma)\beta V^*_{\text{future}}(s_t, s'_t) \\ &= r(s_t, a_t) - (1 - \gamma)\beta d_{RR}(s_t, s'_t) + (1 - \gamma)\beta \mathbb{E}_g[R(s'_t, g)] \end{aligned}$$

For comparison, the original relative reachability reward is:

$$r_{RR}(s_t, a_t) = r(s_t) - \beta d_{RR}(s_t, s'_t)$$

The first and second terms in $r_{\text{new}}$ are very similar to the two terms in $r_{RR}$. The third term in $r_{\text{new}}$ depends only on the baseline.

All of these rewards so far are for *finite-horizon* MDPs (at least, that's what it sounds like from the paper; and if not, they could be made so anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent now gets that reward infinitely often, rather than just once as in the original MDP). Also define $k = \beta(1 - \gamma)$ for convenience. Then we have:

Non-terminal states:

$$r_{\text{new}}(s_t, a_t) = r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$$

$$r_{RR}(s_t, a_t) = r(s_t) - \beta\, d_{RR}(s_t, s'_t)$$

What used to be terminal states that are now self-loop states:

$$r_{\text{new}}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t) + k\, \mathbb{E}_g[R(s'_t, g)]$$

$$r_{RR}(s_t, a_t) = (1 - \gamma) r(s_t) - k\, d_{RR}(s_t, s'_t)$$

Note that all of the transformations I’ve done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We’re ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term, $k\, \mathbb{E}_g[R(s'_t, g)]$. This term depends only on the baseline $s'_t$. For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals g that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by $\beta$ instead of $k = \beta(1 - \gamma)$. Since $\beta$ (and thus $k$) is an arbitrary hyperparameter, the real difference is that in relative reachability, the weight on the penalty switches from $\beta$ in non-terminal states to the smaller $\beta(1 - \gamma)$ in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down sooner. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: Relative to relative reachability, the new paper's framing 1. removes the "extra" incentive to finish the task quickly, and 2. adds an extra reward term that does nothing for the starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline. (That said, the new paper starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)

The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_t \left[L(\theta_t) - L(\theta_{t-1})\right]$$

And then to decompose each step's change in training loss across specific parameters:

$$L(\theta_t) - L(\theta_{t-1}) = \int_{\vec{\theta}_{t-1}}^{\vec{\theta}_t} \nabla_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$

I’ve added vector arrows to emphasize that θ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We’ll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$$L(\theta_t) - L(\theta_{t-1}) = (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)).$$

(This is pretty standard, but I’ve included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_t) - L(\theta_{t-1}) = \sum_i \left(\theta^{(i)}_t - \theta^{(i)}_{t-1}\right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$$

So, for an individual parameter and an individual training step, we can define the contribution to the change in loss as $A^{(i)}_t = \left(\theta^{(i)}_t - \theta^{(i)}_{t-1}\right) \text{Average}_{t-1}^{t}(\nabla L(\theta))^{(i)}$.

So based on this, I'm going to define my own version of LCA, called $\text{LCA}_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (a vector). $\text{LCA}_{\text{Naive}}$ uses the approximation $\text{Average}_{t-1}^{t}(\nabla L(\theta)) \approx G_{t-1}$, giving $A^{(i)}_{t, \text{Naive}} = \left(\theta^{(i)}_t - \theta^{(i)}_{t-1}\right) G^{(i)}_{t-1}$. But the SGD update is given by $\theta^{(i)}_t = \theta^{(i)}_{t-1} - \alpha G^{(i)}_{t-1}$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t, \text{Naive}} = \left(-\alpha G^{(i)}_{t-1}\right) G^{(i)}_{t-1} = -\alpha \left(G^{(i)}_{t-1}\right)^2$, which is never positive, i.e. it predicts that every parameter learns in every iteration. This isn't surprising: we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!
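A tiny sketch (toy quadratic loss, made-up dimensions and learning rate) confirming that the naive per-parameter contributions are never positive under plain SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
lr = 0.1

def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * ||theta||^2.
    return theta

for _ in range(3):
    g = grad(theta)                     # G_{t-1}, the gradient used by the update
    theta_next = theta - lr * g         # plain SGD step
    # Naive LCA: approximate the average gradient by G_{t-1}.
    A_naive = (theta_next - theta) * g  # elementwise: -lr * g**2 <= 0
    theta = theta_next
```

Every entry of `A_naive` equals $-\alpha (G^{(i)}_{t-1})^2 \le 0$, so under this approximation no parameter can ever show positive LCA.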

Yet, the experiments in the paper sometimes show positive LCAs. What's up with that? There are a few differences between $\text{LCA}_{\text{Naive}}$ and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. $\text{LCA}_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.

3. $\text{LCA}_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_{t-1}$ and $\theta_t$ to reduce the approximation error.

I *think* those are the only differences (though it's always hard to tell whether some unmentioned detail creates another difference), which means that whenever the paper says "these parameters had positive LCA", that effect can be attributed to some combination of the above three factors.

----

Derivation of turning the path integral into a dot product with an average:

$$L(\theta_t) - L(\theta_{t-1}) = \lim_{n \to \infty} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta) \cdot \Delta\theta, \quad \text{where } \Delta\theta = \tfrac{1}{n}(\theta_t - \theta_{t-1})$$

$$= \lim_{n \to \infty} n \Delta\theta \cdot \left(\frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta)\right)$$

$$= \lim_{n \to \infty} (\theta_t - \theta_{t-1}) \cdot \left(\frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta)\right)$$

$$= (\theta_t - \theta_{t-1}) \cdot \text{Average}_{t-1}^{t}(\nabla L(\theta)),$$

where the average is defined as $\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \nabla L(\theta_{t-1} + k \Delta\theta)$.
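For a loss where the exact answer is computable (a toy quadratic with made-up dimensions), this identity can be checked numerically by approximating the average gradient with a finite sum along the straight line from $\theta_{t-1}$ to $\theta_t$:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T                       # toy positive semi-definite quadratic loss
L = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

th_prev, th_next = rng.normal(size=4), rng.normal(size=4)

# Average the gradient at n midpoints along the line from th_prev to th_next.
n = 1001
ts = (np.arange(n) + 0.5) / n
avg_grad = np.mean([grad(th_prev + t * (th_next - th_prev)) for t in ts], axis=0)

lhs = L(th_next) - L(th_prev)           # actual change in loss
rhs = (th_next - th_prev) @ avg_grad    # dot product with the average gradient
```

Because the gradient of a quadratic is linear along the path, the midpoint average is exact here and the two sides agree to machine precision.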

In my double descent newsletter, I said:

One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.

I don't buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are $N$ classes, the worst case is when the classes are all equally likely, in which case the average loss per data point is $-\ln(1/N) = \ln N \approx 2.3$ for $N = 10$ (as for CIFAR-10, which is what their experiments were done on). That is not a good loss value, but it does seem like regularization should already start having an effect there. This is a really stupid and simple classifier to learn, and we'd expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
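The loss of this frequency-predicting baseline is a one-liner to verify:

```python
import numpy as np

N = 10                        # number of classes, as in CIFAR-10
probs = np.full(N, 1.0 / N)   # predict the class frequencies (uniform here)

# Cross-entropy per example when the true class gets probability 1/N:
loss = -np.log(probs[0])      # = ln(N) ≈ 2.30
```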

There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy "overwhelms" the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, double descent can't happen in that setting. When training on just the L2 regularization term, the gradient descent update is:

$$w \leftarrow w - \lambda w = (1 - \lambda) w = c w \text{ for some constant } c.$$

For MLPs and CNNs with relu activations, if you multiply all the weights by a constant, the logits also get multiplied by a constant, no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can’t see a double descent on test error in this setting. (This doesn’t eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can’t happen in the “first train to zero error with cross-entropy and then regularize” setting.)
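This scale-invariance of the error is easy to check on a small random relu network (the weights and input below are made up; biases are omitted so the network is exactly positively homogeneous):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Hypothetical 2-layer relu MLP with no biases.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

logits = W2 @ relu(W1 @ x)
c = 0.7                                    # L2-only training scales every weight by c
scaled = (c * W2) @ relu((c * W1) @ x)     # logits scale by c**2, one factor per layer
```

Multiplying every weight by $c > 0$ scales the logits by $c^2$ here, so the predicted class, and hence the train/test error, is unchanged.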

The paper tests with CNNs, but doesn’t mention what activation they use. Still, I’d find it very surprising if double descent only happened for a particular activation function.