# Jeremy Gillen

Karma: 596
• Section 4 then showed how those initial results extend to the case of sequential decision making.

[...]

If she’s a resolute chooser, then sequential decisions reduce to a single non-sequential decision.

Ah thanks, this clears up most of my confusion, I had misunderstood the intended argument here. I think I can explain my point better now:

I claim that proposition 3, when extended to sequential decisions with a resolute decision theory, shouldn’t be interpreted the way you interpret it. The meaning changes when you make A and B into sequences of actions.

Let’s say action A is a list of 1000000 particular actions (e.g. 1000000 small-edits) and B is a list of 1000000 particular actions (e.g. 1 improve-technology, then 999999 amplified-edits).[1]

Proposition 3 says that A is equally likely to be chosen as B (for randomly sampled desires). This is correct. Intuitively this is because A and B are achieving particular outcomes and desires are equally likely to favor “opposite” outcomes.

However this isn’t the question we care about. We want to know whether action-sequences that contain “improve-technology” are more likely to be optimal than action-sequences that don’t contain “improve-technology”, given a random desire function. This is a very different question to the one proposition 3 gives us an answer to.

Almost all optimal action-sequences could contain “improve-technology” at the beginning, while any two particular action sequences are equally likely to be preferred to the other on average across desires. These two facts don’t contradict each other. The first fact is true in many environments (e.g. the one I described[2]) and this is what we mean by instrumental convergence. The second fact is unrelated to instrumental convergence.
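The two facts above can be checked side by side in a toy setting. What follows is a minimal sketch under loud assumptions, not the environment from the post: the state is a single integer, the horizon is 4 steps, the action names `left`, `right` and `tech` are hypothetical, and a "desire" is an i.i.d. Gaussian utility over final states. Using `tech` switches later moves from steps of 1 to steps of 3, and the number of available actions never changes.

```python
import itertools
import random

ACTIONS = ("left", "right", "tech")  # hypothetical action names
HORIZON = 4


def final_state(seq, small=1, big=3):
    """Walk on the integers; 'tech' switches later moves from small to big steps."""
    pos, step = 0, small
    for action in seq:
        if action == "tech":
            step = big  # using it again is a no-op
        elif action == "left":
            pos -= step
        else:
            pos += step
    return pos


sequences = list(itertools.product(ACTIONS, repeat=HORIZON))
all_states = {final_state(s) for s in sequences}
no_tech_states = {final_state(s) for s in sequences if "tech" not in s}


def sample_utility(rng):
    """One random 'desire': an i.i.d. Gaussian utility for each final state."""
    return {state: rng.gauss(0.0, 1.0) for state in all_states}


def simulate(trials=20000, seed=0):
    rng = random.Random(seed)
    seq_a = ("left", "left", "left", "left")     # one particular no-tech sequence
    seq_b = ("tech", "right", "right", "right")  # one particular tech sequence
    a_wins = needs_tech = 0
    for _ in range(trials):
        u = sample_utility(rng)
        # Fact 2: compare two particular sequences.
        a_wins += u[final_state(seq_a)] > u[final_state(seq_b)]
        # Fact 1: is the best final state unreachable without 'tech'?
        best = max(all_states, key=u.get)
        needs_tech += best not in no_tech_states
    return a_wins / trials, needs_tech / trials


pairwise, tech_fraction = simulate()
print(f"P(A preferred to B)        ~ {pairwise:.3f}")
print(f"P(optimum requires 'tech') ~ {tech_fraction:.3f}")
```

In this setup the first number should come out close to one half, while the second should sit well above one half, because most reachable final states can only be reached by a sequence that includes `tech`. The size of the gap depends on the assumed distribution over desires.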

I think the error might be coming from this definition of instrumental convergence:

could we nonetheless say that she’s got a better than probability of choosing from a menu of acts?

When is a sequence of actions, this definition makes less sense. It’d be better to define it as something like “from a menu of initial actions, she has a better than probability of choosing a particular initial action ”.

I’m not entirely sure what you mean by “model”, but from your use in the penultimate paragraph, I believe you’re talking about a particular decision scenario Sia could find herself in.

Yep, I was using “model” to mean “a simplified representation of a complex real world scenario”.

1. ^

For simplicity, we can make this scenario a deterministic known environment, and make sure the number of actions available doesn’t change if “improve-technology” is chosen as an action. This way neither of your biases apply.

2. ^

E.g. we could define a “small-edit” as to any location in the state vector. Then an “amplified-edit” as to any location. This preserves the number of actions, and makes the advantage of “amplified-edit” clear. I can go into more detail if you like, this does depend a little on how we set up the distribution over desires.

• I read about half of this post when it came out. I didn’t want to comment without reading the whole thing, and reading the whole thing didn’t seem worth it at the time. I’ve come back and read it because Dan seemed to reference it in a presentation the other day.

The core interesting claim is this:

My conclusion will be that most of the items on Bostrom’s laundry list are not ‘convergent’ instrumental means, even in this weak sense. If Sia’s desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition.

This conclusion doesn’t follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list.

The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.

Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don’t see why you didn’t just remove Newcomb-like situations from the model? Instrumental convergence will show up regardless of the exact decision theory used by the agent.

Here’s my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having “random” goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high entropy distribution over functions) on . Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you’d need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it’s very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.

The assumptions I made about the environment are inspired by the real world environment, and the assumptions I’ve made about the desires are similar to yours, maximally uninformative over trajectories.

• I’m not sure how to implement the rule “don’t pay people to kill people”. Say we implement it as a utility function over world-trajectories, and any trajectory that involves any killing causally downstream of your actions gets MIN_UTILITY. This makes probabilistic tradeoffs, so it’s probably not what we want. If we use negative infinity instead, then it can’t ever take actions in a large or uncertain world. We need to add the patch that the agent must have been aware at the time of taking its actions that the actions had chance of causing murder. I think these are vulnerable to blackmail because you could threaten to cause murders that are causally-downstream-from-its-actions.

Maybe I’m confused and you mean “actions that pattern match to actually paying money directly for murder”, in which case it will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.

If the ultimate patch is “don’t take any action that allows unprincipled agents to exploit you for having your principles”, then maybe there aren’t any edge cases. I’m confused about how to define “exploit” though.

• You leave money on the table in all the problems where the most efficient-in-money solution involves violating your constraint. So there’s some selection pressure against you if selection is based on money.
We can (kinda) turn this into a money-pump by charging the agent a fee to violate the constraint for it. Whenever it encounters such a situation, it pays you a fee and you do the killing.
Whether or not this counts as a money pump, I think it satisfies the reasons I actually care about money pumps, which are something like “adversarial agents can cheaply construct situations where I pay them money, but the world isn’t actually different”.

# AISC team report: Soft-optimization, Bayes and Goodhart

27 Jun 2023 6:05 UTC
35 points
• With my linear algebra being terrible, I was confused by this:

Until I realized that and are basis vectors and are coordinates on a unit circle, because , and all have length 1.

• Good point on CDT, I forgot about this. I was using a more specific version of reflective stability.

> - wait.. that doesn’t seem right..?

Yeah this is also my reaction. Assuming that bound seems wrong.

I think there is a problem with thinking of as a known-to-be-acceptably-safe agent, because how can you get this information in the first place? Without running that agent in the world? To construct a useful estimate of the expected value of the “safe”-agent, you’d have to run it lots of times, necessarily sampling from its most dangerous behaviours.

Unless there is some other non-empirical way of knowing an agent is safe?

Yeah I was thinking of having large support of the base distribution. If you just rule-in behaviours, this seems like it’d restrict capabilities too much.

• Quantilizing can be thought of as maximizing a lower bound on the expected true utility, where you know that your true utility is close to your proxy utility function in some region , such that . If we shape this closeness assumption a bit differently, such that the approximation gets worse faster, then sometimes it can be optimal to cut off the top of the distribution (as I did here, see some of the diagrams for quantilizers with the top cut off, I’ll paste one below).

The reason normal quantilizers don’t do that is that they are minimizing the distance between and the action distribution, by a particular measure that falls out of the proof (see above link), which allows the lower bound to be as high as possible. Essentially it’s minimizing distribution shift, which allows a better generalization bound.
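For reference, here is a minimal sketch of a standard q-quantilizer, assuming the base distribution can be sampled and the proxy utility is a scalar function of the action. The names `quantilize`, `base_sampler` and `proxy_utility` are illustrative, and the top-q fraction is approximated from a finite batch of samples rather than computed exactly.

```python
import random


def quantilize(base_sampler, proxy_utility, q, n_samples=1000, rng=None):
    """Approximate q-quantilizer: pick uniformly from the top q fraction of base samples."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=proxy_utility, reverse=True)
    keep = max(1, int(q * n_samples))  # size of the retained top slice
    return rng.choice(candidates[:keep])


# Toy usage: actions are real numbers, the base distribution is a standard
# normal, and the proxy utility is the action itself (all assumptions).
rng = random.Random(0)
base_sampler = lambda r: r.gauss(0.0, 1.0)
proxy = lambda action: action

mild = [quantilize(base_sampler, proxy, q=0.1, n_samples=200, rng=rng) for _ in range(200)]
unoptimized = [quantilize(base_sampler, proxy, q=1.0, n_samples=200, rng=rng) for _ in range(200)]

print("mean proxy value, q=0.1:", sum(mild) / len(mild))
print("mean proxy value, q=1.0:", sum(unoptimized) / len(unoptimized))
```

With `q=1.0` this just samples from the base distribution, and as `q` shrinks it approaches picking the single best sampled action, which is the largest shift away from the base distribution.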

I think this distribution shift perspective is one way of explaining why we need randomization at all. A delta function is a bigger distribution shift than a distribution that matches the shape of .
But the next question is why are we even in a situation where we need to deal with the worst case across possible true utility functions? One story is that we are dealing with an optimizer that is maximizing trueutility + error, and one way to simplify that is to model it as max min trueutility - error, where the min only controls the error function within the restrictions of the known bound.

I’m not currently super happy with that story and I’m keen for people to look for alternatives, or variations of soft optimization with different types of knowledge about the relationship between the proxy and true utility. Because intuitively it does seem like taking the 99%ile action should be fine under slightly different assumptions.

One example of this is if we know that , where is some heavy tailed noise, and we know the distribution of (and ), then we can calculate the actual optimal percentile action to take, and we should deterministically take that action. But this is sometimes quite sensitive to small errors in our knowledge about the distribution of and particularly . My AISC team has been testing scenarios like this as part of their research.

• I really like infrafunctions as a way of describing the goals of mild optimizers. But I don’t think you’ve described the correct reasons why infrafunctions help with reflective stability. The main reason is you’ve hidden most of the difficulty of reflective stability in the bound.

My core argument is that a normal quantilizer is reflectively stable[1] if you have such a bound. In the single-action setting, where it chooses a policy once at the beginning and then follows that policy, it must be reflectively stable because if the chosen policy constructs another optimizer that leads to low true utility, then that policy must have very low base probability (or the bound can’t have been true). In a multiple-action setting, we can sample each action conditional on the previous actions, according to the quantilizer distribution, and this will be reflectively stable in the same way (given the bound).

Adding in observations doesn’t change anything here if we treat U and V as being expectations over environments.

The way you’ve described reflective stability in the dynamic consistency section is an incentive to keep the same utility infrafunction no matter what observations are made. I don’t see how this is necessary or even strongly related to reflective stability. Can’t we have a reflectively stable CDT agent?

Two core difficulties of reflective stability

I think the two core difficulties of reflective stability are 1) getting the bound (or similar) and 2) describing an algorithm that lazily does a ~minimal amount of computation for choosing the next few actions. I expect realistic agents need 2 for efficiency. I think utility infrafunctions do help with both of these, to some extent.

The key difficulty of getting a tight bound with normal quantilizers is that simple priors over policies don’t clearly distinguish policies that create optimizers. So there’s always a region at the top where “create an optimizer” makes up most of the mass. My best guess for a workaround for this is to draw simple conservative OOD boundaries in state-space and policy-space (the base distribution is usually just over policy space, and is predefined). When a boundary is crossed, it lowers the lower bound on the utility (gives Murphy more power). These boundaries need to be simple so that they can be learned from relatively few (mostly in-distribution) examples, or maybe from abstract descriptions. Being simple and conservative makes them more robust to adversarial pressure.

Your utility infrafunction is a nice way to represent lots of simple out-of-distribution boundaries in policy-space and state-space. This is much nicer than storing this information in the base distribution of a quantilizer, and it also allows us to modulate how much optimization pressure can be applied to different regions of state or policy-space.

With 2, an infrafunction allows on-the-fly calculation that the consequences of creating a particular optimizer are bad. It can do this as long as the infrafunction treats the agent’s own actions and the actions of child-agents as similar, or if it mostly relies on OOD states as the signal that the infrafunction should be uncertain (have lots of low spikes), or some combination of these. Since the max-min calculation is the motivation for randomizing in the first place, an agent that uses this will create other agents that randomize in the same way. If the utility infrafunction is only defined over policies, then it doesn’t really give us an efficiency advantage because we already had to calculate the consequences of most policies when we proved the bound.

One disadvantage, which I think can’t be avoided, is that an infrafunction over histories is incentivized to stop humans from doing actions that lead to out-of-distribution worlds, whereas an infrafunction over policies is not (to the extent that stopping humans doesn’t itself cross boundaries). This seems necessary because it needs to consider the consequences of the actions of optimizers it creates, and this generalizes easily to all consequences since it needs to be robust.

1. ^

Where I’m defining reflective stability as: If you have an anti-Goodhart modification in your decision process (e.g. randomization), ~never follow a plan that indirectly avoids the anti-Goodhart modification (e.g. making a non-randomized optimizer).

The key difficulty here being that the default pathway for achieving a difficult task involves creating new optimization procedures, and by default these won’t have the same anti-Goodhart properties as the original.

• Thanks!

1. I think it’s more accurate to say it’s incomplete. And the standard generalization bound math doesn’t make that prediction as far as I’m aware, it’s just the intuitive version of the theory that does. I’ve been excited by the small amount of singular learning theory stuff I’ve read. I’ll read more, thanks for making that page.

2. Fantastic!

• No. Justin knows roughly the content of the intended future posts, but after I started writing I didn’t feel like I understood it well enough to distill it properly, and I lost motivation. Since then I’ve been too busy.
I’ll send you the notes that we had after Justin explained his ideas to me.

• The paperclip metaphor is not very useful if interpreted as “humans tell the AI to make paperclips, and it does that, and the danger comes from doing exactly what we said because we said a dumb goal”.

There is a similar-ish interpretation, which is good and useful, which is “if the AI is going to do exactly what you say, you have to be insanely precise when you tell it what to do, otherwise it will Goodhart the goal.” The danger comes from Goodharting, rather than humans telling it a dumb goal. The paperclip example can be used to illustrate this, and I think this is why it’s commonly used.

And in the first tweet (with inner alignment) he is referring to the fact that we will have very imprecise (think evolution-like) methods of communicating a goal to an AI-in-training.

So apparently he intended the metaphor to communicate that the AI-builders weren’t trying to set “make paperclips” as the goal, they were aiming for a more useful goal and “make paperclips” happened to be the goal that it latched on to. Tiny molecular squiggles is better here because it’s a more realistic optimum of an imperfectly learned goal representation.

• On it always being a rescaled subset: Nice! This explains the results of my empirical experiments. Jessica made a similar argument for why quantilizers are optimal, but I hadn’t gotten around to trying to adapt it to this slightly different situation. It makes sense now that the maximin distribution is like quantilizing against the value lower bound, except that the value lower bound changes if you change the minimax distribution. This explains why some of the distributions are exactly quantilizers but some are not: it depends on whether that value lower bound drops lower than the start of the policy distribution.

• On planning: Yeah it might be hard to factorize the final policy distribution. But I think it will be easy to approximately factorize the prior in lots of different ways. And I’m hopeful that we can prove that some approximate factorizations maintain the same q value, or maybe only have a small impact on the q value. Haven’t done any work on this yet.

• If it turns out we need near-exact factorizations, we might still be able to use sampling techniques like rejection sampling to correct an approximate sampling distribution, because we have easy access to the correct density of samples that we have generated (just prior/q), we just need an approximate distribution to use for getting high value samples more often, which seems straightforward.
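A rough sketch of that correction step, under assumptions that are not from the original discussion: policies are indexed 0 to 9, the prior is uniform, the target density is prior/q on the top-q policies by value and zero elsewhere, and the approximate proposal simply weights higher-value policies more. All names here are illustrative.

```python
import random


def rejection_sample(target_density, proposal_sample, proposal_density, bound, rng):
    """Draw one exact sample from target_density using draws from the proposal.

    Requires target_density(x) <= bound * proposal_density(x) for every x.
    """
    while True:
        x = proposal_sample(rng)
        # Accept with probability target(x) / (bound * proposal(x)).
        if rng.random() * bound * proposal_density(x) < target_density(x):
            return x


N = 10   # number of policies (toy assumption)
Q = 0.3  # fraction of prior mass kept
prior = [1.0 / N] * N
value = list(range(N))  # policy i has value i (toy assumption)
top = set(sorted(range(N), key=value.__getitem__, reverse=True)[: round(Q * N)])


def target_density(i):
    """Correct density: prior / q on the selected subset, zero elsewhere."""
    return prior[i] / Q if i in top else 0.0


weights = [i + 1 for i in range(N)]  # approximate proposal favouring high value
total = sum(weights)


def proposal_density(i):
    return weights[i] / total


def proposal_sample(rng):
    return rng.choices(range(N), weights=weights)[0]


bound = max(target_density(i) / proposal_density(i) for i in range(N))

rng = random.Random(0)
draws = [
    rejection_sample(target_density, proposal_sample, proposal_density, bound, rng)
    for _ in range(3000)
]
print({i: draws.count(i) / len(draws) for i in sorted(top)})
```

The proposal only needs to cover the target's support and put enough mass on high-value policies to keep the acceptance rate reasonable. The accepted samples follow the target density exactly, however rough the proposal is.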

• Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I’ll remove that paragraph. Thanks for the links, I hadn’t read those, and I appreciate the pseudocode.

I think most likely I still don’t understand what you mean by grader-optimizer, but it’s probably better to discuss on your post after I’ve spent more time going over your posts and comments.

My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?

• I also think that it’s probably worth considering soft optimization to the old Impact Measures work from this community—in particular, I think it’d be interesting to cast soft optimization methods as robust optimization, and then see how the critiques raised against impact measures (e.g. in this comment or this question) apply to soft optimization methods like RL-KL or the minimax objective you outline here.

Thanks for linking these, I hadn’t read most of these. As far as I can tell, most of the critiques don’t really apply to soft optimization. The main one that does is Paul’s “drift off the rails” thing. I expect we need to use the first AGI (with soft opt) to help solve alignment in a more permanent and robust way, then use that to make a more powerful AGI that helps avoid “drifting off the rails”.

In my understanding, impact measures are an important part of the utility function that we don’t want to get wrong, but not much more than that. Whereas soft optimization directly removes Goodharting of the utility function. It feels like the correct formalism for attacking the root of that problem. Whereas impact measures just take care of a (particularly bad) symptom.

Abram Demski has a good answer to the question you linked that contrasts mild optimization with impact measures, and it’s clear that mild optimization is preferred. And Abram actually says:

An improvement on this situation would be something which looked more like a theoretical solution to Goodhart’s law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities (“this is how you get the most of what you want”), allowing ML researchers to develop algorithms orienting toward this.

This is exactly what I’ve got.

• I agree that it’s good to try to answer the question, under what sort of reliability guarantee is my model optimal, and it’s worth making the optimization power vs robustness trade off explicit via toy models like the one you use above.

That being said, re: the overall approach. Almost every non degenerate regularization method can be thought of as “optimal” wrt some robust optimization problem (in the same way that non degenerate optimization can be trivially cast as Bayesian optimization) -- e.g. the RL-KL objective with respect to some is optimal for the following minimax problem:

for some . So the question is not so much “do we cap the optimization power of the agent” (which is a pretty common claim!) but “which way of regularizing agent policies more naturally captures the robust optimization problems we want solved in practice”.

Yep, agreed. Except I don’t understand how you got that equation from RL with KL penalties, can you explain that further?

I think the most novel part of this post is showing that this robust optimization problem (maximizing average utility while avoiding selection for upward errors in the proxy) is the one we want to solve, and that it can be done with a bound that is intuitively meaningful and can be determined without just guessing a number.

(It’s also worth noting that an important form of implicit regularization is the underlying capacity/capability of the model we’re using to represent the policy.)

Yeah I wouldn’t want to rely on this without a better formal understanding of it though. KL regularization I feel like I understand.

• I’ve probably misunderstood your comment, but I think this post already does most of what you are suggesting (except for the very last bit about including human feedback)? It doesn’t assume the human’s utility function is some real thing that it will update toward, it has a fixed distribution over utility throughout deployment. There’s no mechanism for updating that distribution, so it can’t become arbitrarily certain about the utility function.

And that distribution isn’t treated like epistemic uncertainty, it’s used to find a worst case lower bound on utility?

• Good point, policies that have upward errors will still be preferentially selected for (a little). However, with this approach, the amount of Goodharting should be constant as the proxy quality (and hence optimization power) scales up.

I agree with your second point, although I think there’s a slight benefit over original quantilizers because is set theoretically, rather than arbitrarily by hand. Hopefully this makes it less tempting to mess with it.

# Soft optimization makes the value target bigger

2 Jan 2023 16:06 UTC
114 points