Makes perfect sense, thanks!

# Vivek Hebbar

“Well, what if I take the variables that I’m given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I’m given, and consider the space of all partitions on that product of variables that I’m given; and each one of those partitions will be its own variable.

How can a partition be a variable? Should it be “part” instead?

**ETA: Koen recommends reading “Counterfactual Planning in AGI Systems” first (or instead of) “Corrigibility with Utility Preservation”.**

Update: I started reading your paper “Corrigibility with Utility Preservation”.^{[1]} My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining “superintelligent” as “optimal utility maximizer”.

Quick thoughts after reading less than half:

AFAICT,^{[2]} this is a mathematical solution to corrigibility in a toy problem, and *not* a solution to corrigibility in real systems. Nonetheless, it’s a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.^{[3]}

Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that “the technical problem [is] mostly solved”, this may or may not be true for the narrow sense (like “corrigibility as a theoretical outer-objective problem in formally-specified environments”), but seems false and misleading for the broader practical sense (“knowing how to make an AGI corrigible in real life”).^{[4]}

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt^{[5]}:

“In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent’s decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model.”

- ^
Btw, your writing is admirably concrete and clear.

Errata: Subscripts seem to be broken on page 9, which significantly hurts the readability of the equations. Also, there is a double typo “I this paper, we the running example of a toy universe” on page 4.

- ^
Assuming the idea is correct

- ^
Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?

- ^
I’m not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.

- ^
Portions in [brackets] are insertions/replacements by me


To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:

The agent’s allegiance is to some idealized utility function V (like CEV). The agent’s internal evaluator Eval is “trying” to approximate V by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on V. Hence, Eval concludes that V(plan) is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.

“Improving Eval” makes sense because Eval will gladly replace itself with a new evaluator Eval′ if it believes that Eval′ is a better approximation of the idealized utility function (and hence replacing itself will cause the outcome to score better on that idealized function)
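To make the “adversarial examples to Eval” step concrete, here is a toy numerical sketch. All specifics (the plan space, the form of V, the error model, and the widths) are my own made-up illustration: Eval is modeled as V plus a persistent heuristic error, and widening the argmax search drives Eval’s score of its own winner far above anything V can actually deliver.

```python
import random

random.seed(0)

# Hypothetical toy model: plans are indices; V is the idealized utility
# (bounded in [0, 1]); Eval = V + a fixed-per-plan heuristic error.
NUM_PLANS = 200_000
true_V = [random.random() for _ in range(NUM_PLANS)]
eval_err = [random.gauss(0, 1) for _ in range(NUM_PLANS)]

def Eval(plan):
    return true_V[plan] + eval_err[plan]

def search_and_argmax(width):
    """Argmax w.r.t. Eval over `width` randomly sampled plans."""
    return max(random.sample(range(NUM_PLANS), width), key=Eval)

results = {}
for width in (10, 1_000, 100_000):
    best = search_and_argmax(width)
    results[width] = (Eval(best), true_V[best])
    print(width, round(Eval(best), 2), round(true_V[best], 2))

# Eval's estimate of its own argmax rises past the best possible V (1.0):
# wide search increasingly selects adversarial examples to Eval, which is
# exactly why Eval scores the plan "search widely and argmax" as low.
```

The design choice here is that Eval’s errors are fixed per plan (not resampled), so a wide search can actively seek them out; that is what makes “Eval distrusts its own argmax” coherent.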

Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for “value child”.

Yeah, the right column should obviously be all 20s. There must be a bug in my code

^{[1]} I like to think of the argmax function as something that takes in a distribution on probability distributions (on the same underlying set, but with different sigma algebras), and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis (hypothesis 3):

If I add this into the mixture with a small weight, then the middle column is still nearly zero. But I can now ask for the probability of the event corresponding to the center square, and I get back an answer very close to zero. Where did this confidence come from?

I guess I’m basically wondering what this procedure is aspiring to be. Some candidates I have in mind:

Extension to the coarse case of regular hypothesis mixing (where we go from P(w) and Q(w) to αP(w) + (1−α)Q(w))

Extension of some kind of Bayesian update-flavored thing where we go to P(w)·Q(w) then renormalize

ETA: seems more plausible than

Some kind of “aggregation of experts who we trust a lot unless they contradict each other”, which isn’t cleanly analogous to either of the above
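For concreteness, candidates 1 and 2 above can be contrasted on a toy example. The numbers, and the simplifying assumption that both hypotheses live on the same three outcomes, are mine; the actual merge operation from the post may be neither of these.

```python
from fractions import Fraction as F

# Two hypotheses over the same three outcomes (made-up numbers):
P = {"L": F(6, 10), "M": F(3, 10), "R": F(1, 10)}
Q = {"L": F(1, 10), "M": F(3, 10), "R": F(6, 10)}

def mixture(P, Q, a):
    """Candidate 1: regular hypothesis mixing, a*P + (1-a)*Q."""
    return {w: a * P[w] + (1 - a) * Q[w] for w in P}

def product_pool(P, Q):
    """Candidate 2: Bayesian-update-flavored P*Q, then renormalize."""
    unnorm = {w: P[w] * Q[w] for w in P}
    Z = sum(unnorm.values())
    return {w: p / Z for w, p in unnorm.items()}

print(mixture(P, Q, F(1, 2)))  # "M" keeps its weight: 3/10
print(product_pool(P, Q))      # "M" is amplified where the hypotheses agree: 3/7
```

Even on this tiny example the two operations disagree sharply about the middle cell, which is why pinning down which (if either) the coarse merge extends seems important.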

Even in case 3, the near-zeros are really weird. The only cases I can think of where it makes sense are things like “The events are outcomes of a quantum process. Physics technique 1 creates hypothesis 1, and technique 2 creates hypothesis 2. Both techniques are very accurate, and the uncertainty they express is due to fundamental unknowability. Since we know both tables are correct, we can confidently rule out the middle column, and thus rule out certain events in hypothesis 3.”

But more typically, the uncertainty is in the maps of the respective hypotheses, not in the territory, in which case the middle zeros seem unfounded. And to be clear, the reason it seems like a real issue^{[2]} is that when you add in hypothesis 3 you have events in the middle which you can query, but the values can stay arbitrarily close to zero if you add in hypothesis 3 with low weight.

Now, let’s consider the following modification: Each hypothesis is no longer a distribution on the underlying set of worlds, but instead a distribution on some coarser partition of it. The merge is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what does the merge operation represent, which causes this to make sense? (Maybe your reply is just “wait for the next post”)

most egregores/epistemic networks, which I’m completely reliant upon, are much smarter than me, so that can’t be right

**Egregore smiles**

Another way of looking at this question: Arithmetic rationality is shift invariant, so you don’t have to know your total balance to calculate expected values of bets. Whereas for geometric rationality, you need to know where the zero point is, since it’s not shift invariant.
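A minimal numerical illustration of the shift-(non)invariance point; the particular bet and wealth levels are my own example.

```python
import math

# A 50/50 bet: win +$10 or lose -$10.
outcomes = [+10, -10]

def ev_delta(wealth):
    """Arithmetic rationality: change in expected wealth (shift invariant)."""
    return sum(0.5 * o for o in outcomes)

def log_delta(wealth):
    """Geometric rationality: change in expected log-wealth (needs the zero point)."""
    return sum(0.5 * (math.log(wealth + o) - math.log(wealth)) for o in outcomes)

print(ev_delta(20), ev_delta(1000))    # identical regardless of total balance
print(log_delta(20), log_delta(1000))  # very different: roughly -0.144 vs -5e-05
```

The arithmetic value of the bet never looks at `wealth` at all, while the geometric value changes by orders of magnitude depending on how far the agent is from zero.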

Which is equivalent to

**Some results related to logarithmic utility and stock market leverage** (I derived these after reading your previous post, but I think it fits better here):

**Tl;dr:** We can derive the optimal stock market leverage for an agent with utility logarithmic in money. We can also back-derive a utility function from any constant leverage^{[1]}, giving us a nice class of utility functions with different levels of risk-aversion. Logarithmic utility is recovered as a special case, and has additional nice properties which the others may or may not have.

For an agent investing in a stock whose “instantaneous” price movements are i.i.d. with finite moments:

Suppose, for simplicity, that the agent’s utility function is over the *amount of money they have in the next timestep* (as opposed to more realistic cases like “amount they have 20 years from now”). If U(money) = log(money), then:

The optimal leverage for the agent to take is given by the formula k = μ/s², where μ is the mean of the “instantaneous” price movements and s is the standard deviation of the same. Derivation here. By my calculations, this implies a leverage of about 1.8 on the S&P 500.
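As a quick sanity check on k = μ/s², here is a sketch under the standard second-order (Itô-style) approximation to log-growth; the μ and s values below are made up, not actual S&P 500 statistics.

```python
# Expected log-growth rate at leverage k, to second order:
#   g(k) = k*mu - (k*s)**2 / 2,   which is maximized at k* = mu / s**2.
mu, s = 0.05, 0.17  # hypothetical mean return and volatility

def log_growth(k):
    return k * mu - (k * s) ** 2 / 2

k_grid = [i / 100 for i in range(501)]   # leverage 0.00 .. 5.00
k_best = max(k_grid, key=log_growth)
print(k_best, mu / s ** 2)               # grid argmax agrees with mu/s^2
```

Note the quadratic penalty term: doubling leverage doubles the drift but quadruples the volatility drag, which is why the optimum is finite.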

What if we instead suppose the agent prefers some constant leverage k, and try to infer its utility function?

The relevant differential equation is −m·U″(m)/U′(m) = μ/(k·s²), i.e. constant relative risk aversion.

This is solved by U(m) = m^(1−γ)/(1−γ) (with γ = μ/(k·s²)) for k ≠ μ/s², and by U(m) = log(m) for k = μ/s². You can play with the solutions here.
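A numerical spot check of the constant-leverage property (my own sketch; the expression k(m) = −U′(m)·μ/(m·U″(m)·s²) is the standard first-order condition for the locally optimal leverage, and the parameter values are arbitrary):

```python
mu, s, g = 0.05, 0.17, 2.0   # hypothetical returns; relative risk aversion g

def U(m):
    return m ** (1 - g) / (1 - g)   # CRRA utility (g != 1)

def d1(f, m, h=1e-5):
    """Central-difference first derivative."""
    return (f(m + h) - f(m - h)) / (2 * h)

def d2(f, m, h=1e-4):
    """Central-difference second derivative."""
    return (f(m + h) - 2 * f(m) + f(m - h)) / h ** 2

def optimal_k(m):
    # First-order condition for leverage: k = -U'(m)*mu / (m*U''(m)*s**2)
    return -d1(U, m) * mu / (m * d2(U, m) * s ** 2)

# The preferred leverage is the same at every wealth level: mu/(g*s**2)
print([round(optimal_k(m), 3) for m in (1.0, 5.0, 50.0)])
```

Because the preferred leverage is wealth-independent for this family, “back out the utility from a constant leverage” is self-consistent; g = 1 recovers the log case with k = μ/s².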

Now suppose instead that the agent’s utility function is “logarithmic withdrawals, time-discounted exponentially” -- U = ∫₀^∞ e^(−ρt) log(w(t)) dt, where w(t) is the absolute^{[2]} rate of withdrawal at time t. It turns out that optimal leverage is still constant, and is still given by the same formula k = μ/s². Furthermore, the optimal rate of withdrawal is a constant fraction ρ of current wealth, regardless of what happens.

Things probably don’t work out as cleanly for the non-logarithmic case.

[Disclaimer: This is not investment advice.]

- ^
Caveats:

1. This assumption of constant leverage is pretty arbitrary, so there’s no normative or descriptive force to the class of utility functions we derive from it

2. We have to make an unrealistic assumption that the utility function is over $$ at the next timestep, rather than further in the future. In the log case, these kinds of assumptions tend not to change anything, but I’m not sure whether the general case is as clean.

- ^
i.e. in dollars, not percents

Results on logarithmic utility and stock market leverage: https://www.lesswrong.com/posts/DMxe4XKXnjyMEAAGw/the-geometric-expectation?commentId=yuRie8APN8ibFmRJD

A framing I wrote up for a debate about “alignment tax”:

**“Alignment isn’t solved” regimes:**

1a. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute

1b. We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one

**“Alignment tax” regimes:**

2a. We can make an aligned AGI, but it requires a compute overhead in the range 1% − 100x. Furthermore, the situation remains multipolar and competitive for a while.

2b. The alignment tax is <0.001%, so it’s not a concern.

2c. The leading coalition is further ahead than the alignment tax amount, and can and will execute a pivotal act, thus ending the risk period and rendering the alignment tax irrelevant.

A person whose mainline is {1a --> 1b --> 2b or 2c} might say “alignment is unsolved, solving it is mostly a discrete thing, and alignment taxes and multipolar incentives aren’t central”

Whereas someone who thinks we’re already in 2a might say “alignment isn’t hard, the problem is incentives and competitiveness”

Someone whose mainline is {1a --> 2a} might say “We need to both ‘solve alignment at all’ AND either get the tax to be really low or do coordination. Both are hard, and both are necessary.”

# Vivek Hebbar’s Shortform

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?

Good point

`Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` is the probability that, for a random policy `π∈ξ`, that policy has worse utility than the policy `G*` its program dictates; in essence, how good `G`’s policies are compared to random policy selection

What prior over policies?

given `g(G|U)`, we can **infer the probability that an agent `G` has a given utility function `U`**, as

`Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]`

where `∝` means “is proportional to” and `K(U)` is the Kolmogorov complexity of utility function `U`.

Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of “actions” is greater than the number of bits it takes to specify my brain^{[1]}, it seems like it would conclude that my utility function is something like “1 if {acts exactly like [insert exact copy of my brain] would}, else 0”.

- ^
Idk if this is plausible
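Here is a small toy computation of this worry. Everything is made up for illustration: a 12-action “brain”, hand-assigned complexities K, and three candidate utility functions. The exact-copy indicator wins the score `2^-K(U) / Pr[U(π) ≥ U(G*)]` as soon as its complexity is no more than the number of actions.

```python
from itertools import product

n = 12                                            # number of binary "actions"
policies = list(product([0, 1], repeat=n))
g_star = (0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0)     # the agent's actual actions

# Candidate utilities with hand-assigned complexities K (in bits, made up):
candidates = {
    "first_action_is_0": (lambda pi: 1 if pi[0] == 0 else 0, 5),
    "count_zeros":       (lambda pi: sum(1 for a in pi if a == 0), 6),
    "exact_copy":        (lambda pi: 1 if pi == g_star else 0, n),
}

def score(U, K):
    # Pr[U] ∝ 2^-K(U) / Pr_{π∈ξ}[U(π) ≥ U(G*)], with ξ uniform (max-entropy)
    beats = sum(1 for pi in policies if U(pi) >= U(g_star))
    return 2 ** -K / (beats / len(policies))

scores = {name: score(U, K) for name, (U, K) in candidates.items()}
for name, sc in scores.items():
    print(name, sc)

# Only g_star itself satisfies the indicator, so its Pr term is 2^-n,
# which cancels K ≈ n and dominates the smoother candidate utilities.
```

The effect depends on `g_star` not being exactly optimal for any of the simpler utilities; if it were, those would share the tiny Pr term without the complexity penalty.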

Consider this quote by Alan Watts:

*There are basically two kinds of philosophy. One’s called prickles, the other’s called goo. And prickly people are precise, rigorous, logical. They like everything chopped up and clear. Goo people like it vague. For example, in physics, prickly people believe that the ultimate constituents of matter are particles. Goo people believe it’s waves.*

*facepalm*

If one has a technical understanding of QFT^{[1]} (or even half of a technical understanding, like me), this sounds totally silly. There’s no *real* question as to whether things are fundamentally “particles” or “waves”. There’s nothing “goo” about waves either; waves refer to mathematically precise phenomena. Physics is precise and mathematical (“prickly” if you like). Anyone who understands QFT must have a prickly side, and can deal with both waves and particles without confusion. Those who think physics can be “vague” *don’t understand the physics*.

(Critiquing Watts of course, not primarily the author of this post)

- ^
Quantum field theory


In theory, there can be multiple disconnected manifolds like this.

This read to me like you were saying “In Vivek’s framework, value shards explain away ..” and I was confused. I now think you mean “My take on Vivek’s is that value shards explain away ..”. Maybe reword for clarity?

(Might have a substantive reply later)