AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

# Vanessa Kosoy

So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I’m not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the latter.

Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you’ll agree… It’s hard to picture that I have some true platonic utility function.

Actually I am rather skeptical/agnostic on this. For me it’s fairly easy to picture that I have a “platonic” utility function, except that the time discount is dynamically inconsistent (not exponential).

I am in favor of exploring models of preferences which admit all sorts of uncertainty and/or dynamic inconsistency, but (i) it’s up for debate how many degrees of freedom we need to allow there and (ii) I feel that the case that logical induction is the right framework for this is kinda weak (but maybe I’m missing something).

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don’t know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren’t always necessary).

Ah, but there is a sense in which it doesn’t. The radical update rule is equivalent to updating on “secret evidence”. And in TRL we have such secret evidence. Namely, if we only look at the agent’s beliefs about “physics” (the environment), then they would be updated radically, because of secret evidence from “mathematics” (computations).

What if the super intelligent deity is less than maximally evil or maximally good? (E.g. the deity picking the median-performance world)

Thinking of the worst case is just a mathematical reflection of the fact that we want to be able to prove *lower bounds* on the expected utility of our agents. We have an unpublished theorem that, in some sense, *any* such lower bound guarantee has an infra-Bayesian formulation.

Another way to justify it is the infra-Bayesian CCT (see “Complete Class Theorem Weak Version” here).

What about the dutch-bookability of infraBayesians? (the classical dutch-book arguments seem to suggest pretty strongly that non-classical-Bayesians can be arbitrarily exploited for resources)

I think it might depend on the specific Dutch book argument, but *one* way infra-Bayesians escape them is by… being equivalent to certain Bayesians! For example, consider the setting where your agent has access to random bits that the environment can’t predict. Then, infra-Bayesian behavior is just the Nash equilibrium in a two-player zero-sum game (against Murphy). Now, the Nash strategy in such a game is the (Bayes) optimal response to the Nash strategy of the other player, so it can be regarded as “Bayesian”. However, the converse is false: not every best response to Nash is itself Nash. So, the infra-Bayesian decision rule is more restrictive than the corresponding Bayesian decision rule, but it’s a special case of the latter.

Is there a meaningful metaphysical interpretation of infra-Bayesianism that does not involve Murphy? (similarly to how Bayesianism can be metaphysically viewed as “there’s a real, static world out there, but I’m probabilistically unsure about it”)
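As a toy illustration of the zero-sum-game view above, one can compute the maximin (Nash) strategy of a small matrix game against “Murphy” with one linear program; this is a minimal sketch using scipy (matching pennies as the example game), not anything from the infra-Bayesian formalism itself:

```python
import numpy as np
from scipy.optimize import linprog

# Row player's payoff matrix for a 2x2 zero-sum game (matching pennies).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

m, n = A.shape
# Variables: x (row strategy, length m) and v (game value).
# Maximize v  <=>  minimize -v, subject to (A^T x)_j >= v for every column j,
# sum(x) = 1, x >= 0.
c = np.concatenate([np.zeros(m), [-1.0]])
A_ub = np.hstack([-A.T, np.ones((n, 1))])   # v - (A^T x)_j <= 0
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:m], res.x[m]
print("maximin strategy:", x.round(3), "value:", round(v, 3))
```

The resulting maximin strategy is exactly the strategy to which the Bayes-optimal-response characterization in the comment applies.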

I think of it as just another way of organizing uncertainty. The question is too broad for a succinct answer, I think, but here’s *one* POV you could take: let’s remember the frequentist definition of probability distributions as time limits of frequencies. Now, what if the time limit doesn’t converge? Then we can make a (crisp) infradistribution instead: the convex hull of all limit points. Classical frequentism also has the problem that the exact same event never repeats itself. But in “infra-frequentism” we can solve this: you don’t need the exact same event to repeat, you can draw the boundary around what counts as “the event” any way you like.

Once we go from passive observation to active interaction with the environment, your own behavior serves as *another* source of Knightian uncertainty. That is, you’re modeling the world in terms of certain features while ignoring everything else, but the state of everything else depends on your past behavior (and you don’t want to explicitly keep track of that). This line of thought can be formalized in the language of infra-MDPs (unpublished). And then ofc you complement this “aleatoric” uncertainty with “epistemic” uncertainty by considering the mixture of many infra-Bayesian hypotheses.
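The non-convergent-frequency picture is easy to simulate. Here is a toy bit sequence of my own choosing (blocks of 1s and 0s with doubling lengths) whose running frequency keeps oscillating between roughly 1/3 and 2/3 forever, so the “infra-frequentist” object would be the whole interval of limit points rather than a single number:

```python
import numpy as np

# Bit sequence with blocks of 1s and 0s of lengths 1, 2, 4, 8, ...
# The running frequency of 1s oscillates and never converges.
bits = []
val, k = 1, 0
while len(bits) < 1 << 16:
    bits.extend([val] * (1 << k))
    val ^= 1
    k += 1
bits = np.array(bits[: 1 << 16])

# Running frequency of 1s after each prefix.
freq = np.cumsum(bits) / np.arange(1, len(bits) + 1)
tail = freq[1000:]  # ignore the early transient
print("running-frequency range on the tail:",
      round(float(tail.min()), 3), "to", round(float(tail.max()), 3))
```

The set of limit points here is (approximately) the interval [1/3, 2/3]; its convex hull is the crisp infradistribution the comment describes.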

There is a formal sense in which “predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance”, see our latest post. It’s similar to MUDT, where, if you prove a contradiction then you can prove utility is as high as you like.

The exact same thing is true for classical probability theory: you have distributions, mixtures of distributions and linear functionals respectively. So I’m not sure what new difficulty comes from infra-Bayesianism?

Maybe it would help thinking about infra-MDPs and infra-POMDPs?

Also, here I wrote about how you could construct an infra-Bayesian version of the Solomonoff prior, although possibly it’s better to do it using infra-Bayesian logic.

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable—IE it can oscillate in some cases?)… AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It’s true that these questions still need work, but I think it’s rather clear that something like “there are no traps” is a sufficient condition for learnability. For example, if you have a finite set of “episodic” hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al, which is formulated in the language of game theory but can be regarded as an infra-Bayesian regret bound for infra-MDPs.
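A minimal sketch of the hypotheses-as-arms idea mentioned above, with a standard Exp3 implementation and two toy “hypotheses” (the Bernoulli reward processes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(reward_fns, T, gamma=0.1):
    """Exp3 adversarial bandit: each 'arm' is an episodic hypothesis/policy."""
    K = len(reward_fns)
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        r = reward_fns[arm](t)  # reward in [0, 1] for the pulled arm
        total += r
        # Importance-weighted reward estimate; update only the pulled arm.
        weights[arm] *= np.exp(gamma * (r / probs[arm]) / K)
    return total

# Two toy "hypotheses": Bernoulli rewards with means 0.8 and 0.2.
arms = [lambda t: float(rng.random() < 0.8),
        lambda t: float(rng.random() < 0.2)]
T = 2000
avg = exp3(arms, T) / T
print("average reward:", avg)  # concentrates near the better arm's mean
```

With more arms and episodic structure this is exactly the “treat hypotheses as arms” learner: regret against the best hypothesis grows sublinearly even if rewards are chosen adversarially.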

Radical Probabilism and InfraBayes are plausibly two orthogonal dimensions of generalization for rationality. Ultimately we want to generalize in both directions, but to do that, working out the radical-probabilist (IE logical induction) decision theory in more detail might be necessary.

True, but IMO the way to incorporate “radical probabilism” is via what I called Turing RL.

I don’t know how to talk about the CDT vs EDT insight in the InfraBayes world.

I’m not sure what precisely you mean by “CDT vs EDT insight”, but our latest post might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations *and* actions, EDT-style.

Perhaps more importantly, the Troll Bridge insights. As I mentioned in the beginning, in order to meaningfully solve Troll Bridge, it’s necessary to “respect logic” in the right sense. InfraBayes doesn’t do this, and it’s not clear how to get it to do so.

Is there a way to operationalize “respecting logic”? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

From your reply to Paul, I understand your argument to be something like the following:

Any solution to single-single alignment will involve a tradeoff between alignment and capability.

If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.

If AI systems *are* designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.

Given the technical knowledge to design cooperative AI, the incentives favor cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.

We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn’t have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to “just” solving single-single alignment.

I’m kind of scared of this approach because I feel that unless you really nail everything, there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both and in a similar way.

More generally, I guess I’m more optimistic than you about solving all such philosophical liabilities.

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don’t understand the proposal. Is there a link I should read?

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then using the “free” complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not “free complexity” because it’s not coming from a simplicity prior at all. For a program of length , you need a particular DFA of size . However, the actual DFA is of expected size with . The probability of having the DFA you need embedded in that is something like . So moving everything to the bridge makes a much less likely hypothesis.

I don’t understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual “unipolar” sense. These management assistants, DAOs etc. *are not aligned to the goals of their respective, individual users/owners*.

I do see two reasons why multipolar scenarios might require more technical research:

Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.

In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability

^{[1]}.

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I’m confused.

In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs. ↩︎

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.

Yes, you’re right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the “bridge rules” by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here’s a sketch of a proposal for how to solve this. Let’s construct our prior to be the *convolution* of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that’s sampled as follows:

- First, sample a hypothesis from the Solomonoff prior
- Second, choose a number according to some simple distribution with high expected value (e.g. ) with
- Third, sample a DFA with states and a uniformly random transition table
- Fourth, apply to the output of

We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of , however the source of our trouble is also “merely” a factor of .

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the limit).
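The sampling procedure above can be sketched in code. The transducer-style DFA below (uniformly random transition table plus a per-state output bit) is my own assumption for illustration, since the exact formalization is unpublished:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dfa(n_states, alphabet=2):
    """Uniformly random transition table; each state also emits a bit.
    (The per-state emission is an assumed detail, for illustration only.)"""
    delta = rng.integers(0, n_states, size=(n_states, alphabet))
    emit = rng.integers(0, 2, size=n_states)
    return delta, emit

def apply_dfa(dfa, bits):
    """Run the DFA over the 'physics' output, producing the observed stream."""
    delta, emit = dfa
    state, out = 0, []
    for b in bits:
        state = delta[state, b]
        out.append(int(emit[state]))
    return out

# Stand-in for the output of the Solomonoff-sampled program ("physics").
physics_output = list(rng.integers(0, 2, size=20))
dfa = sample_dfa(n_states=8)   # the "bridge rules"
observed = apply_dfa(dfa, physics_output)
print(observed)
```

The point of the construction is that the DFA’s description length is paid out of a separate, large “easiness” budget, so a simulation hypothesis gains nothing by smuggling complexity into the bridge rules.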


So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI’s probability of corruption is bounded by .

I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn’t specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
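Here is a toy sketch of the quantilization strategy and corruption bound discussed in this thread. The names `eps` (probability of corruption under the user policy) and `phi` (quantilization fraction) are assumed, since the post’s own notation did not survive extraction; the action space and utilities are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

def quantilize(sample_action, proxy_utility, phi, n=500):
    """Draw n actions from the user policy, keep the top phi-fraction by
    proxy utility, and pick one of those uniformly at random."""
    actions = sorted((sample_action() for _ in range(n)),
                     key=proxy_utility, reverse=True)
    top = actions[: max(1, int(phi * n))]
    return top[rng.integers(len(top))]

# Toy setup: with probability eps the user policy emits a "corrupt" action,
# which the (manipulable) proxy utility mistakenly rates highest.
eps, phi = 0.01, 0.2
CORRUPT = 10.0
sample_action = lambda: CORRUPT if rng.random() < eps else rng.random()
proxy_utility = lambda a: a

trials = 200
corrupt_rate = sum(quantilize(sample_action, proxy_utility, phi) == CORRUPT
                   for _ in range(trials)) / trials
print("empirical corruption rate:", corrupt_rate,
      "vs bound eps/phi =", eps / phi)
```

Even though the proxy utility actively favors the corrupt action, the quantilizer’s corruption probability stays bounded by `eps / phi`, because the corrupt actions can occupy only a small slice of the retained top fraction.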

More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.

**The harder the takeoff, the more dangerous this attack vector:** During every simulation cycle, the ability to defend against simulated malign AI depends on the power of the defense system *at the beginning of the cycle*^{[1]}. On the other hand, the capability of the attacker depends on its power *at the end of the cycle*. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.

**Inner control of the anchor makes the system safer:** Given a fixed objective-time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective-time anchor forward in time, in order to benefit from improvements in the defense system.

**Additional information about the external world makes the system safer:** Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63%. However, this is only the case if attacks in different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.

More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎
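The compounding arithmetic for the independent-attacks case (0.1% per cycle, 1000 cycles) can be checked directly:

```python
# Independent attacks: failure probability compounds across cycles.
p_cycle = 0.001          # per-cycle attack probability (0.1%)
cycles = 1000
p_fail_independent = 1 - (1 - p_cycle) ** cycles
print(f"independent cycles: {p_fail_independent:.1%}")

# Perfectly correlated case (the AI knows the outcome in advance):
# the 1000 cycles behave as one event, so no compounding.
p_fail_correlated = p_cycle
print(f"fully correlated:   {p_fail_correlated:.1%}")
```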

Virtually all the credit for this post goes to Alex, I think the proof of Proposition 1 was more or less my only contribution.

The distribution is the user’s policy, and the utility function for this purpose is the *eventual success probability* estimated by the user (as part of the timeline report) at the end of the “maneuver”. More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.

Yes, I *think* we are talking about the same thing. If you change your distribution over hypotheses, or the distribution over evidence implied by each hypothesis, then it means you’re changing the prior.

IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it’s not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it’s not, then we don’t promise anything). On the other hand, non-crisp gives a lower bound that is *variable* with the true distribution. We can think of non-crisp infradistributions as being *fuzzy* properties of the distribution (hence the name “crisp”). In fact, if we restrict ourselves to any of homogeneous, cohomogeneous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).

An alternative explanation of will-power is hyperbolic discounting. Your time discount function is not exponential, and therefore not dynamically consistent. So you can simultaneously (i) prefer gaining short-term pleasure at the expense of long-term goals (e.g. play games instead of studying) and (ii) take actions to prevent future-you from doing the same (e.g. go to rehab).
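A quick numerical illustration of the preference reversal that hyperbolic discounting produces (rewards, delays, and the discount rate are arbitrary toy numbers):

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Hyperbolic discounting: value falls off as 1/(1 + k*delay)."""
    return reward / (1.0 + k * delay)

small, t_small = 5.0, 1.0     # short-term pleasure (play games)
large, t_large = 8.0, 6.0     # long-term payoff (study)

# Evaluated now: the short-term option looks better.
prefer_small_now = (hyperbolic_value(small, t_small)
                    > hyperbolic_value(large, t_large))

# Evaluated 10 steps in advance: the long-term option looks better, so
# present-you endorses committing future-you (e.g. going to rehab).
d = 10.0
prefer_large_ahead = (hyperbolic_value(large, t_large + d)
                      > hyperbolic_value(small, t_small + d))

print(prefer_small_now, prefer_large_ahead)  # both True: a preference reversal
```

With exponential discounting, by contrast, the ranking of two fixed options never flips as both delays are shifted by the same amount, which is exactly the dynamic-consistency property hyperbolic discounting lacks.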

This seems simpler, but it doesn’t explain why the same drugs that cause/prevent weird beliefs should add/deplete will-power.

“How to weight evidence vs. the prior” is not a free parameter in Bayesianism. What you *can* have is some parameter controlling the prior itself (so that the prior can be less or more confident about certain things). I guess we can speculate that there are some parameters in the prior and some parameters in the reward function s.t. various drugs affect both of them simultaneously, and maybe there’s a planning-as-inference explanation for why the two are entangled.

There is some truth in that, in the sense that your beliefs must take a form that is *learnable* rather than just a god-given system of logical relationships.

Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?

Yes

imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.

I think that if you are offered a single bet, your utility is linear in money and your belief is a *crisp* infradistribution (i.e. a closed convex set of probability distributions), then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider and take the set of a-measures generated by and . Suppose you start with dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting dollars on the outcome , with a value of dollars.
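For the crisp case, the all-or-nothing claim is easy to check numerically. With a closed interval of heads-probabilities and an even-odds bet, the worst-case expected wealth is piecewise linear in the bet size, so the optimum sits at a corner (the intervals and stakes below are toy numbers of my own, not the post’s elided example):

```python
import numpy as np

def worst_case_value(bet, p_endpoints, wealth=1.0):
    """Expected wealth after an even-odds bet on heads, minimized over the
    crisp set of heads-probabilities. Expectation is linear in p, so it
    suffices to check the interval's endpoints."""
    return min(wealth + bet * (2 * p - 1) for p in p_endpoints)

bets = np.linspace(-1.0, 1.0, 201)   # bet sizes up to the full wealth

results = {}
for lo, hi in [(0.4, 0.7), (0.6, 0.7)]:
    values = [worst_case_value(b, (lo, hi)) for b in bets]
    results[(lo, hi)] = float(bets[int(np.argmax(values))])

# All-or-nothing: bet 0 when the set straddles 1/2, bet everything when
# every distribution in the set favors one outcome.
print(results)
```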

IIUC, here’s a simple way to test this hypothesis: initialize a random neural network, and then find the minimal-loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn’t require heuristic gradient descent): for square loss it’s just solving a large linear system once; for many other losses it should amount to convex optimization, for which we have provably efficient algorithms. And, I guess it’s underdetermined, so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I’m guessing some of the linked papers might have done something like this?
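A minimal numpy sketch of the proposed test on a tiny network (architecture, data, and regularization strength are all arbitrary choices): linearize the network at initialization via a numerical Jacobian, then obtain the tangent-space square-loss minimum from a single regularized linear solve rather than gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer net; parameters packed into one flat vector.
def unpack(theta, d_in=2, d_h=16):
    W1 = theta[: d_in * d_h].reshape(d_h, d_in)
    w2 = theta[d_in * d_h:]
    return W1, w2

def net(theta, X):
    W1, w2 = unpack(theta)
    return np.tanh(X @ W1.T) @ w2

# Toy regression data.
X = rng.normal(size=(64, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

theta0 = rng.normal(size=2 * 16 + 16) * 0.5

# Numerical Jacobian of the network outputs w.r.t. parameters at init.
eps = 1e-5
J = np.stack([(net(theta0 + eps * e, X) - net(theta0 - eps * e, X)) / (2 * eps)
              for e in np.eye(len(theta0))], axis=1)

# In the tangent space the model is linear in delta = theta - theta0, so the
# (ridge-regularized) square-loss minimum is one linear solve.
resid = y - net(theta0, X)
lam = 1e-3
delta = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ resid)

pred = net(theta0, X) + J @ delta
mse = float(np.mean((pred - y) ** 2))
print("tangent-space fit MSE:", mse)
```

Comparing this `mse` against what gradient descent on the actual parameters achieves is exactly the experiment the comment proposes; at larger widths the two are expected to approach each other (the NTK regime).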