Academic website: https://www.andrew.cmu.edu/user/coesterh/
Caspar Oesterheld
What’s the reasoning behind mentioning the fairly controversial, often deemed dangerous Roko’s basilisk over less risky forms of acausal trade (like superrational cooperation with human-aligned branches)?
Since Briggs [1] shows that EDT+SSA and CDT+SIA are both ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. I.e., in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.
Given some existing policy $\pi$, EDT+SSA recommends that upon receiving observation $o$ we should choose an action from

$$\arg\max_a \sum_{s_1,\ldots,s_n} \sum_i SSA(s_i \text{ in } s_1,\ldots,s_n \mid o, \pi_{o\rightarrow a})\, U(s_1,\ldots,s_n).$$

(For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $\pi_{o\rightarrow a}(o') = \pi(o')$ if $o' \neq o$ and $\pi_{o\rightarrow a}(o') = a$ otherwise. $SSA(s_i \text{ in } s_1,\ldots,s_n \mid o, \pi)$ is the SSA probability of being in state $s_i$ of the environment trajectory $s_1,\ldots,s_n$ given the observation $o$ and the fact that one uses the policy $\pi$.

The SSA probability is zero if $s_i$ does not give rise to observation $o$ and otherwise

$$SSA(s_i \text{ in } s_1,\ldots,s_n \mid o, \pi) = P(s_1,\ldots,s_n \mid \pi, o)\,\frac{1}{\#(o, s_1,\ldots,s_n)}.$$

Here, $\#(o, s_1,\ldots,s_n)$ is the number of times $o$ occurs in $s_1,\ldots,s_n$. Note that this is the minimal reference class version of SSA, also known as the double-halfer rule (because it assigns 1⁄2 probability to tails in the Sleeping Beauty problem and sticks with 1⁄2 if it’s told that it’s Monday). $P(s_1,\ldots,s_n \mid \pi, o)$ is the (regular, non-anthropic) probability of the sequence of states $s_1,\ldots,s_n$, given that $\pi$ is played and $o$ is observed at least once. If (as in the sum above) $o$ is observed at least once in $s_1,\ldots,s_n$, we can rewrite this as

$$P(s_1,\ldots,s_n \mid \pi, o) = \frac{P(s_1,\ldots,s_n \mid \pi)}{P(o \mid \pi)}.$$

Importantly, note that $P(o \mid \pi_{o\rightarrow a})$ is constant in $a$, i.e., the probability that you observe $o$ at least once cannot (in the present setting) depend on what you would do when you observe $o$.

Inserting this into the above, we get

$$\arg\max_a \sum_{s_1,\ldots,s_n \ni o}\; \sum_{i:\, s_i \text{ gives rise to } o} \frac{1}{\#(o, s_1,\ldots,s_n)}\, \frac{P(s_1,\ldots,s_n \mid \pi_{o\rightarrow a})}{P(o \mid \pi_{o\rightarrow a})}\, U(s_1,\ldots,s_n),$$

where the first sum on the right-hand side is over all histories that give rise to observation $o$ at some point. Dividing by the number of agents with observation $o$ in a history and setting the policy for all agents at the same time (i.e., summing over all positions $i$ that give rise to $o$) cancel each other out, such that this equals

$$\arg\max_a \sum_{s_1,\ldots,s_n \ni o} P(s_1,\ldots,s_n \mid \pi_{o\rightarrow a})\, U(s_1,\ldots,s_n),$$

where we have also dropped $P(o \mid \pi_{o\rightarrow a})$ because it is constant in $a$. Obviously, any optimal policy chooses $a$ in agreement with this. But the same disclaimers apply; if there are multiple observations, then multiple policies might satisfy the right-hand side of this equation and not all of these are optimal.
[1] Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
As one further data point, I also heard people close to/working at Anthropic giving “We won’t advance the state of the art.”-type statements, though I never asked about specifics.
My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to work on a benchmark right now and on that benchmark, Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC’s comment that they’re arguably not significantly advancing the state of the art.
I suppose even if Opus is only slightly better (or even just perceived to be better) and even if we all expect OpenAI to release a better GPT-4.5 soon, Anthropic could still take a bunch of OpenAI’s GPT-4 business with this. (I’ll probably switch from ChatGPT-4 to Claude, for instance.) So it’s not that hard to imagine an internal OpenAI email saying, “Okay, folks, let’s move a bit faster with these top-tier models from now on, lest too many people switch to Claude.” I suppose that would already be quite worrying to people here. (Whereas, people would probably worry less if Anthropic took some of OpenAI’s business by having models that are slightly worse but cheaper or more aligned/less likely to say things you wouldn’t want models to say in production.)
I think I sort of agree, but...
It’s often difficult to prove a negative and I think the non-existence of a crisp definition of any given concept is no exception to this rule. Sometimes someone wants to come up with a crisp definition of a concept for which I suspect no such definition to exist. I usually find that I have little to say and can only wait for them to try to actually provide such a definition. And sometimes I’m surprised by what people can come up with. (Maybe this is the same point that Roman Leventov is making.)
Also, I think there are many different ways in which concepts can be crisp or non-crisp. I think cooperation can be made crisp in some ways and not in others.
For example, I do think that (in contrast to human values) there are approximate characterizations of cooperation that are useful, precise and short. For example: “Cooperation means playing Pareto-better equilibria.”
One way in which I think cooperation isn’t crisp is that you can give multiple different sensible definitions that don’t fully agree with each other. (For example, some definitions (like the above) will include coordination in fully cooperative (i.e., common-payoff) games, and others won’t.) I think in that way it’s similar to comparing sets by size, where you can give lots of useful, insightful, precise definitions that disagree with each other. For example, bijection, isomorphism, and the subset relationship can each tell us when one set is larger than or as large as another, but they sometimes disagree and nobody expects that one can resolve the disagreement between the concepts or arrive at “one true definition” of whether one set is larger than another.
When applied to the real world rather than rational agent models, I would think we also inherit fuzziness from the application of the rational agent model to the real world. (Can we call the beneficial interaction between two cells cooperation? Etc.)
I think this is a good overview, but most of the views proposed here seem contentious and the arguments given in support shouldn’t suffice to change the mind of anyone who has thought about these questions for a bit or who is aware of the disagreements about them within the community.
Getting alignment right accounts for most of the variance in whether an AGI system will be positive for humanity.
If your values differ from those of the average human, then this may not be true/relevant. E.g., I would guess that, for a utilitarian, current average human values are worse than, e.g., a mix of 90% “paperclipping values” and 10% classical utilitarianism.
Also, if gains from trade between value systems are big, then a lot of value may come from ensuring that the AI engages in acausal trade (https://wiki.lesswrong.com/wiki/Acausal_trade ). This is doubly persuasive if you already see your own policies as determining what agents with similar decision theories but different values do elsewhere in the universe. (See, e.g., section 4.6.3 of “Multiverse-wide Cooperation via Correlated Decision Making”.)
Given timeline uncertainty, it’s best to spend marginal effort on plans that assume / work in shorter timelines.
Stated simply: If you don’t know when AGI is coming, you should make sure alignment gets solved in worlds where AGI comes soon.
I guess the question is what “soon” means. I agree with the argument provided in the quote. But there are also some arguments to work on longer timelines, e.g.:
If it’s hard and most value comes from full alignment, then why even try to optimize for very short timelines?
Similarly, there is a “social” difficulty of getting people in AI to notice your (or the AI safety community’s) work. Even if you think you could write down within a month a recipe for increasing the probability of AI being aligned by a significant amount, you would probably need much more than a month to make it significantly more likely to get people to consider applying your recipe.
It seems obvious that most people shouldn’t think too much about extremely short timelines (<2 years) or the longest plausible timelines (>300 years). So, these arguments together probably point to something in the middle of these and the question is where. Of course, it also depends on one’s beliefs about AI timelines.
To me it seems that the concrete recommendations (aside from the “do AI safety things”) don’t have anything to do with the background assumptions.
As one datapoint, fields like computer science, engineering and mathematics seem to make a lot more progress than ones like macroeconomics, political theory, and international relations.
For one, “citation needed”. But also: the alternative to doing technical AI safety work isn’t to do research in politics but to do political activism (or lobbying or whatever), i.e. to influence government policy.
As your “technical rather than political” point currently stands, it’s applicable to any problem, but it is obviously invalid at this level of generality. To argue plausibly that technical work on AI safety is more important than AI strategy (which is plausibly true), you’d have to refer to some specifics of the problems related to AI.
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
I think that janus isn’t claiming this and I also think it isn’t true. I think it’s all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.
> I tried to understand Caspar’s EDT+SSA but was unable to figure it out. Can someone show how to apply it to an example like the AMD to help illustrate it?
Sorry about that! I’ll try to explain it some more. Let’s take the original AMD. Here, the agent only faces a single type of choice—whether to EXIT or CONTINUE. Hence, in place of a policy we can just condition on $p$, the probability of continuing, when computing our SSA probabilities. Now, when using EDT+SSA, we assign probabilities to being a specific instance in a specific possible history of the world. For example, we assign probabilities of the form $P(X_1, (\mathrm{CONTINUE}, \mathrm{EXIT}) \mid p)$, which denotes the probability that, given I choose to CONTINUE with probability $p$, the history (CONTINUE, EXIT) is actual and that I am the instance at intersection $X_1$ (i.e., the first intersection). Since we’re using SSA, these probabilities are computed as follows:

$$P(X_1, (\mathrm{CONTINUE}, \mathrm{EXIT}) \mid p) = p(1-p)\cdot \frac{1}{2}.$$

That is, we first compute the probability that the history itself is actual (given $p$). Then we multiply it by the probability that within that history I am the instance at $X_1$, which is just 1 divided by the number of instances of myself in that history, i.e. 2.

Now, the expected value according to EDT + SSA given $p$ can be computed by just summing over all possible situations, i.e. over all combinations of a history and a position within that history, and multiplying the probability of that situation with the utility given that situation:

$$\begin{aligned}
&P(X_1, (\mathrm{EXIT}) \mid p)\cdot 0 + P(X_1, (\mathrm{CONTINUE}, \mathrm{EXIT}) \mid p)\cdot 4 + P(X_2, (\mathrm{CONTINUE}, \mathrm{EXIT}) \mid p)\cdot 4 \\
&\qquad + P(X_1, (\mathrm{CONTINUE}, \mathrm{CONTINUE}) \mid p)\cdot 1 + P(X_2, (\mathrm{CONTINUE}, \mathrm{CONTINUE}) \mid p)\cdot 1 \\
&= (1-p)\cdot 0 + \tfrac{1}{2}p(1-p)\cdot 4 + \tfrac{1}{2}p(1-p)\cdot 4 + \tfrac{1}{2}p^2\cdot 1 + \tfrac{1}{2}p^2\cdot 1 \\
&= 4p(1-p) + p^2.
\end{aligned}$$

And that’s exactly the ex ante expected value (or UDT-expected value, I suppose) of continuing with probability $p$. Hence, EDT+SSA’s recommendation in the AMD is the ex ante optimal policy (or UDT’s recommendation, I suppose). This realization is not original to myself (though I came up with it independently in collaboration with Johannes Treutlein) -- the following papers make the same point:
Rachael Briggs (2010): Putting a value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
Wolfgang Schwarz (2015): Lost memories and useless coins: revisiting the absentminded driver. In: Synthese. https://www.umsu.de/papers/driver-2011.pdf
My comment generalizes these results a bit to include cases in which the agent faces multiple different decisions.
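For concreteness, here’s a quick numerical check (my own illustration, using the standard AMD payoffs of 0 for exiting at the first intersection, 4 for continuing and then exiting, and 1 for continuing twice, as in the calculation above) that the EDT+SSA value coincides with the ex ante value:

```python
import numpy as np

def edt_ssa_value(p):
    # (probability of the history given p, number of driver instances in it, utility)
    histories = [((1 - p), 1, 0), (p * (1 - p), 2, 4), (p * p, 2, 1)]
    # sum over (history, position-within-history) pairs; SSA weights each position by 1/#instances
    return sum(prob * (1 / n) * u for prob, n, u in histories for _ in range(n))

def ex_ante_value(p):
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1

ps = np.linspace(0, 1, 11)
assert np.allclose([edt_ssa_value(p) for p in ps], [ex_ante_value(p) for p in ps])
```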
Caveat: The version of EDT provided above only takes dependences between instances of EDT making the same observation into account. Other dependences are possible because different decision situations may be completely “isomorphic”/symmetric even if the observations are different. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/ . Roughly speaking, my solution is to identify with all objects in the world that are perfectly correlated with you. However, the underlying motivation is unrelated to Conitzer’s example.
[2] Vincent Conitzer: A Dutch Book against Sleeping Beauties Who Are Evidential Decision Theorists. Synthese, Volume 192, Issue 9, pp. 2887-2899, October 2015. https://arxiv.org/pdf/1705.03560.pdf
At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer “Who is Mary Lee Pfeiffer’s son?” than “Who is Tom Cruise’s mother?” Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible to lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think “Who was born in Ulm?” might be answered correctly by more people than “Where was Einstein born?”, even though “Einstein was born in Ulm” is a more common sentence for people to read than “Ulm is the city that Einstein was born in”.
If I had to run an experiment to test whether similar effects apply in humans, I’d probably try to find cases where A and B in and of themselves are equally salient but the association A → B is nonetheless more salient than the association B → A. The alphabet is an example of this (where the effect is already confirmed).
I looked at the version 2017-12-30 10:48:11Z.
Overall, I think it’s a nice, systematic overview. Below are some comments.
I should note that I’m not very expert on these things. This is also why the additional literature I mention is mostly weakly related stuff from FRI, the organization I work for. Sorry about that.
An abstract would be nice.
Locators in the citations would be useful, i.e. “Beckstead (2013, sect. XYZ)” instead of just “Beckstead (2013)” when you talk about some specific section of the Beckstead paper. (Cf. section “Pageless Documentation” of the humorous Academic Citation Practice: A Sinking Sheep? by Ole Bjørn Rekdal.)
>from a totalist, consequentialist, and welfarist (but not necessarily utilitarian) point of view
I don’t think much of your analysis assumes welfarism (as I understand it)? Q_w could easily denote things other than welfare (e.g., how virtue-ethical, free, productive, autonomous, or natural the mean person is), right? (I guess some of the discussion sections are fairly welfarist, i.e. they talk about suffering, etc., rather than freedom and so forth.)
>an existential risk as one where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.
Maybe some people would interpret this definition as excluding some of the “shrieks” and “whimpers”, since in some of them, “humanity’s potential is realized” in that it colonizes space, but not in accordance with, e.g., the reader’s values. Anyway, I think this definition is essentially a quote from Bostrom (maybe use quotation marks?), so it’s alright.
>The first is the probability P of reaching time t.
Maybe say more about why you separate N_w(t) (in the continuous model) into P(t) and N(t)?
I also don’t quite understand whether equation 1 is intended as the expected value of the future or as the expected value of a set of futures w that all have the same N_w(t) and Q_w(t). The problem is that if it’s the expected value of the future, I don’t get how you can simplify something like E_w[N_w(t) * Q_w(t)] (i.e., the sum over futures w of P(w) * N_w(t) * Q_w(t)) into the right side of your equation 1. (E.g., you can’t just let N(t) and Q(t) denote expected numbers of moral patients and expected mean qualities of life, because the mean qualities in larger worlds ought to count for more, right?)
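A toy illustration of the worry (numbers mine, not from the paper): suppose there are two equally likely futures, w_A with N_A(t) = 10, Q_A(t) = 1 and w_B with N_B(t) = 1000, Q_B(t) = 0.1. Then E[N(t) * Q(t)] = 0.5 * 10 + 0.5 * 100 = 55, whereas E[N(t)] * E[Q(t)] = 505 * 0.55 ≈ 278. So one can’t replace N_w(t) and Q_w(t) by their expectations unless the two are uncorrelated.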
I suspect that when reading the start of sect. 3.1, a lot of readers will wonder whether you endorse all the assumptions underlying your model of P(t). In particular, I would guess that people would disagree with the following two assumptions:
-> Short term x-risk reduction (r_1) doesn’t have any effect on long-term risk (r). Perhaps this is true for some fairly specific work on preventing extinction but it seems less likely for interventions like building up the UN (to avoid all kinds of conflict, coordinate against risks, etc.).
-> Long-term extinction risk is constant. I haven’t thought much about these issues but I would guess that extinction risk becomes much lower, once there is a self-sustaining colony on Mars.
Reading further, I see that you address these in sections 3.2 and 3.3. Maybe you could mention/refer to these somewhere near the start of sect. 3.1.
On page 3, you say that the derivative of -P(t) w.r.t. r_1 denotes the value of reducing r_1 by one unit. This is true in this case because P(t) is linear in r_1. But in general, the value of reducing r_1 by one unit is just P(t,r_1-1)-P(t,r_1), right?
Is equation 3, combined with the view that the cost of one unit of f1 is constant, consistent with Ord’s “A plausible model would be that it is roughly as difficult to halve the risk per century, regardless of its starting probability, and more generally, that it is equally difficult to reduce it by some proportion regardless of its absolute value beforehand.”? With your model, it looks like bringing f_1 from 0 to 0.5 and thus halving r_1 is just as expensive as bringing f_1 from 0.5 to 1.
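To spell out the tension with a quick calculation (my framing, assuming the residual short-term risk in your model is r_1 * (1 - f_1)): if one unit of f_1 always costs the same, total cost is linear in f_1, whereas constant cost per halving would mean the cost of reaching a given f_1 grows like log_2(1 / (1 - f_1)), which diverges as f_1 approaches 1. Under the linear-cost reading, the single halving from f_1 = 0 to f_1 = 0.5 costs as much as the infinitely many halvings from f_1 = 0.5 to f_1 = 1.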
On p. 7, “not to far off”—probably you mean “too”?
>For example, perhaps we will inevitably develop some hypothetical weapons that give so large an advantage to offence over defence that civilisation is certain to be destroyed.
AI risk is another black ball that will become more accessible. But maybe you would rather not model it as extinction. At least AI risk doesn’t necessarily explain the Fermi paradox and AIs may create sentient beings.
>Ord argues that we may be able to expect future generations to be more interested in risk reduction, implying increasing f_i
I thought f_i was meant to model the impact that we can have on r_i? So, to me it seems more sensible to model the involvement of future generations, to the extent that we can’t influence it, as “a kind of event E” (as you propose) or, more generally, as implying that the non-intervention risk levels r_i decrease.
>This would only reinforce the case for extinction risk reduction.
It seems that future generations caring about ERR makes short-term ERR more important (because the long-term future is longer and thus can contain more value). But it makes long-term ERR less important, because future generations will, e.g., do AI safety research anyway. (In section “Future resources” of my blog post Complications in evaluating neglectedness, I make the general point that for evaluating the neglectedness of an intervention, one has to look at how many resources future generations will invest into that intervention.)
>There is one case in which it clearly is not: if space colonisation is in fact likely to involve risk-independent islands. Then high population goes with low risk, increasing the value of the future relative to the basic model
(I find risk-independent islands fairly plausible.)
>The expected number of people who will live in period t is
You introduced N(t) as the number of morally relevant beings (rather than “people”).
>However, this increase in population may be due to stop soon,
Although it is well-known that some predict population to stagnate at 9 billion or so, a high-quality citation would be nice.
>The likelihood of space colonisation, a high-profile issue on which billions of dollars is spent per year (Masters, 2015), also seems relatively hard to affect. Extinction risk reduction, on the other hand, is relatively neglected (Bostrom, 2013; Todd, 2017), so it could be easier to achieve progress in this area.
I have only briefly (in part due to the lack of locators) checked the two sources, but it seems that this varies strongly between different extinction risks. For instance, according to Todd (2017), >300bn (and thus much more than on space colonization) is spent on climate change, 1-10bn on nuclear security, 1bn on extreme pandemic prevention. So, overall much more money goes into extinction risk reduction than into space colonization. (This is not too surprising. People don’t want to die, and they don’t want their children or grandchildren to die. They don’t care nearly as much about whether some elite group of people will live on Mars in 50 years.)
Of course, there a lot of complications to this neglectedness analysis. (All three points I discuss in Complications in evaluating neglectedness seem to apply.)
>Some people believe that it’s nearly impossible to have a consistent impact on Q(t) so far into the future.
Probably a reference would be good. I guess to the extent that we can’t affect far future Q(t), we also can’t affect far future r_i.
>However, this individual may be biased against ending things, for instance because of the survival instinct, and so could individuals or groups in the future. The extent of this bias is an open question.
It’s also a bit unclear (at least based on what you write) what legitimizes calling this a bias, rather than simply a revealed preference not to die (even in cases in which you or I as outside observers might think it to be preferable not to live) and thus evidence that their lives are positive. Probably one has to argue via status quo bias or something like that.
>We may further speculate that if the future is controlled by altruistic values, even powerless persons are likely to have lives worth living. If society is highly knowledgeable and technologically sophisticated, and decisions are made altruistically, it’s plausible that many sources of suffering would eventually be removed, and no new ones created unnecessarily. Selfish values, on the other hand, do not care about the suffering of powerless sentients.
This makes things sound more binary than they actually are. (I’m sure you’re aware of this.) In the usual sense of the word, people could be “altruistic” but in a non-consequentialist way. There may be lots of suffering in such worlds. (E.g., some libertarians may regard intervening in the economy as unethical even if companies start creating slaves. A socialist, on the other hand, may view capitalism as fundamentally unjust, try to regulate/control the economy and thus cause a lot of poverty.) Also, even if someone is altruistic in a fairly consequentialist way, they may still not care about all beings that you/I/the reader cares about. E.g., economists tend to be consequentialists but rarely consider animal welfare.
I think for the animal suffering (both wild animals and factory farming) it is worth noting that it seems fairly unlikely that this will be economically efficient in the long term, but that the general underlying principles (Darwinian suffering and exploiting the powerless) might carry over to other beings (like sentient AIs).
Another way in which the future may be negative would be the Malthusian trap btw. (Of course, some would regard at least some Malthusian trap scenarios as positive, see, e.g., Robin Hanson’s The Age of Em.) Presumably this belongs to 5.2.1, since it’s a kind of coordination failure.
As you say, I think the option value argument isn’t super persuasive, because it seems unlikely that the people in power in a million years share my (meta-)values (or agree with the way I do compromise).
Re 5.2.3: Another relevant reference on why one should cooperate—which is somewhat separate from the point that if mutual cooperation works out the gains from trade are great—is Brian Tomasik’s Reasons to Be Nice to Other Value Systems.
>One way to increase Q(t) is to advocate for positive value changes in the direction of greater consideration for powerless sentients, or to promote moral enhancement (Persson and Savulescu, 2008). Another approach might be to work to improve political stability and coordination, making conflict less likely as well as increasing the chance that moral progress continues.
Relevant:
https://foundational-research.org/international-cooperation-vs-ai-arms-race/
http://reducing-suffering.org/values-spreading-often-important-extinction-risk/
Here’s a simple toy model that illustrates the difference between 2 and 3 (that doesn’t talk about attention layers, etc.).
Say you have a bunch of triplets $(x_i, y_i, z_i)$. You want to train a model that predicts $y_i$ from $x_i$ and $z_i$ from $(x_i, y_i)$.

Your model consists of three components: $f$, $g_1$, $g_2$. It makes predictions as follows: $\hat y = g_1(f(x))$ and $\hat z = g_2(f(x), y)$.

(Why have such a model? Why not have two completely separate models, one for predicting $y$ and one for predicting $z$? Because it might be more efficient to use a single $f$ both for predicting $y$ and for predicting $z$, given that both predictions presumably require “interpreting” $x$.)

So, intuitively, it first builds an “inner representation” (embedding) of $x$. Then it sequentially makes predictions based on that inner representation.

Now you train $f$ and $g_1$ to minimize the prediction loss on the $(x, y)$ parts of the triplets. Simultaneously you train $f$ and $g_2$ to minimize prediction loss on the full triplets. For example, you update $f$ and $g_1$ with the gradients

$$\nabla_{f,\,g_1}\; L\big(g_1(f(x)),\, y\big)$$

and you update $f$ and $g_2$ with the gradients

$$\nabla_{f,\,g_2}\; L\big(g_2(f(x), y),\, z\big).$$

(The $y$ here is the “true” $y$, not one generated by the model itself.) This training pressures $g_1$ to be myopic in the second and third sense described in the post. In fact, even if we were to train $g_2$ with the $\hat y$ predicted by $g_1$ rather than the true $y$, $g_1$ is pressured to be myopic.

Type 3 myopia: Training doesn’t pressure $g_1$ to output something that makes the $z$ follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data $y = 0$ implies $z = 0$, while under $y \neq 0$, $z$ follows some distribution that depends in complicated ways on $x$. Then $g_1$ will not try to predict $y = 0$ more often.

Type 2 myopia: $g_1$ won’t try to provide useful information to $g_2$ in its output, even if it could. For example, imagine that the $y$s are strings representing real numbers. Imagine that $x$ is always a natural number, that $y$ is the $x$-th Fibonacci number and $z$ is the $2x$-th Fibonacci number. Imagine further that the model representing $g_1$ is large enough to compute the $2x$-th Fibonacci number, while the model representing $g_2$ is not. Then one way in which one might think one could achieve low predictive loss would be for $g_1$ to output the $x$-th Fibonacci number and then encode, for example, the $2x$-th Fibonacci number in the decimal digits. (E.g., for $x = 8$: $\hat y = 21.000\ldots000987$, since the 8th Fibonacci number is 21 and the 16th is 987.) And then $g_2$ computes the $2x$-th Fibonacci number from the decimal digits. But the above training will not give rise to this strategy, because $g_2$ gets the true $y$ as input, not the one produced by $g_1$. Further, even if we were to change this, there would still be pressure against this strategy because $g_1(f(x))$ is not optimized to give useful information to $g_2$. (The gradient used to update $g_1$ doesn’t consider the loss on predicting $z$.) If it ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on $y$.

Of course, $g_1$ still won’t be pressured to be type-1-myopic. If predicting $y$ requires predicting $z$, then $g_1$ will be trained to predict (“plan”) $z$.
(Obviously, $g_2$ is pressured to be myopic in this simple model.)
Now what about $f$? Well, $f$ is optimized both to enable predicting $y$ from $x$ and predicting $z$ from $(x, y)$. Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $y$ to achieve better accuracy on $z$. In particular, we should expect trained models to find an efficient tradeoff between accuracy on $y$ and accuracy on $z$. When $z$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $y$.

So, $f$ is not “Type 2” myopic. Or perhaps put differently: The calculations going into predicting $y$ aren’t optimized purely for predicting $y$.

However, $f$ is still “Type 3” myopic. Because the prediction $\hat y$ made by $g_1$ isn’t fed (in training) as an input to $g_2$ or the loss, there’s no pressure towards making $f$ influence the output of $g_1$ in a way that has anything to do with $z$. (In contrast to the myopia of $g_1$, this really does hinge on not using $\hat y$ in training. If $\hat y$ mattered in training, then there would be pressure for $f$ to trick $g_1$ into performing calculations that are useful for predicting $z$. Unless you use stop-gradients...)
* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.
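To make the gradient structure above concrete, here is a minimal PyTorch sketch (my own illustration; the architectures and sizes are arbitrary placeholders). The point is just that $f$ is the only component whose gradients mix the $y$-loss and the $z$-loss, while $g_2$ always sees the true $y$:

```python
import torch
import torch.nn as nn

f = nn.Linear(8, 16)     # shared "interpretation" of x
g1 = nn.Linear(16, 1)    # predicts y from f(x)
g2 = nn.Linear(17, 1)    # predicts z from f(x) and the *true* y

params = list(f.parameters()) + list(g1.parameters()) + list(g2.parameters())
opt = torch.optim.SGD(params, lr=1e-2)
mse = nn.MSELoss()

def training_step(x, y, z):
    e = f(x)
    y_hat = g1(e)                          # g1 is only ever trained against y
    z_hat = g2(torch.cat([e, y], dim=-1))  # g2 receives the true y, so g1's output never affects the z-loss
    loss = mse(y_hat, y) + mse(z_hat, z)   # f gets gradients from both terms and has to trade off;
                                           # g1 only from the first term, g2 only from the second
    opt.zero_grad()
    loss.backward()
    opt.step()
```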
Great to see more work on surrogate goals/SPIs!
>Personally, the author believes that SPI might “add up to normality”—that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I’m a bit confused by this claim. To me it’s a bit unclear what you mean by “adding up to normality”. (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn’t change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don’t fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, …) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it’s still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem do matter substantially for game-theoretic analysis.) Some just aren’t justified at all in your post. (E.g., re A1, you’re saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don’t know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X. Here are three instances of this (I think these are the only three), the first one being most significant.
The main objection of the post is that while adopting an SPI, the original players must keep a bunch of things (at least approximately) constant(/analogous to the no-SPI counterfactual) even when they have an incentive to change that thing, and they need to do this credibly (or, rather, make it credible that they aren’t making any changes). You argue that this is often unrealistic. Well, my initial reaction was: “Sure, I know these things!” (Relatedly: while I like the bandit v caravan example, this point can also be illustrated with any of the existing examples of SPIs and surrogate goals.) I also don’t think the assumption is that unrealistic. It seems that one substantial part of your complaint is that besides instructing the representative/self-modifying, the original player/principal can do other things about the threat (like advocating a ban on real or water guns). I agree that this is important. If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway. But it’s easy to come up with scenarios where this is not a problem. E.g., when an agent considers immediate self-modification, *all* her future decisions will be guided by the modified u.f. Or when the SPI is applied to some isolated interaction. When all is in the representative’s hands, we only need to ensure that the *representative* acts in the same way it would act in a world where SPIs aren’t a thing.
And I don’t think it’s that difficult to come up with situations in which the latter thing can be comfortably achieved. Here is one scenario. Imagine the two of us play a particular game G with SPI G’. The way in which we play this is that we both send a lawyer to a meeting and then the lawyers play the game in some way. Then we could mutually commit (by contract) to pay our lawyers in proportion to the utilities they obtain in G’ (and to not make any additional payments to them). The lawyers at this point may know exactly what’s going on (that we don’t really care about water guns, and so on) -- but they are still incentivized to play the SPI game G’ to the best of their ability. You might even beg your lawyer to never give in (or the like), but the lawyer is incentivized to ignore such pleas. (Obviously, there could still be various complications. If you hire the lawyer only for this specific interaction and you know how aggressive/hawkish different lawyers are (in terms of how they negotiate), you might be inclined to hire a more aggressive one with the SPI. But you might hire the lawyer you usually hire. And in practice I doubt that it’d be easy to figure out how hawkish different lawyers are.)
Overall I’d have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic). I don’t remember Tobi’s posts very well, but our paper definitely doesn’t spend much space on discussing these important questions.

On SPI selection, I think the point from Section 10 of our paper is quite important, especially in the kinds of games that inspired the creation of surrogate goals in the first place. I agree that in some games, the SPI selection problem is no easier than the equilibrium selection problem in the base game. But there are games where it does fundamentally change things because *any* SPI that cannot further be Pareto-improved upon drastically increases your utility from one of the outcomes.
Re the “Bargaining in SPI” section: For one, the proposal in Section 9 of our paper can still be used to eliminate the zeroes!
Also, the “Bargaining in SPI” and “SPI Selection” sections to me don’t really seem like “objections”. They are limitations. (In a similar way as “the smallpox vaccine doesn’t cure cancer” is useful info but not an objection to the smallpox vaccine.)
Nice post!
Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.
In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with (I think) two important details:
- The markets aren’t subsidized. This ensures that the game is zero-sum for the predictors—they don’t prefer one action to be taken over another. In the scoring rules setting, subsidizing would mean scoring relative to some initial prediction $p_0$ provided by the market. Because the initial prediction might differ in how bad it is for different actions, the predictors might prefer a particular action to be taken. Conversely, the predictors might have no incentive to correct an overly optimistic prediction for one of the actions if doing so causes that action not to be taken. The examples in Section 3.2 of the Othman and Sandholm paper show these things.
- The second is “optimism bias” (a good thing in this context): “If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one.” (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don’t have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant’s post “Two Major Obstacles for Logical Inductor Decision Theory”: One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you’re still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).

My work on decision scoring rules describes the best you can get out of a single predictor. Basically you can incentivize a single predictor to tell you what the best action is and what the expected utility of that action is, but nothing more (aside from some degenerate cases).
Your result shows that if you have two predictors with the same information, then you can get slightly more: you can incentivize them to tell you what the best action is and what the full distribution over outcomes will be if you take the action.
You also get some other stuff (as you describe starting from the sentence, “Additionally, there is a bound on how inaccurate...”). But these other things seem much less important. (You also say: “while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them.” But the same is true of decision scoring rules for example.)
Here’s one thing that is a bit unclear to me, though.
If you have two predictors that have the same information, there’s other, more obvious stuff you can do. For example, here’s one:
- Ask Predictor 1 for a recommendation for what to do.
- Ask Predictor 2 for a prediction over outcomes conditional on Predictor 1′s recommendation.
- Take the action recommended by Predictor 1.
- Observe an outcome o with a utility u(o).
- Pay Predictor 1 in proportion to u(o).
- Pay Predictor 2 according to a proper scoring rule.

In essence, this is just splitting the task into two: There’s the issue of making the best possible choice and there’s the issue of predicting what will happen. We assign Predictor 1 to the first and Predictor 2 to the second problem. For each of these problems separately, we know what to do (use proper (decision) scoring rules). So we can solve the overall problem.
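A toy sketch of the bookkeeping in this split mechanism (my own illustration; the callables are placeholders for however the predictors and the environment are actually queried):

```python
import math

def run_mechanism(actions, outcomes, recommend, predict, take_action, utility):
    a = recommend(actions)      # Predictor 1 recommends an action
    q = predict(a, outcomes)    # Predictor 2 reports a distribution over outcomes given that action
    o = take_action(a)          # take the recommended action and observe an outcome
    pay1 = utility(o)           # Predictor 1 is paid in proportion to the realized utility
    pay2 = math.log(q[o])       # Predictor 2 is paid by a proper scoring rule (here: log score)
    return a, o, pay1, pay2
```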
So this mechanism also gets you an honest prediction and an honest recommendation for what to do. In fact, one advantage of this approach is that honesty is maintained even if the Predictors 1 and 2 have _different_ information/beliefs! (You don’t get any information aggregation with this (though see below). But your approach doesn’t have any information aggregation either.)
As per the decision scoring rules paper, you could additionally ask Predictor 1 for an estimate of the expected utility you will obtain. You can also let the Predictor 2 look at Predictor 1′s prediction (or perhaps even score Predictor 2 relative to Predictor 1′s prediction). (This way you’d get some information aggregation.) (You can also let Predictor 1 look at Predictor 2′s predictions if Predictor 2 starts out by making conditional predictions before Predictor 1 gives a recommendation. This gets more tricky because now Predictor 2 will want to mislead Predictor 1.)
I think your proposal for what to do instead of the above is very interesting and I’m glad that we now know that this method exists and that it works. It seems fundamentally different and it seems plausible that this insight will be very useful. But is there some concrete advantage of zero-sum conditional prediction over the above method?
Free will is a controversial, confusing term that, I suspect, different people take to mean different things. I think to most readers (including me) it is unclear what exactly the Case 1 versus 2 distinction means. (What physical property of the world differs between the two worlds? Maybe you mean not having free will to mean something very mundane, similar to how I don’t have free will about whether to fly to Venus tomorrow because it’s just not physically possible for me to fly to Venus, so I have to “decide” not to fly to Venus?)
I generally think that free will is not so relevant in Newcomb’s problem. It seems that whether there is some entity somewhere in the world that can predict what I’m doing shouldn’t make a difference for whether I have free will or not, at least if this entity isn’t revealing its predictions to me before I choose. (I think this is also the consensus on this forum and in the philosophy literature on Newcomb’s problem.)
>CDT believers only see the second decision. The key here is realising there are two decisions.
Free will aside, as far as I understand, your position is basically in line with what most causal decision theorists believe: You should two-box, but you should commit to one-boxing if you can do so before your brain is scanned. Is that right? (I can give some references to discussions of CDT and commitment if you’re interested.)
If so, how do you feel about the various arguments that people have made against CDT? For example, what would you do in the following scenario?
>Two boxes, B1 and B2, are on offer. You may purchase one or none of the boxes but not both. Each of the two boxes costs $1. Yesterday, Omega put $3 in each box that she predicted you would not acquire. Omega’s predictions are accurate with probability 0.75.
In this scenario, CDT always recommends buying a box, which seems like a bad idea because from the perspective of the seller of the boxes, they profit when you buy from them.
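To spell out the arithmetic under the natural reading of the setup (my own rendering): buying box B_i pays off only if Omega mispredicted, so the evidential expected value of buying is 0.25 * 3 - 1 = -0.25, and EDT abstains. A CDT agent instead reasons that the boxes are already filled: since Omega puts $3 in every box she predicts you won’t acquire and you can acquire at most one box, at least one box contains $3 for sure, so the agent’s credences satisfy P($3 in B1) + P($3 in B2) >= 1. Hence for at least one box the causal expected value of buying is at least 3 * 0.5 - 1 = 0.5 > 0, so CDT buys and loses 0.25 in expectation to the seller.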
>TDT believers only see the first decision, [...] The key here is realising there are two decisions.
I think proponents of TDT and especially Updateless Decision Theory and friends are fully aware of this possible “two-decisions” perspective. (Though typically Newcomb’s problem is described as only having one of the two decision points, namely the second.) They propose that the correct way to make the second decision (after the brain scan) is to take the perspective of the first decision (or similar). Of course, one could debate whether this move is valid and this has been discussed (e.g., here, here, or here).
Also: Note that evidential decision theorists would argue that you should one-box in the second decision (after the brain scan) for reasons unrelated to the first-decision perspective. In fact, I think that most proponents of TDT/UDT/… would agree with this reasoning also, i.e., even if it weren’t for the “first decision” perspective, they’d still favor one-boxing. (To really get the first decision/second decision conflict you need cases like counterfactual mugging.)
To me Bing Chat actually seems worse/less impressive (e.g., more likely to give incorrect or irrelevant answers) than ChatGPT, so I’m a bit surprised. Am I the only one that feels this way? I’ve mostly tried the two systems on somewhat different kinds of prompts, though. (For example, I’ve tried (with little success) to use Bing Chat instead of Google Search.) Presumably some of this is related to the fine-tuning being worse for Bing? I also wonder whether the fact that Bing Chat is hooked up to search in a somewhat transparent way makes it seem less impressive. On many questions it’s “just” copy-and-pasting key terms of the question into a search engine and summarizing the top result. Anyway, obviously I’ve not done any rigorous testing...
Excellent—we should ask THEM about it.
Yes, that’s the plan.
Some papers that express support for CDT:
https://link.springer.com/article/10.1007/s11229-011-0022-6 (In general, James Joyce is a well-known defender of CDT.)
https://link.springer.com/article/10.1007/s11098-018-1206-4
https://onlinelibrary.wiley.com/doi/full/10.1111/phpr.12466 (argues for two-boxing, but against CDT)
In his book Causality, Judea Pearl also argues in favor of CDT (though he doesn’t explicitly discuss Newcomb’s problem).
In case you just want to know why I believe support for CDT/two-boxing to be widespread among academic philosophers, see https://philpapers.org/archive/BOUWDP.pdf , which is a survey of academic philosophers in which more people preferred two-boxing to one-boxing in Newcomb’s problem, especially among philosophers with relevant expertise. Some philosophers have filled out this survey publicly, so you can e.g. go to https://philpapers.org/surveys/public_respondents.html?set=Target+faculty , click on a name and then on “My Philosophical Views” to find individuals who endorse two-boxing. (I think there’s also a way to download the raw data and thereby get a list of two-boxers.)
Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)
Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these topics. Please correct me if I’m misunderstanding anything! The writing is not very polished—sorry!
Ignoring all the sequential stuff, my understanding is that the first paper basically does this: First, we train a model to predict utilities after observing actions, i.e., make predictions conditional on actions. So in particular, we get a function a ---> E[utility | a] that maps an observed action by the agent onto a prediction of future reward/utility. Then if we use some procedure to find the action a that maximizes E[utility | a], it seems that we have an EDT agent. I think this is essentially the case of an “EDT overseer” who rewards based on actions (rather than outcomes) in “Approval-directed agency and the decision theory of Newcomb-like problems”. Also see the discussion of Obstacle 1 in “Two Major Obstacles for Logical Inductor Decision Theory”.
Now what could go wrong with this? I think in some sense the problem is generally that it’s unclear how the predictive model works, or where it comes from. The second paper (the DeepMind one) basically points out one issue with this. Other issues are known to this community. I’ll start with an issue that has been known to this community: the 5 and 10 problem / the problem of counterfactuals. If the agent always (reliably) chooses the action a that maximizes E[utility | a], then the predictive model’s counterfactual predictions (i.e., predictions for all other actions) could be nonsensical without being strictly speaking wrong. So for example, in 5 and 10, you choose between a five dollar bill and a ten dollar bill. (There’s no catch and you should clearly just take the ten dollar bill.) The model predicts that if you take the five dollar bill, you will get five dollars, and (spuriously / intuitively falsely) that if you take the ten dollar bill, you get nothing. Because you are maximizing expected utility according to this particular predictive model, you take the five dollars. So the crazy prediction for what happens if you take the ten dollars is never falsified.
In non-Newcomb-like scenarios, a simple, extremely standard solution to this problem is to train the predictive model (the thing that gives a ---> E[utility | a]) while the agent follows some policy that randomizes over all actions (perhaps one that takes actions with probabilities in proportion to the model’s predictions E[utility | a]). My understanding is that this is how the first paper avoids these issues and gives good results. Unfortunately, in Newcomb-like problems these approaches tend to lead to pretty CDT-ish behavior, as shown in “Reinforcement Learning in Newcomblike Environments”.
Anyway, the second paper (the DeepMind one) points out another issue related to where the E[utility | action] model comes from. Roughly, the story — which I think is very well described in Section 2 — seems to be the following: the E[utility | action] model is trained on the actions of an expert who knows whether X=1,2 and acts on that fact by choosing A=X; then the E[utility | action] model won’t work for a non-expert agent, i.e., one who doesn’t observe X. I view this as a distributional shift issue — you train a model (the a ---> E[utility | a] one) in a setting where A=X, and then you apply it in a setting where sometimes A and X are uncorrelated.
It’s also similar to the Smoking Lesion/medical Newcomb-like problems! Consider the following medical Newcomb-like problem: First we learn the fact that sick people go to the doctor and healthy people don’t go to the doctor. Then without looking at how healthy I am, I don’t go to the doctor so as to gain evidence that I am healthy. Arguably what goes wrong here is also that I’m using a rule for prediction out of distribution on someone who doesn’t look at whether they’re sick. I think it relates to one of the least challenging versions of medical Newcomb-like problems and it’s handled comfortably by the so-called tickle defense.
Interlude: The paper talks about how this relates to hallucination in LLMs. So what’s that about? IIUC, the idea is that when generating text, LLMs incorrectly update based on the text they generate themselves. For example, imagine that you want an LLM to generate ten tokens. Then after generating the first nine tokens x_1, ..., x_9, it will predict the tenth token from its learned distribution P(x_10 | x_1, ..., x_9). But this distribution was trained on fully human-, not LLM-written text. So (in my way of thinking), P(x_10 | x_1, ..., x_9) might do poorly (i.e., not give a human-like continuation of x_1, ..., x_9), because it was trained on seeing nine tokens created by a human and having to predict a continuation by a human rather than nine tokens by itself/an LLM and having to predict a continuation by a human. For example, we might imagine that if x_1, ..., x_9 are words that only a human expert confident in a particular claim C would say, then the LLM will predict continuations that confidently defend claim C, even if the LLM doesn’t know anything about C. I’m not sure I really buy this explanation of hallucination. I think the claim would need more evidence than the authors provide. But it’s definitely a very interesting point.
Now, back to the original toy model. Again, I would view this as a distribution shift problem. If we make some assumptions, though, we can infer/guess a model (i.e. function a ---> E[utility | a]) that predicts the utility obtained by a non-expert, i.e., an agent who doesn’t observe X. Specifically, let’s assume that we are told the conditional distributions P(utility | X=1, A=0) and P(utility | X=0, A=1) (which we never see in training if the agent in training always knows and acts on X). Let’s also assume that we know that the difference between the training distribution and the new setting is that in the new setting the agent chooses A independently of X. Then in the new model we just need to make X and A independent and change nothing else. Formally you use the new distribution P’(X,U|A) = P(X)P(U|A,X), where the Ps on the right-hand side are just the old distribution, instead of P(X,U|A) = P(X|A)P(U|A,X).
It turns out that if we put the original distribution into a causal graph with edges X->A and A->U and X->U and then make a do-intervention on A (a la Pearl), then we get this exact distribution, i.e., P(X,U|do(A)) = P’(X,U|A). (Intuitively, removing the inference from A to X is exactly what the do(A) does if A’s parent is X.) So in particular maximizing E[U | do(A)] gives the same result as maximizing E’[U|A]. Anyway, the paper uses the do operator to construct the new predictor, rather than the above argument. They seem to claim that the causal structure (or reasoning about causality) is necessary to construct the new predictor, with which I disagree.
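Here is a tiny numpy sketch (my own illustration) of the re-weighting described above: starting from a joint P(X, A, U) in which A is correlated with X, we form P’(X, U | A) = P(X) P(U | A, X), which for the graph X -> A, X -> U, A -> U coincides with P(X, U | do(A)):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)    # arbitrary joint P[x, a, u], just for illustration

P_x = P.sum(axis=(1, 2))                          # P(x)
P_u_given_xa = P / P.sum(axis=2, keepdims=True)   # P(u | x, a)

P_do = P_x[:, None, None] * P_u_given_xa          # P'(x, u | a) = P(x) P(u | a, x) = P(x, u | do(a))
P_obs = P / P.sum(axis=(0, 2), keepdims=True)     # observational P(x, u | a) = P(x | a) P(u | a, x)
```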
Is this really CDT? I’m not sure… In the above type of case, this doesn’t come apart from EDT. If we buy that their scenario is a bit like a Smoking Lesion, then one could argue that part of the point of CDT is to solve this type of scenario. (In some sense my response is as in most versions of the Smoking Lesion: Because of the tickle defense, EDT applied properly gets this right anyway, so there’s actually nothing to fix here.) In my view it’s basically just about using the do-calculus to concisely specify the scenario P’ (based on P plus a particular causal graph for P). It seems that one can do these things without being committed to using do(A) in a scenario where there’s some non-causal dependence between A and U (that doesn’t disappear outside of training), perhaps via some common cause Y. In any case, the paper doesn’t tell us how to distinguish between U ← Y → A and A → Y → U — all causal relationships are assumed. So while nominally they construct their predictor as E[U | do(A)], it’s a bit unclear how wedded they are to CDT.
Anyway, with a (maybe-causalist) E[U | do(A)] in hand, we can of course build a (maybe-)CDT agent by choosing a to maximize E[U | do(A)]. But I think the paper doesn’t say anything about where to get the causal model from that gives us E[U | do(A)]. They pretty much assume that the model is provided.
I think the “counterfactual teaching” stuff doesn’t really say anything about CDT versus EDT, either. IIUC the basic idea is this. Imagine you want to train an LLM and you want to prevent the issue above. Then intuitively — in my distribution shift view — what we need to do is just train the LLM to make a good prediction upon observing tokens x_1, ..., x_9 that were generated by itself (rather than humans). The simplest, most obvious way to do this is to let the LLM generate some tokens x_1, ..., x_9, then get a probabilistic prediction about the next token from the LLM and then ask a human to give a next token x_10. The loss of the LLM is just the, e.g., log loss of its prediction against the x_10 provided by the human. One slightly tricky point here is that we only train the LLM to make good predictions on x_10. We don’t want to train it to output x_1, ..., x_9 that make x_10 easier to predict. So we need to be careful to choose the right gradient. I think that’s basically all they’re doing, though. It doesn’t seem like there’s anything causalist here.
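Here is a minimal sketch of that training step as I understand it (not the paper’s code; it assumes a HuggingFace-style model interface with .generate and .logits, and the names are placeholders):

```python
import torch
import torch.nn.functional as F

def counterfactual_teaching_step(model, optimizer, prompt_ids, human_next_token_id):
    with torch.no_grad():
        # the self-generated prefix is sampled without gradients, so we don't
        # train the model to emit prefixes that make x_10 easy to predict
        prefix = model.generate(prompt_ids, max_new_tokens=9)
    logits = model(prefix).logits[:, -1, :]               # prediction for the token after the self-generated prefix
    loss = F.cross_entropy(logits, human_next_token_id)   # log loss against the token the human provides
    optimizer.zero_grad()
    loss.backward()                                        # gradients only adjust the prediction of x_10
    optimizer.step()
```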
So, in conclusion: While very interesting, I don’t think these papers tell us anything new about how to build an EDT or a CDT agent.
>I’m not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?
Sorry if I was cryptic! Yes, it’s basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because “quasi-strictly proper scoring rule w.r.t. the max decision rule” is a mouthful. :-P) Does that help?
>much safer than having it effectively chosen for them by their specification of a utility function
So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function, not more! But on to the main point...
>I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn’t pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.
Let’s grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max scoring rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other.
Maybe put differently: I still don’t know what property it is that you think favors zero-sum conditional prediction over proper decision scoring rules. I don’t think it can be not wanting to specify your utility function / not wanting the agent to pick agents based on their model of your utility function / wanting to instead choose yourself based on reported distributions, because both methods can be used in this way. Also, note that in both methods the predictors in practice have incentives that are determined by (their beliefs about) the human’s values. For example, in zero-sum conditional prediction, each predictor is incentivized to run computations to evaluate actions that it thinks could potentially be optimal w.r.t. human values, and not incentivized to think about actions that it confidently thinks are suboptimal. So for example, if I have the choice between eating chocolate ice cream, eating strawberry ice cream and eating mud, then the predictor will reason that I won’t choose to eat mud and that therefore its prediction about mud won’t be evaluated. Therefore, it will probably not think much about how what it will be like if I eat mud (though it has to think about it a little to make sure that the other predictor can’t gain by recommending mud eating).
On whether the property is desirable [ETA: I here mean the property that the human chooses based on the reported distributions, not the comparison to explicitly specifying a utility function]: Perhaps my objection is just what you mean by ELK. In any case, I think my views depend a bit on how we imagine lots of different aspects of the overall alignment scheme. One important question, I think, is how exactly we imagine the human to “look at” the distributions, for example. But my worry is that (similar to RLHF) letting the human evaluate distributions rather than outcomes increases the predictors’ incentives to deceive the human. The incentive is to find actions whose distribution looks good (in whatever format you represent the distribution) in relation to the other distributions, not actions whose distributions actually are good. Given that the distributions are so large (and less importantly because humans have lots of systematic, exploitable irrationalities related to risk), I would think that human judgment of single outcomes/point distributions is much better than human judgment of full distributions.
I think in the social choice literature, people almost always mean preference utilitarianism when they say “utilitarianism”, whereas in the philosophical/ethics literature people are more likely to mean hedonic utilitarianism. I think the reason for this is that in the social choice and somewhat adjacent game (and decision) theory literature, utility functions have a fairly solid foundation as a representation of preferences of rational agents. (For example, Harsanyi’s “[preference] utilitarian theorem” paper and Nash’s paper on the Nash bargaining solution make very explicit reference to this foundation.) Whereas there is no solid foundation for numeric hedonic welfare (at least not in this literature, but also not elsewhere as far as I know).