Academic website: https://www.andrew.cmu.edu/user/coesterh/
Caspar Oesterheld
I agree that some notions of free will imply that Newcomb’s problem is impossible to set up. But if one of these notions is what is meant, then the premise of Newcomb’s problem is that these notions are false, right?
It also happens that I disagree that these notions are relevant to what free will is.
Anyway, if this had been discussed in the original post, I wouldn’t have complained.
What’s the reasoning behind mentioning Roko’s basilisk, which is fairly controversial and often deemed dangerous, over less risky forms of acausal trade (like superrational cooperation with human-aligned branches)?
Free will is a controversial, confusing term that, I suspect, different people take to mean different things. I think to most readers (including me) it is unclear what exactly the Case 1 versus 2 distinction means. (What physical property of the world differs between the two worlds? Maybe by “not having free will” you mean something very mundane, similar to how I don’t have free will about whether to fly to Venus tomorrow because it’s just not physically possible for me to fly to Venus, so I have to “decide” not to fly to Venus?)
I generally think that free will is not so relevant in Newcomb’s problem. It seems that whether there is some entity somewhere in the world that can predict what I’m doing shouldn’t make a difference for whether I have free will or not, at least if this entity isn’t revealing its predictions to me before I choose. (I think this is also the consensus on this forum and in the philosophy literature on Newcomb’s problem.)
>CDT believers only see the second decision. The key here is realising there are two decisions.
Free will aside, as far as I understand, your position is basically in line with what most causal decision theorists believe: You should two-box, but you should commit to one-boxing if you can do so before your brain is scanned. Is that right? (I can give some references to discussions of CDT and commitment if you’re interested.)
If so, how do you feel about the various arguments that people have made against CDT? For example, what would you do in the following scenario?
>Two boxes, B1 and B2, are on offer. You may purchase one or none of the boxes but not both. Each of the two boxes costs $1. Yesterday, Omega put $3 in each box that she predicted you would not acquire. Omega’s predictions are accurate with probability 0.75.
In this scenario, CDT always recommends buying a box, which seems like a bad idea: from the perspective of the seller of the boxes, they profit in expectation whenever you buy from them.
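For concreteness, here’s my own quick arithmetic (not part of the original scenario statement): if you buy a box, then with probability 0.75 Omega correctly predicted the purchase and left that box empty, so your expected payoff is 0.25 × $3 − $1 = −$0.25, whereas buying nothing yields $0. A CDT agent that always buys therefore loses $0.25 per round in expectation, which is exactly the seller’s expected profit.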
>TDT believers only see the first decision, [...] The key here is realising there are two decisions.
I think proponents of TDT and especially Updateless Decision Theory and friends are fully aware of this possible “two-decisions” perspective. (Though typically Newcomb’s problem is described as only having one of the two decision points, namely the second.) They propose that the correct way to make the second decision (after the brain scan) is to take the perspective of the first decision (or similar). Of course, one could debate whether this move is valid and this has been discussed (e.g., here, here, or here).
Also: Note that evidential decision theorists would argue that you should one-box in the second decision (after the brain scan) for reasons unrelated to the first-decision perspective. In fact, I think that most proponents of TDT/UDT/… would agree with this reasoning also, i.e., even if it weren’t for the “first decision” perspective, they’d still favor one-boxing. (To really get the first decision/second decision conflict you need cases like counterfactual mugging.)
I haven’t read this page in detail. I agree, obviously, that on many prompts Bing Chat, like ChatGPT, gives very impressive answers. Also, there are clearly examples on which Bing Chat gives a much better answer than GPT3. But I don’t give lists like the one you linked that much weight. For one, for all I know, the examples are cherry-picked to be positive. I think for evaluating these models it is important that they sometimes give indistinguishable-from-human answers and sometimes make extremely simple errors. (I’m still very unsure about what to make of it overall. But if I only knew of all the positive examples and thought that the corresponding prompts weren’t selection-biased, I’d think ChatGPT/Bing is already superintelligent.) So I give more weight to my few hours of generating somewhat random prompts (though I confess, I sometimes try deliberately to trip either system up). Second, I find the examples on that page hard to evaluate, because they’re mostly creative-writing tasks. I give more weight to prompts where I can easily evaluate the answer as true or false, e.g., questions about the opening hours of places, prime numbers or what cities are closest to London, especially if the correct answer would be my best prediction for a human answer.
That’s interesting, but I don’t give it much weight. A lot of things that are close to Monty Fall are in GPT’s training data. In particular, I believe that many introductions to the Monty Hall problem discuss versions of Monty Fall quite explicitly. Most reasonable introductions to Monty Hall discuss that what makes the problem work is that Monty Hall opens a door according to specific rules and not uniformly at random. Also, even humans (famously) get questions related to Monty Hall wrong. If you talk to a randomly sampled human and they happen to get questions related to Monty Hall right, you’d probably conclude (or at least strongly update towards thinking) that they’ve been exposed to explanations of the problem before (not that they solved it all correctly on the spot). So to me the likely way in which LLMs get Monty Fall (or Monty Hall) right is that they learn to better match it onto their training data. Of course, that is progress. But it’s (to me) not very impressive/important. Obviously, it would be very impressive if it got any of these problems right if they had been thoroughly excluded from its training data.
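In case it helps readers who haven’t seen the distinction spelled out, here is a minimal simulation sketch (my own, not from any of the linked material) of why the host’s rule matters: in standard Monty Hall, switching wins about 2/3 of the time, while in Monty Fall (the host opens an unchosen door uniformly at random and merely happens to reveal a goat) switching wins only about 1/2 of the time conditional on a goat being revealed.

```python
import random

def trial(monty_fall: bool):
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    if monty_fall:
        # Host opens a uniformly random unchosen door; discard trials where
        # the car is revealed (we condition on seeing a goat).
        opened = random.choice([d for d in doors if d != pick])
        if opened == car:
            return None
    else:
        # Standard Monty Hall: host knowingly opens a goat door.
        opened = random.choice([d for d in doors if d != pick and d != car])
    switch = next(d for d in doors if d != pick and d != opened)
    return switch == car

def switch_win_rate(monty_fall: bool, n: int = 100_000) -> float:
    results = [trial(monty_fall) for _ in range(n)]
    results = [r for r in results if r is not None]
    return sum(results) / len(results)

print("Monty Hall, switching wins:", switch_win_rate(False))  # ~0.667
print("Monty Fall, switching wins:", switch_win_rate(True))   # ~0.5
```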
To me Bing Chat actually seems worse/less impressive (e.g., more likely to give incorrect or irrelevant answers) than ChatGPT, so I’m a bit surprised. Am I the only one that feels this way? I’ve mostly tried the two systems on somewhat different kinds of prompts, though. (For example, I’ve tried (with little success) to use Bing Chat instead of Google Search.) Presumably some of this is related to the fine-tuning being worse for Bing? I also wonder whether the fact that Bing Chat is hooked up to search in a somewhat transparent way makes it seem less impressive. On many questions it’s “just” copy-and-pasting key terms of the question into a search engine and summarizing the top result. Anyway, obviously I’ve not done any rigorous testing...
There’s a Math Stack Exchange question: “Conjectures that have been disproved with extremely large counterexamples?” Maybe some of the examples in the answers over there would count? For example, there’s Euler’s sum of powers conjecture, which only has large counterexamples (for high k), found via ~brute force search.
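For instance, the classic Lander–Parkin counterexample for k = 5 (found by a direct computer search in the 1960s, if I remember the history correctly) is easy to verify:

```python
# Lander & Parkin's counterexample to Euler's sum of powers conjecture for k = 5:
# a fifth power expressed as a sum of only four fifth powers.
lhs = 27**5 + 84**5 + 110**5 + 133**5
rhs = 144**5
assert lhs == rhs  # both equal 61917364224
print(lhs, rhs)
```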
>Imagine trying to do physics without being able to say things like, “Imagine we have a 1kg frictionless ball...”, mathematics without being able to entertain the truth of a proposition that may be false or divide a problem into cases and philosophy without being allowed to do thought experiments. Counterfactuals are such a basic concept that it makes sense to believe that they—or something very much like them—are a primitive.
In my mind, there’s quite some difference between all these different types of counterfactuals. For example, consider the counterfactual question, “What would have happened if Lee Harvey Oswald hadn’t shot Kennedy?” I think the meaning of this counterfactual is kind of like the meaning of the word “chair”.
- For one, I don’t think this counterfactual is very precisely defined. What exactly are we asked to imagine? A world that is like ours, except that the laws of physics in Oswald’s gun were temporarily suspended to save JFK’s life? (Similarly, it is not exactly clear what counts as a chair (or to what extent) and what doesn’t.)
- Second, it seems that the users of the English language all have roughly the same understanding of what the meaning of the counterfactual is, to the extent that we can use it to communicate effectively. For example, if I say, “if LHO hadn’t shot JFK, US GDP today would be a bit higher than it is in fact”, then you might understand that to mean that I think JFK had good economic policies, or that people were generally influenced negatively by the news of his death, or the like. (Maybe a more specific example: “If it hadn’t suddenly started to rain, I would have been on time.” This is a counterfactual, but it communicates things about the real world, such as: I didn’t just get lost in thought this morning.) (Similarly, when you tell me to get a “chair” from the neighboring room, I will typically do what you want me to do, namely to bring a chair.)
- Third, because it is used for communication, some notions of counterfactuals are more useful than others, because they are better for transferring information between people. At the same time, usefulness as a metric still leaves enough open to make it practically and theoretically impossible to identify a unique optimal notion of counterfactuals. (Again, this is very similar to a concept like “chair”. It is objectively useful to have a word for chairs. But it’s not clear exactly which borderline objects it is more useful for “chair” to include or exclude.)
- Fourth, whatever notion of counterfactual we adopt for this purpose has no normative force outside of communication; it doesn’t interact with our decision theory or anything. For example, causal counterfactuals as advocated by causal decision theorists are kind of similar to the “If LHO hadn’t shot JFK” counterfactuals. (E.g., both are happy to consider literally impossible worlds.) As you probably know, I’m partial to evidential decision theory. So I don’t think these causal counterfactuals should ultimately be the guide of our decisions. Nevertheless, I’m as happy as anyone to adopt the linguistic conventions related to “if LHO hadn’t shot JFK”-type questions. I don’t try to reinterpret the counterfactual question as a conditional one. (Note that answers to, “how would you update on the fact that JFK survived the assassination?”, would be very different from answers to the counterfactual question. (“I’ve been lied to all my life. The history books are all wrong.”) But other conditionals could come much closer.) (Similarly, using the word “chair” in the conventional way doesn’t commit one to any course of action. In principle, Alice might use the term “chair” normally, but never sit on chairs, or only sit on green chairs, or never think about the chair concept outside of communication, etc.)

So in particular, the meaning of counterfactual claims about JFK’s survival doesn’t seem necessarily very closely related to the counterfactuals used in decision making (such as the question, “what would happen if I don’t post this comment?”, that I asked myself prior to posting this comment).
In math, meanwhile, people seem to consider counterfactuals mainly for proofs by contradiction, i.e., to prove that the claims are contrary to fact. Cf. the principle of explosion (https://en.wikipedia.org/wiki/Principle_of_explosion), which makes it difficult to use the regular rules of logic to talk about counterfactuals.
Do you agree or disagree with this (i.e., with the claim that these different uses of counterfactuals aren’t very closely connected)?
Stop-gradients lead to fixed point predictions
Proper scoring rules don’t guarantee predicting fixed points
In general it seems that currently the podcast can only be found on Spotify.
So the argument/characterization of the Nash bargaining solution is the following (correct?): The Nash bargaining solution is the (almost unique) outcome o for which there is a rescaling w of the utility functions such that both the utilitarian solution under rescaling w and the egalitarian solution under rescaling w are o. This seems interesting! (Currently this is a bit hidden in the proof.)
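To spell out how I’m reading this (my notation, assuming utilities are normalized so that the disagreement point is at zero): the claim is that there exist weights w_1, ..., w_n > 0 such that o maximizes both Σ_i w_i u_i (the utilitarian objective under rescaling w) and min_i w_i u_i (the egalitarian objective under rescaling w), and that this o coincides with the Nash bargaining solution, i.e., the maximizer of Π_i u_i.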
Do you show the (almost) uniqueness of o, though? You show that the Nash bargaining solution has the property, but you don’t show that no other solution has this property, right?
Nice!
I’d be interested in learning more about your views on some of the tangents:
>Utilities are bounded.
Why? It seems easy to imagine expected utility maximizers whose behavior can only be described with unbounded utility functions, for example.
>I think many phenomena that get labeled as politics are actually about fighting over where to draw the boundaries.
I suppose there are cases where the connection is very direct (drawing district boundaries, forming coalitions for governments). But can you say more about what you have in mind here?
Also:
>Not, they are in a positive sum
I assume the first word is a typo. (In particular, it’s one that might make the post less readable, so perhaps worth correcting.)
I think in the social choice literature, people almost always mean preference utilitarianism when they say “utilitarianism”, whereas in the philosophical/ethics literature people are more likely to mean hedonic utilitarianism. I think the reason for this is that in the social choice and somewhat adjacent game (and decision) theory literature, utility functions have a fairly solid foundation as a representation of preferences of rational agents. (For example, Harsanyi’s “[preference] utilitarian theorem” paper and Nash’s paper on the Nash bargaining solution make very explicit reference to this foundation.) Whereas there is no solid foundation for numeric hedonic welfare (at least not in this literature, but also not elsewhere as far as I know).
>Anthropically, our existence provides evidence for them being favored.
There are some complications here. It depends a bit on how you make anthropic updates (if you do them at all). But it turns out that the version of updating that “works” with EDT basically doesn’t make the update that you’re in the majority. See my draft on decision making with anthropic updates.
>Annex: EDT being counter-intuitive?
I mean, in regular probability calculus, this is all unproblematic, right? That’s because of the tower rule, a.k.a. the law of total expectation, or, similarly, conservation of expected evidence. There are also issues of updatelessness, though, which you touch on at various places in the post. E.g., see Almond’s “lack of knowledge is [evidential] power” or scenarios like the Transparent Newcomb’s problem, wherein EDT wants to prevent itself from seeing the contents of the boxes.
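(For reference, the identity I have in mind, in standard notation: E[X] = E[E[X | Y]]. Before observing Y, your expectation of X already equals the expectation of your posterior expectation, which is the formal sense in which you can’t expect evidence to push your estimate in any particular direction.)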
>It seems plausible that evolutionary pressures select for utility functions broadly as ours
Well, at least in some ways similar to ours, right? On questions like whether rooms are better painted red or green, I assume there isn’t much reason to expect convergence. But on questions of whether happiness is better than suffering, I think one should expect evolved agents to mostly give the right answers.
>to compare such maximizations, you already need a decision theory (which tells you what “maximizing your goals” even is).
Incidentally I published a blog post about this only a few weeks ago (which will probably not contain any ideas that are new to you).
>Might there be some situation in which an agent wants to ensure all of its correlates are Good Twins
I don’t think this is possible.
There have been discussions of the suffering of wild animals. David Pearce discusses this, see one of the other comment threads. Some other starting points:
>As a utilitarian then, it should be far more important to wipe out as many animal habitats as possible rather than avoiding eating a relatively small number of animals by being a vegan.
To utilitarians, there are other considerations in assessing the value of wiping out animal habitats, like the effect of such habitats on global warming.
Nice post!
What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans?

Perhaps relatedly:
>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?
I don’t get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, “Can this design be turned into a bomb?” and the AI says, “No, it’s safe for such and such reasons”, then if you can’t evaluate these reasons, it doesn’t help you that you have asked the right question.
Sounds interesting! Are you going to post the reading list somewhere once it is completed?
(Sorry for self-promotion in the below!)
I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.
Here’s a pitch in the language of incentivizing AI systems—the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:
1) It makes predictions about the world.
2) It takes actions that influence the world. (In the paper, we specifically imagine that the agent makes recommendations to a principal who then takes the recommended action.) Note that if the predictions are seen by humanity, they themselves influence the world. So even a pure oracle AI might satisfy 2, as has been discussed before (see end of this comment).
We want to design a reward system for this agent such that the agent maximizes its reward by making accurate predictions and taking actions that maximize our, the principals’, utility. The challenge is that if we reward the accuracy of the agent’s predictions, we may give the agent an incentive to make the world more predictable, which will generally not be aligned with maximizing our utility.
So how can we properly incentivize the agent? The paper provides a full and very simple characterization of such incentive schemes, which we call proper decision scoring rules:
We show that proper decision scoring rules cannot give the [agent] strict incentives to report any properties of the outcome distribution [...] other than its expected utility. Intuitively, rewarding the [agent] for getting anything else about the distribution right will make him [take] actions whose outcome is easy to predict as opposed to actions with high expected utility [for the principal]. Hence, the [agent’s] reward can depend only on the reported expected utility for the recommended action. [...] we then obtain four characterizations of proper decision scoring rules, two of which are analogous to existing results on proper affine scoring [...]. One of the [...] characterizations [...] has an especially intuitive interpretation in economic contexts: the principal offers shares in her project to the [agent] at some pricing schedule. The price schedule does not depend on the action chosen. Thus, given the chosen action, the [agent] is incentivized to buy shares up to the point where the price of a share exceeds the expected value of the share, thereby revealing the principal’s expected utility. Moreover, once the [agent] has some positive share in the principal’s utility, it will be (strictly) incentivized to [take] an optimal action.
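To make the shares interpretation concrete, here is a toy numerical sketch of my own (with a hypothetical linear price schedule, not code or numbers from the paper): the agent buys q shares at price schedule p(s) = s, so its reward on outcome u is q * u - q^2/2, its expected reward is q * E[u] - q^2/2, and this is maximized at q = E[u]. The optimal purchase thus reveals the expected utility of the recommended action, and since the agent ends up holding a positive share, it also prefers recommending the action with higher expected utility over the merely more predictable one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical actions with different outcome distributions (toy numbers).
actions = {
    "a": rng.normal(loc=2.0, scale=3.0, size=100_000),  # higher expected utility, noisier
    "b": rng.normal(loc=1.0, scale=0.1, size=100_000),  # lower expected utility, very predictable
}

def expected_reward(outcomes, q):
    """Agent buys q shares at the linear price schedule p(s) = s.

    Reward per outcome u: q * u - q**2 / 2 (share of utility minus total price paid)."""
    return np.mean(q * outcomes - q**2 / 2)

for name, outcomes in actions.items():
    qs = np.linspace(0.0, 3.0, 301)
    rewards = [expected_reward(outcomes, q) for q in qs]
    best_q = qs[int(np.argmax(rewards))]
    print(f"action {name}: optimal share purchase ~ {best_q:.2f}, "
          f"true E[u] ~ {outcomes.mean():.2f}, expected reward ~ {max(rewards):.2f}")

# The optimal q matches E[u] for each action, and (here, with positive expected utilities)
# the achievable expected reward (~ E[u]**2 / 2) is higher for the action with higher
# expected utility, so the agent is not rewarded for making outcomes more predictable,
# only for higher expected utility.
```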
Also see Johannes Treutlein’s post on “Training goals for large language models”, which also discusses some of the above results among other things that seem like they might be a good fit for the reading group, e.g., Armstrong and O’Rourke’s work.
My motivation for working on this was to address issues of decision making under logical uncertainty. For this I drew inspiration from the fact that Garrabrant et al.’s work on logical induction is also inspired by market design ideas (specifically prediction markets).
>Because there’s “always a bigger infinity” no matter which you choose, any aggregation function you can use to make decisions is going to have to saturate at some infinite cardinality, beyond which it just gives some constant answer.
Couldn’t one use a lexicographic utility function that has infinitely many levels? I don’t know exactly how this works out technically. I know that maximizing the expectation of a lexicographic utility function is equivalent to the vNM axioms without continuity, see Blume et al. (1989). But they only mention the case of infinitely many levels in passing.
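For what it’s worth, here is a toy sketch of the finite-levels version (my own illustration, not from Blume et al.): represent a utility as a tuple of levels, take expectations level by level, and compare the resulting tuples lexicographically. This behaves like expected utility maximization except that it violates continuity: no gain at level 1, however large, can compensate for even a tiny expected loss at level 0.

```python
from itertools import zip_longest

def expected_lex_utility(lottery):
    """Expected utility of a lottery over lexicographic utilities.

    lottery: list of (probability, utility_tuple) pairs. Expectations are taken
    level by level; the resulting tuples are compared lexicographically."""
    levels = zip_longest(*(u for _, u in lottery), fillvalue=0.0)
    return tuple(sum(p * x for (p, _), x in zip(lottery, level)) for level in levels)

# Level 0 dominates level 1 no matter the magnitudes at level 1:
safe = [(1.0, (1.0, 0.0))]
risky = [(0.5, (1.0, 10**9)), (0.5, (0.999, 0.0))]
print(expected_lex_utility(safe) > expected_lex_utility(risky))  # True: 1.0 > 0.9995 at level 0
```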
>We mentioned both.
Did you, though? Besides Roko’s basilisk, the references to acausal trade seem vague, but to me they sound like the kinds that could easily make things worse. In particular, you don’t explicitly discuss superrationality, right?
>Finally, while it might have been a good idea initially to treat Roko’s basilisk as an information hazard to be ignored, that is no longer possible so the marginal cost of mentioning it seems tiny.
I agree that due to how widespread the idea of Roko’s basilisk is, it overall matters relatively little whether this idea is mentioned, but I think this applies similarly in both directions.