My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about “consequentialist reasoning”).
VojtaKovarik
My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn’t agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn’t be controversial.
To illustrate on an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard.[1] And now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most people to argue that your car will be exactly as safe in both cases.
Now, the fact that the rod is equally safe in the two cases does not mean that using it in the two cases is equally smart. Indeed, my personal view is that without the safety standard, using the rod for your car would be quite dumb. (And using the metaphorical rod for your superintelligent AI would be f.....g insane.) But I think that saying things like “any model without safety standards is unsafe” is misleading, and likely to be unproductive when communicating with people who have a different mindset.[2]
- ^
Then you might perhaps say that “the metal rod is safe for the purpose of being used for a particular thingy in a car”? However, this itself does not guarantee that the metal rod is actually safe for this purpose—for that to be true, you additionally need the assumption that the safety standard is “reasonable”. Where “reasonable” is undefined. And if you try to define it, you will eventually either need to rely on an informal/undefined/common-sense definition somewhere, or you need to figure out how to formalise literally everything. I am not giving up on that latter goal, but we aren’t there yet :-).
- ^
Personally, I think that being in any sort of decision-making or high-impact role while having such “different mindset” is analogous to driving a bus full of people without having a driver’s license. However, what I think about this makes no meaningful impact on how things work in the real world...
- ^
I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is—provably—guaranteed to have no negative impact on earth.
Proof:
| | | | | | | |
v v v v v v v v
__________ CC
| ___ | CCCC
| / O O \ | :-) CCC :-)
| | o C o | | _|_ || o _|_
| \ o _ / | | ||/ |
|_________ | /\ || /\
--------------------------------------------------------
□
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) “outside of the Overton window, or something”[1].
Critch, preceding post: Strategies involving non-Overton elements are not worth it
Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements
Eliezer, this comment: the “AI immune system” example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn’t be able to name a strategy towards a pivotal outcome, with no non-Overton elements
Extreme version of this: Any practical-in-our-world strategy towards a pivotal outcome necessarily contains some non-Overton elements
- ^
Substitute your better characterization of the undesirable property here. I will just use “non-Overton” for the purposes of this comment.
- ^
My ~2-hour reaction to the challenge:[1]
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their “alignment plan” makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. Their plan is thus perhaps not unarguably harmful, but certainly irresponsible.[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI’s leadership.
(III)[3] My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI’s alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI’s capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don’t have any inside knowledge into OpenAI. This assumption seems plausible to me, and very sad.)
(IV) A general comment that I would otherwise need to repeat for essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out not to hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.
Assumptions implied by OpenAI’s plans, with my reactions:
(V) Easy alignment / warning shots for misaligned AGI:
“Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]”
My biggest objection to the whole plan already concerns the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don’t need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don’t expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
(VI) “AGI alignment” isn’t “AGI complete”:
This is already acknowledged in the post: “It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.” However, it isn’t exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that “this time, it will be powerful enough to help with alignment”?
(VII) Related assumption: No lethal discontinuities:
The whole post suggests the workflow “new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1”. (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don’t see evidence for “ability to hold off on capabilities research”. What are the organizational procedures allowing this?
(VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn’t mention because I didn’t have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
(IX) Regarding “outer alignment”: Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
It is good to at least acknowledge that there might be other parts of AI alignment than just “figuring out learning from human feedback (& human-feedback augmentation)”. However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
(X) Ability to differentially use capabilities progress towards alignment progress:
The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won’t also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
(XI) Creating an aligned AI is sufficient for getting AI to go well:
The plan doesn’t say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton-window strategies, (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, and (c) no “explicit assumptions & detection system & course-correction-procedure” for “nothing will go wrong if we just do (b)”.
General complaint: The plan is not a plan at all! It’s just a meta-plan.
(XII) Ultimately, I would paraphrase the plan-as-stated as: “We don’t know how to solve alignment. It seems hard. Let’s first build an AI to make us smarter, and then try again.” I think OpenAI should clarify whether this is literally true, or whether there is some idea of what the object-level AI alignment plan looks like—and if so, what it is.
(XIII) For example, the post mentions that “robustness and interpretability research [is important for the plan]”. However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn’t make it any less of an issue!) This means that the plan is not detailed enough.
As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can’t see the step that would fail if X was untrue. This doesn’t say anything good about your proof.
- ^
Eliezer adds: “For this reason, please note explicitly if you’re saying things that you heard from a MIRI person at a gathering, or the like.”
As far as I know, I came up with points (I), (III), and (XII) myself and I don’t remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI’s publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI had disappeared 6 years ago?)
- ^
(II) For example, consider the following claim: “We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants.” My reaction: Yes, technically speaking this is true. But likewise—please excuse the jarring analogy—the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn’t it be even better if you personally didn’t plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
- ^
Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).
Just a comment, not meant to contradict anything in the post:
Indeed, to kill literally everyone, you don’t need to kill everyone at the same time. You only need to weaken/control civilization enough such that it can’t threaten you. And then mop up the remainders when it becomes relevant. (But yes, you do need some way of operating in the real world at some point, either through robots or through getting some humans to do the work for you. But since enslavement worked in the past, the latter shouldn’t be hard.)
tl;dr: It seems noteworthy that “deepware” has strong connotations with “it involves magic”, while the same is not true for AI in general.
I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)
As we go along the progression “Tools > Machines > Electric > Electronic > Digital”, most[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.[2]
[For example: A hammer does basically no automation. > A machine like a lawn-mower automates a rigidly-designed rotation of the blades. > An electric kettle does-its-thingy. > An electronic calculator automates calculating algorithms that we understand, but can do it for much larger inputs than we could handle. > An algorithm like Monte Carlo tree search automates an abstract reasoning process that we understand, but can apply it to a wide range of domains.]
But then it seems that this progression does not neatly continue to the AI paradigm. Or rather, some things that we call AI can be viewed as a continuation of this progression, while others can’t (or would constitute a discontinuous jump).
[For example, approaches like “solving problems using HCH” (minus the part where you use unknown magic to obtain a black box that imitates the human) can be viewed as automating a reasonably-well-understood process (of solving tasks by decomposing & delegating them). But there are also other things that we call AI that are not well described as a continuation of this progression—or perhaps they constitute a rather extreme jump. For example, deep learning automates the not-well-understood process of “stare at many things, then use magic to generalise”. Another example is abstract optimisation, which automates the not-well-understood process of “search through many potential solutions and pick the one that scores the best according to an objective function”. And there are examples that lie somewhere in between—for example, AlphaZero is mostly a quite well-understood process, but it does involve some opaque deep learning.]
I suppose we could refer to the distinction as “does it involve magic?”. It then seems noteworthy that “deepware” has strong connotations with magic, while the same isn’t true for all types of AI.[3]
- ^
Or perhaps just “many”? I am not quite sure, this would require going through more examples, and I was intending for this to be a quick comment.
- ^
To be clear, I am not super-confident that this progression is a legitimate phenomenon. But for the sake of argument, let’s say it is.
- ^
An interesting open question is how large a hit to competitiveness we would suffer if we restricted ourselves to systems that only involve a small amount of magic.
- ^
This post seems related to an exaggerated version of what I believe: Humans are so far from “agents trying to maximize utility” that to understand how to align AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn’t mean it’s sufficient to understand the latter. Just that it’s a prerequisite.)
As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)
Two arguments against “you must not be Dutch-bookable” that feel vaguely relevant here:
1) The _extent_ to which you are Dutch-bookable might matter. IE, if you can pump $X per day from me, that only matters for large X. So viewing Dutch-bookability as binary might be misleading.
2) Even if you are _in theory_ Dutch-bookable, it only matters if you can _actually_ be Dutch-booked. EG, if I am the meanest thing in the universe that controls everything (say, a singleton AI), I could probably ensure that I won’t get into a situation where my incoherent goals could hurt me.
My takeaway: It shouldn’t be necessary to build AI with a utility function. And it isn’t sufficient to only defend against misaligned AIs with a utility function.
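To illustrate point 1), here is a minimal sketch; the cycle structure, fee, and frequency are made-up numbers for illustration, not anything from the post:

```python
def money_pumped(fee_per_swap: float, cycles_per_day: int, days: int) -> float:
    """Total money a bookie extracts by running the 3-swap preference cycle repeatedly."""
    swaps_per_cycle = 3  # A->B, B->C, C->A brings the agent back where it started
    return fee_per_swap * swaps_per_cycle * cycles_per_day * days

# An agent that pays a tenth of a cent per swap and meets the bookie once a day is
# "incoherent" in exactly the same binary sense as one that pays $100 per swap ten
# times a day, but the practical difference is enormous:
print(money_pumped(fee_per_swap=0.001, cycles_per_day=1, days=365))   # roughly 1 dollar per year
print(money_pumped(fee_per_swap=100.0, cycles_per_day=10, days=365))  # roughly 1 million per year
```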
I want to flag that the overall tone of the post is in tension with the disclaimer that you are “not putting forward a positive argument for alignment being easy”.
To hint at what I mean, consider this claim:
Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.
I think this claim is only valid if you are in a situation such as “your probability of scheming was >95%, and this was based basically only on this particular version of the ‘counting argument’ ”. That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it—then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn’t mean our intuitions cannot be more or less detailed. It’s just that even the detailed ones are not anywhere close to being rigorous.) EG, for me personally, the vague intuition that “scheming is instrumental for a large class of goals” makes a huge contribution to my beliefs (of “something between 10% and 99% on alignment being hard”), while the particular version of the ‘counting argument’ that you describe makes basically no contribution. (And vague intuitions about simplicity priors contributing non-trivially.) So undoing that particular update does ~nothing.
I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: “So, we don’t have any rigorous arguments about AI risk being real or not, and we won’t have them for quite a while yet. Should we be super-careful about it, just in case?”. But I do think that is appropriate.
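To illustrate why undoing that particular update moves me so little, here is a toy calculation; the numbers and the log-odds bookkeeping are my own made-up illustration, not anything from the post:

```python
import math

def credence(prior_log_odds: float, considerations: list[float]) -> float:
    """Combine a prior with (assumed independent) considerations in log-odds space."""
    total = prior_log_odds + sum(considerations)
    return 1 / (1 + math.exp(-total))

# Person A: nearly all of their credence in scheming comes from the counting argument.
counting_argument = 3.0  # a large, made-up log-odds contribution
print(credence(-1.0, [counting_argument]), credence(-1.0, []))  # ~0.88 -> ~0.27: a big drop

# Person B (closer to the view above): credence comes from several vague intuitions
# (instrumental convergence, simplicity priors, ...), and this particular counting
# argument contributes ~nothing, so undoing it leaves the credence unchanged.
intuitions = [1.0, 0.8, 0.7]
print(credence(-1.0, intuitions + [0.0]), credence(-1.0, intuitions))  # ~0.82 -> ~0.82
```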
My reaction to this is something like:
Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks—at least in some cases, and we don’t have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is it that people don’t care? Suppose you managed to demonstrate failures like this in settings X, Y, and Z—would this change anything? And also, when do these failures actually matter? [Not saying they don’t, just that we should think about it.]
To elaborate:
If you understand neural networks (and how Go algorithms use them), it should be obvious that these algorithms might in principle have various vulnerabilities. You might become more confident about this once you learn about adversarial examples for image classifiers or hear arguments like “feed-forward networks can’t represent recursively-defined concepts”. But in a sense, the possibility of vulnerabilities should seem likely to you just based on the fact that neural networks (unlike some other methods) come with no relevant worst-case performance guarantees. (And to be clear, I believe all of this indeed was obvious to the authors since AlphaGo came out.)
So if your application is safety-critical, security mindset dictates that you should not use an approach like this. (Though Go and many other domains aren’t safety-critical, hence my question “when does this matter”.)
Viewed from this perspective, the value added by the paper is not “Superhuman Go AIs have vulnerabilities” but “Remember those obviously-possible vulnerabilities? Yep, it is as we said, it is not too hard to find them”.
Also, I (sadly) expect that reactions to this paper (and similar results) will mostly fall into one of the following two camps: (1) Well, duh! This was obvious. (2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities. I would be hoping for a reaction such as (3) [Oh, ok! So failures like this are probably possible for all neural networks. And no safety-critical system should rely on neural networks not having vulnerabilities, got it.] However, I mostly expect that anybody who doesn’t already believe (1) and (3) will just react as in (2).
And this motivates my point about “asking something else”. EG, how do people who don’t already believe (3) think about these things, and which arguments would they find persuasive? Is it efficient to just demonstrate as many of these failures as possible? Or are some failures more useful than others, or does this perhaps not help at all? Would it help with “goalpost moving” if we first made some people commit to specific predictions (eg, “I believe scale will solve the general problem of robustness, and in particular I think AlphaZero has no such vulnerabilities”)?
- ^
At least I remember thinking this when AlphaZero came out. (We did a small project in 2018 where we found a way to exploit AlphaZero in the tiny connect-four game, so this isn’t just misremembering / hindsight bias.)
My impression is that logical counterfactuals, counterfactuals, and comparability are—at the moment—too confused, and most disagreements here are “merely verbal” ones. Most of your questions (seem to me to) point in the direction of different people using different definitions. I feel slightly worried about going too deep into discussions along the lines of “Vojta reacts to Chris’ claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven’t met” :D.
My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models:
Definition (comparability according to Vojta): Two scenarios are comparable (given a model M and an observation sequence o) if they are both possible in M and consistent with o.
According to this view, counterfactuals only make sense if your model contains uncertainty...
(Aside on logical counterfactuals: Note that there is a difference between the model that I use and the hypothetical models I would be able to infer were I to use all my knowledge. Indeed, I can happily reason about the 6th digit of π being 7, since I don’t know what it is, despite knowing the formula for calculating π. I would only get into trouble if I were to do the calculations (and process their implications for the real world). Updating your models with new logical information seems like an important problem, but one I think is independent from counterfactual reasoning.)
...however, there remains the fact that humans do counterfactual reasoning all the time, even about impossible things (“What if I decided to not write this comment?”, “What if the Sun revolved around the Earth?”). I think this is consistent with the above definition, for three reasons. First, the models that humans use are complicated, fragmented, incomplete, and wrong. So much so that positing logical impossibilities (the Sun going around the Earth thing) doesn’t make the model inconsistent (because it is so fragmented and incomplete). Second, when doing counterfactuals, we might take it for granted that you are to replace the actual observation history o by some alternative o'. So you then apply the above definition to M and o' (e.g., me not starting to write this comment). When o' is compatible with the model we use, everything is logically consistent (in M). For example, it might actually be impossible for me to not have started writing this comment, but it was perfectly consistent with my (wrong) model. Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax M in some manner. Moreover, people might often implicitly assume the same/similar relaxation. For example, suppose I know that the month of May has 31 days. The natural relaxation is to be uncertain about month lengths while still remembering it was something between 28 and 31. I might thus say that 30 was a perfectly reasonable length, while being indignant upon being asked to consider a May that is 370 days long.
As for the implications for your question: The phrasing of 1) seems to suggest a model that has uncertainty about your decision procedure. Thus picking both 10 and 5 seems possible (and consistent with observation history of seeing the two boxes), and thus comparable. Note that this would seem fishier if you additionally posited that you are a utility maximizer (but, I argue, most people would implicitly relax this assumption if you asked them to consider the 5 counterfactual). Regarding 2) I think that “a typical AF reader” uses a model in which “a typical CDT adherent” can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for “typical AF readers”. I think that “a typical CDT adherent” uses a model in which “CDT adherents” find the box empty while one-boxers find it full, thus making the options incomparable. The third question I didn’t understand.
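To make the comparability definition above concrete, here is a minimal sketch; encoding a “model” as a set of scenarios the agent considers possible is my own simplification:

```python
# Toy version of the definition: two scenarios are comparable (given model M and
# observation sequence o) if both are possible in M and both are consistent with o.
def consistent(scenario: tuple, observations: tuple) -> bool:
    """A scenario is consistent with o if it reproduces the observations seen so far."""
    return scenario[:len(observations)] == observations

def comparable(s1: tuple, s2: tuple, model: set, observations: tuple) -> bool:
    return all(s in model and consistent(s, observations) for s in (s1, s2))

# Newcomb-flavoured toy: scenarios are (action, payoff). A model that is uncertain
# about the agent's own decision makes one-boxing and two-boxing comparable:
uncertain_model = {("one-box", 1_000_000), ("two-box", 1_000)}
print(comparable(("one-box", 1_000_000), ("two-box", 1_000), uncertain_model, ()))  # True

# A model that already rules out one of the actions makes them incomparable:
committed_model = {("two-box", 1_000)}
print(comparable(("one-box", 1_000_000), ("two-box", 1_000), committed_model, ()))  # False
```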
Disclaimer: I haven’t been keeping up to date on discussions regarding these matters, so it might be that what I write has some obvious and known holes in it...
I guess on first reading, you can cheat by reading the introduction, Section 2 right after that, and the conclusion. One level above that is reading the text but skipping the more technical sections (4 and 5). Or possibly reading 4 and 5 as well, but only focusing on the informal meaning of the formal results.
Regarding the background knowledge required for the paper: It uses some game theory (Nash equilibria, extensive form games) and probability theory (expectations, probability measures, conditional probability). Strictly speaking, you can get all of this from looking up whichever keywords on wikipedia. I think that all of the concepts used there are basic in the corresponding fields, and in particular no special knowledge of measure theory is required. However, I studied both game theory and measure theory, so I am biased, and you shouldn’t trust me. (Moreover, there is a difference between “strictly speaking, only this is needed” and “my intuitions are informed by X, Y, and Z”.)
Another thing is that the AAAI workshop where this will appear has a page limit, which means that some explanations might have gotten less space than they would deserve. In particular, the arguments in Section 4 are much easier to digest if you can draw the functions that the text talks about. To understand the formal results, I think I visualized two-dimensional slices of the “world space” (i.e., squares), and assumed that the value of the function is 0 by default, except for being 1 at some selected subset of the square. This allows you to compute all the expectations and conditionals visually.
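For instance, a toy version of that visualization (my own illustration, not code from the paper) might look like this:

```python
# The "world space" is the unit square, the function is an indicator (0 by default,
# 1 on a chosen region), and expectations / conditionals reduce to comparing areas.
import numpy as np

rng = np.random.default_rng(0)
worlds = rng.uniform(size=(100_000, 2))    # uniform samples from the square
x, y = worlds[:, 0], worlds[:, 1]

f = ((x > 0.5) & (y > 0.5)).astype(float)  # indicator of the top-right quadrant

print(f.mean())           # E[f] ~ area of the region = 0.25
print(f[x > 0.5].mean())  # E[f | x > 0.5] ~ 0.5
print(f[y < 0.5].mean())  # E[f | y < 0.5] ~ 0.0
```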
To expand on the idea of meta-systems and their capability: Similarly to discussing brain efficiency, we could ask about the efficiency of our civilization (in the sense of being able to point its capability to a unified goal), among all possible ways of organising civilisations. If our civilisation is very inefficient, AI could figure out a better design and foom that way.
Primarily, I think the question of our civilization’s efficiency is unclear. My intuition is that our civilization is quite inefficient, with the following points serving as weak evidence:
Civilization hasn’t been around that long, and has therefore not been optimised much.
The point (1) gets even more pronounced as you go from “designs for cooperation among a small group” to “designs for cooperation among millions”, or even billions. (Because fewer of these were running in parallel, and for a shorter time.)
The fact that civilization runs on humans, who are selfish etc, might severely limit the space of designs that have been tried.
As a lower bound, it seems that something like Yudkowsky’s ideas about dath ilan might work. (Not to be mistaken with “we can get there from here”, “works for humans”, or “none of Yudkowsky’s ideas have holes in them”.)
None of this contradicts your arguments, but it adds uncertainty and should make us more cautious about AI. (Not that I interpret the post as advocating against caution.)
Another piece of related work: Simon Zhuang, Dylan Hadfield-Mennel: Consequences of Misaligned AI.
The authors assume a model where the state of the world is characterized by multiple “features”. There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so—by definition—features are things we care about (I imagine money, QALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a strict subset of the features, all of the non-specified features get thrown under the bus. (B) However, if you can optimise things gradually, then you can alternate which features you focus on, and somehow things will end up being pretty okay.
Personal note: Because of the assumption (2), I find the result (A) extremely unsurprising, and perhaps misleading. Yes, it is true that at the Pareto frontier of resource allocation, there is no space for positive-sum interactions (ie, getting better on some axis must hurt us on some other axis). But the assumption (2) instead claims that positive-sum interactions are literally never possible. This is clearly untrue in the real world, about things we care about.
That said, I find the result (B) quite interesting, and I don’t mean to hate on the paper :-).
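A minimal numerical sketch of result (A), using my own made-up features and (linear) utilities rather than the paper's construction:

```python
import numpy as np

n_features, budget = 4, 10.0
marginal_utility = np.array([1.0, 2.0, 3.0, 4.0])  # true utility increases in every feature (assumption 1)
proxy_mask = np.array([1.0, 1.0, 0.0, 0.0])         # the optimiser is only told about features 0 and 1

# With assumption (2), i.e. every increase in any feature costs budget, a proxy-optimiser
# spends the whole budget on whichever specified feature looks best, and the
# unspecified features get nothing (result A):
proxy_value = marginal_utility * proxy_mask
allocation = np.zeros(n_features)
allocation[np.argmax(proxy_value)] = budget
print(allocation)  # [ 0. 10.  0.  0.]

# True utility achieved, vs. what an even split over all features would have given:
print(marginal_utility @ allocation)                                # 20.0
print(marginal_utility @ np.full(n_features, budget / n_features))  # 25.0
```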
Two comments to this:
1) The scenario described here is a Nash equilibrium but not a subgame-perfect Nash equilibrium. (IE, there are counterfactual parts of the game tree where the players behave “irrationally”.) Note that subgame-perfection is orthogonal to “reasonable policy to have”, so the argument “yeah, clearly the solution is to always require subgame-perfection” does not work. (Why is it orthogonal? First, the example from the post shows a “stupid” policy that isn’t subgame-perfect. However, there are cases where subgame-imperfection seems smart, because it ensures that those counterfactual situations don’t become reality. EG, humans are somewhat transparent to each other, so having the policy of refusing unfair splits in the Final Offer / Ultimatum game can lead to not being offered unfair splits in the first place.)
2) You could modify the scenario such that the “99 equilibrium” becomes more robust. (EG, suppose the players have a way of paying a bit to punish a specific player a lot. Then you add the norm of turning the temperature to 99, the meta-norm of punishing defectors, the meta-meta-norm of punishing those who don’t punish defectors, etc. And tadaaaa, you have a pretty robust hell. This is a part of how society actually works, except that usually those norms enforce pro-social behaviour.)
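To make point 1) concrete, here is a toy two-stage version of the thermostat example; the payoffs and the punishment structure are my own made-up illustration:

```python
# Stage 1: each player picks a temperature; stage 2: each player may pay a small cost
# to punish chosen players a lot. "Everyone sets 99 and punishes deviators" is a Nash
# equilibrium, but the punishment threat is not credible, so it is not subgame-perfect.

COMFORT = {30: 10.0, 99: 0.0}         # everyone actually prefers 30 degrees
PUNISH_COST, PUNISH_HARM = 1.0, 50.0

def payoff(i, temps, punishments):
    """punishments[j] = set of players that j punishes in stage 2."""
    harm = sum(PUNISH_HARM for j, targets in punishments.items() if i in targets)
    cost = PUNISH_COST * len(punishments[i])
    return COMFORT[temps[i]] - harm - cost

def norm(i, temps):
    """The norm: punish anyone (other than yourself) who didn't set 99."""
    return {j for j, t in temps.items() if j != i and t != 99}

players = [0, 1, 2]

# On the equilibrium path, everyone plays 99, nobody is punished, payoff 0 each:
temps = {i: 99 for i in players}
print([payoff(i, temps, {i: norm(i, temps) for i in players}) for i in players])  # [0.0, 0.0, 0.0]

# Unilateral deviation to the temperature everyone actually prefers is worse, because
# the (off-path) punishment kicks in, so "all 99" is a Nash equilibrium:
temps_dev = {0: 30, 1: 99, 2: 99}
punish_dev = {i: norm(i, temps_dev) for i in players}
print(payoff(0, temps_dev, punish_dev))  # 10 - 2*50 = -90.0

# But after a deviation, punishing is itself costly, so the threat is not credible
# (not subgame-perfect) without the meta-norm of punishing non-punishers:
print(payoff(1, temps_dev, punish_dev))                    # -1.0 (punisher pays the cost)
print(payoff(1, temps_dev, {0: set(), 1: set(), 2: {0}}))  # 0.0 (better off not punishing)
```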
if you have 2 AI’s that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want
That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you—hating me—believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don’t see how this would happen, but that isn’t a particularly strong argument.)
Also, the debaters better be comparably smart.
Yes, this seems like a necessary assumption in a symmetric debate. Once again, this is trivially satisfied if the debaters are copies of each other. It is interesting to note that this assumption might not be sufficient because even if the debate has symmetric rules, the structure of claims might not be. (That is, there is the thing with false claims that are easier to argue for than against, or potentially with attempted human-hacks that are easier to pull off than prevent.)
I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2). (Thanks for mentioning it, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. Also, I am curious about the extent to which other people who find debate promising consider this paper’s answer to (2) as the answer to (2).
For what it’s worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the original AI Safety via Debate paper, follow-up reports by OpenAI, the single- and two-step debate papers, the Anthropic 2023 post, the Khan et al. (2024) paper. Some more LW posts, including mine.) I can of course make some guesses regarding plausible answers to (1) and (2). But most of these papers are primarily concerned with exploring the properties of debates, but not explaining where debate fits in the process of producing an AI (and what problem it aims to address).
Explanation for my strong downvote/disagreement:
Sure, in the ideal world, this post would have much better scholarship.
In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and a 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)
Intuitively, I agree that the vacation question is under-defined / has too many “right” answers. On the other hand, I can also imagine the world where you can develop some objective fun theory, or just something which actually makes the questions well-posed. And the AIs could use this fact in the debate:
Bob: “Actually, you can derive a well-defined fun theory and use it to answer this question. And then Bali clearly wins.”
Alice: “There could never be any such thing!”
Bob: “Actually, there indeed is such a theory, and its central idea is [...].”
[They go on like this for a bit, and eventually, Bob wins.]
Indeed, this seems like a thing you could do (by explaining that integration is a thing) if somebody tried to convince you that there is no principled way to measure the area of a circle.
However—if true—this only shows that there are fewer under-defined questions than we think. The “Ministry of Ambiguity versus the Department of Clarity” fight is still very much a thing, as are the incentives to manipulate the human. And perhaps most importantly, routinely holding debates where the AI “explains to you how to think about something” seems extremely dangerous...
Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn’t make it into the training data of future models.