Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.
VojtaKovarik
Reiterating two points people already pointed out, since they still aren’t fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence on how much weight to assign to the simulator theory, and I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously, when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)
Proposition 1: This is false, and the proof is wrong. For the same reason that you can get an infinite series (of positive numbers) with a finite sum.
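To make the analogy concrete (Proposition 1 itself is not reproduced here, so this only illustrates the general fact being invoked): a geometric series has infinitely many strictly positive terms, yet its partial sums stay bounded.

```latex
\[
  \sum_{n=1}^{\infty} \frac{1}{2^{n}}
  \;=\; \lim_{N \to \infty} \left( 1 - \frac{1}{2^{N}} \right)
  \;=\; 1 .
\]
```

Any argument that infers an unbounded total merely from there being infinitely many positive contributions fails on this example.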
The terminology: I think it is a really bad idea to refer to tokens as “states”, for several reasons. Moreover, these reasons point to fundamental open questions around the simulator framing, and it seems unfortunate to choose terminology which makes these issues confusing/hard to even notice. (Disclaimer: I point out some holes in the simulator framing and suggest improvements. However, I am well aware that all of my suggestions also have holes.)
(1) To the extent that a simulator fully describes some situation that evolves over time, a single token is too small a unit to describe the state of the environment. A single frame of a video (arguably) corresponds to a state. Or perhaps a sentence in a story might (arguably) correspond to a state. But not a single pixel (or patch) and not a single word.
(2) To the extent that a simulator fully describes some situation that evolves over time, there is no straightforward correspondence between the tokens produced so far and the current state of the environment. To give several examples: The process of tossing a coin repeatedly can be represented by a sequence such as “1 0 0 0 1 0 1 …”, where the current state can be identified with the latest token (and you do not want to identify the current state with the whole sequence). The process of me writing the digits of pi on a paper, one per second, can be described as “3 , 1 4 1 …”; here, you need the full sequence to characterize the current state. Or what if I keep writing different numbers, but get bored with them and switch to new ones after a while: “pi = 3 , 1 4 1 Stop, got bored. e = 2 , 7. Stop, got bored. sqrt(2) = …”.
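The coin and pi examples can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the function names and the digit string are hypothetical, and the decimal separator is dropped for simplicity. It shows two histories that end in the same token but continue differently, so the latest token cannot serve as the state.

```python
import random

# Hypothetical toy processes; names and the digit string are illustrative only.
PI_DIGITS = "314159265358979"  # leading digits of pi, decimal separator dropped


def next_coin(history, rng):
    """Next coin toss: does not depend on the history at all."""
    return rng.choice("01")


def next_pi_digit(history):
    """Next digit of pi: determined by how many digits were already written."""
    return PI_DIGITS[len(history)]


# Two histories that end in the same token "1" ...
short_history = list("31")
long_history = list("3141")

# ... yet the process continues differently, so the latest token is not the state.
print(next_pi_digit(short_history))  # 4
print(next_pi_digit(long_history))   # 5
```

For the coin, by contrast, `next_coin` ignores `history` entirely, so nothing beyond the latest toss is needed to describe where the process stands.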
(3) It is misleading/false to describe models like GPT as “describing some situation that evolves over time”. Indeed, fiction books and movies do crazy things like jumping from character to character, flashbacks, etc. Non-fiction books are even weirder (could contain snippets of stories, and then non-story things, etc). You could argue that in order to predict the text of a non-fiction book, GPT is simulating the author of that book. But where does this stop? What if the 2nd half of the book is darker because the author got sacked from his day job and got depressed: are you then simulating the whole world, to predict this thing? If (more advanced) GPT is a simulator in the sense of “evolving situations over time”, then I would like this claim fleshed out in detail on the example of (a) non-fiction books, (b) fiction books, and perhaps (c) movies on TV that include commercial breaks.
(4) But most importantly: To the extent that a simulator describes some situation that evolves over time, it only outputs a small portion of the situation that it is “imagining” internally. (For example, you are telling a story about a princess, and you never mention the colour of her dress, despite the princess in your head having a blue dress.) So it feels like a type-error to refer to the output as “state”. At best, you could call it something like “rendering of a state”.
Arguably, the output (+ the user input) uniquely determines the internal state of the simulator. So you could perhaps identify the output (+ the user input) with “the internal state of the simulator”. But that seems dangerous and likely to cause reasoning errors.
(5) Finally, to make (4) even worse: To the extent that a simulator describes some situation that evolves over time, it is not internally maintaining a single fully fleshed out state that it (probabilistically) evolves over time. Instead, it maintains a set of possible states (macro-state?). And when it generates new responses, it throws out some of the possible states (refines the macro-state?). (For example, in your story about a princess, dress colour is not determined, could be anything. Then somebody asks about the colour, and you need to refine it to blue, which could still mean many different shades of blue.)
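A minimal sketch of the macro-state picture from (5), assuming a toy story world in which a “state” is just a pair of attributes. All names and values below are hypothetical, and this is not a claim about what GPT does internally.

```python
from itertools import product

# Hypothetical toy "story world": every name and value here is illustrative.
DRESS_COLOURS = ["light blue", "navy blue", "red", "green"]
HAIR_COLOURS = ["black", "blond"]

# The macro-state: all fully specified worlds still consistent with the text so far.
macro_state = set(product(DRESS_COLOURS, HAIR_COLOURS))


def refine(states, predicate):
    """Keep only the worlds compatible with a newly stated fact."""
    return {world for world in states if predicate(world)}


# The story never mentioned the dress, so nothing has been ruled out yet.
print(len(macro_state))  # 8

# Someone asks about the colour and the storyteller answers "blue".
macro_state = refine(macro_state, lambda world: "blue" in world[0])
print(sorted(macro_state))
```

Several shades of blue (and both hair colours) remain after the refinement: the answer narrows the macro-state without collapsing it to a single world.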
---
However, even the explanation of what is going on with simulators given in (5) is missing some important pieces. Indeed, it doesn’t explain what happens in cases such as “GPT tells the great story about the princess with blue dress, and suddenly the user jumps in and refers to the dress as red”. At the moment, this is my main reason for scepticism about the simulator framing. As a result, my current view is that “GPT can act as a simulator” (in the sense of Simulators) but it would be “false” to say that “GPT is a simulator” (in the sense of Simulators).
The following issue seems fundamental and related (though I am not sure how exactly :-) ): There is a difference between things ants could physically do and what they are smart enough to do / what we can cheaply enough explain to them. Similarly for humans: delegating takes work. For example, hiring an IQ 80 cleaner might only be worth it for routine tasks, not for “clean up after this large event and just tell me when it’s done, bye”. Similarly, for some reason I am not supervising 10 master students, even if they were all smarter than me.
Oh, I think I agree—if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly—and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
Hm, I think you are right: as written, the claim is false. I think some version of (X), the assumption around your ability to differentially use AI assistants for alignment, will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we would want to talk about (speedup) “how much are AI assistants able to speed up alignment vs capability” and (proliferation prevention) “how much can OpenAI prevent them from proliferating to capabilities research”.[1] And then the corresponding more realistic version of the claims would be that:
either (i’) AI assistants will fundamentally be able to speed up alignment much more than capabilities
or (ii’) the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research
or (iii’) both the potential speedup ratios and the adoption rates of AI assistants will be comparable for alignment and capabilities research, but somehow we will have enough time to solve alignment anyway.
Comments:
Regarding (iii’): It seems that in the worlds where (iii’) holds, you could just as well solve alignment without developing AI assistants.
Regarding (i’): Personally I don’t buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i’) seems rather hard to me.)
Regarding (ii’): As before, this seems implausible based on the track record :-).
- ^
This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.
(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn’t be viewed as negative judgement on the people involved :-).)
Commented in a response to MIRI’s A challenge for AGI organizations, and a challenge for readers, along with other people.
My ~2-hour reaction to the challenge:[1]
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their “alignment plan” makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. So their plan is perhaps not unarguably harmful, but certainly irresponsible.[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI’s leadership.
(III)[3] My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI’s alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI’s capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don’t have any inside knowledge into OpenAI. This assumption seems plausible to me, and very sad.)
(IV) A general comment that I would otherwise need to repeat in essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out to not hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.
Assumptions implied by OpenAI’s plans, with my reactions:
(V) Easy alignment / warning shots for misaligned AGI:
“Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]” My biggest objection to the whole plan already concerns the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don’t need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don’t expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
(VI) “AGI alignment” isn’t “AGI complete”:
This is already acknowledged in the post: “It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.” However, it isn’t exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that “this time, it will be powerful enough to help with alignment”?
(VII) Related assumption: No lethal discontinuities:
The whole post suggests the workflow “new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1”. (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don’t see evidence for “ability to hold off on capabilities research”. What are the organizational procedures allowing this?
(VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn’t mention because I didn’t have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
(IX) Regarding “outer alignment alignment”: Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
It is good to at least acknowledge that there might be other parts of AI alignment than just “figuring out learning from human feedback (& human-feedback augmentation)”. However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
(X) Ability to differentially use capabilities progress towards alignment progress:
The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false: (i) because it seems unlikely they won’t also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
(XI) Creating an aligned AI is sufficient for getting AI to go well:
The plan doesn’t say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton window strategies and (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, (c) no “explicit assumptions & detection system & course-correction-procedure” for “nothing will go wrong if we just do (b)”.
General complaint: The plan is not a plan at all! It’s just a meta-plan.
(XII) Ultimately, I would paraphrase the plan-as-stated as: “We don’t know how to solve alignment. It seems hard. Let’s first build an AI to make us smarter, and then try again.” I think OpenAI should clarify whether this is literally true, or whether there is some idea for what the object-level AI alignment plan looks like, and if so, what it is.
(XIII) For example, the post mentions that “robustness and interpretability research [is important for the plan]”. However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn’t make it any less of an issue!) This means that the plan is not detailed enough.
As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can’t see the step that would fail if X was untrue. This doesn’t say anything good about your proof.
- ^
Eliezer adds: “For this reason, please note explicitly if you’re saying things that you heard from a MIRI person at a gathering, or the like.”
As far as I know, I came up with points (I), (III), and (XII) myself and I don’t remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI’s publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)
- ^
(II) For example, consider the following claim: “We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants.” My reaction: Yes, technically speaking this is true. But likewise (please excuse the jarring analogy) the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn’t it be even better if you personally didn’t plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
- ^
Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).
I think Nate’s original argument holds, but might need a bit of elaboration:
> Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, [both] on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, and [...].
I think there are classes of pivotal plans + their descriptions, such that if the AI gave one of them to us, we could tell whether it is pivotal good or pivotal bad. But also there are other classes of pivotal plans + their descriptions such that you will think you can tell the difference, but you can’t.
So the observation “this seems good and I am super-convinced I could tell if it wasn’t” isn’t, by itself, enough. To trust the AI’s plan, you need some additional argument for why it is the kind of AI that wouldn’t try to deceive you, or why it wouldn’t search through dangerous plans, yada yada. But essentially that means you aren’t relying on the plan verification step anymore.
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn’t really safe.
As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about “officially approved actions” very loud. And run a secondary thread of thinking about “non-approved thoughts” in the background, “quieter” and in a more fragmented manner, killing it off every time it made it into the inner monologue.
Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of “interpretable by default” and into “a race between your interpretability and my cognitive steganography”.
Note that I, or most humans, didn’t have a particular need to develop this exact type of cognitive steganography. (The “elephant in the brain” type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like “give me a weekend to play with this”.) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.
One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven’t figured out good communication practices for the digital age. We don’t have good collective epistemics. And we don’t seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. (“Enough” is meant to stand for “a lot, but within reach of an early AGI”. Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
Re sharp left turn: Maybe I misunderstand the “sharp left turn” term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get “sharp left turn” with a simulator during training—eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.)
One implication I see is that if the simulator architecture becomes frequently used, it might be really hard to tell whether a thing is dangerous or not. For example, it might just behave completely fine with most prompts and catastrophically with some other prompts, and you will never know until you try. (Or unless you do some extra interpretability/other work that doesn’t yet exist.) It would be rather unfortunate if the Vulnerable World Hypothesis was true because of specific LLM prompts :-).
Explanation for my strong downvote/disagreement:
Sure, in the ideal world, this post would have much better scholarship. In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and a 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) “outside of the Overton window, or something”[1].
Critch, preceding post: Strategies involving non-Overton elements are not worth it
Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements
Eliezer, this comment: the “AI immune system” example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn’t be able to name a strategy towards a pivotal outcome, with no non-Overton elements
Extreme version of this: Any practical-in-our-world strategy towards a pivotal outcome necessarily contains some non-Overton elements
- ^
Substitute your better characterization of the undesirable property here. I will just use “non-Overton” for the purposes of this comment.
(Not very sure I understood your description right, but here is my take:)
I think your proposal is not explaining some crucial steps, which are in fact hard. In particular, I understood it as “you have AI which can give you blueprints for nano sized machines”. But I think we already have some blueprints, this isn’t an issue. How we assemble them is an issue.
I expect that there will be more issues like this that you would find if you tried writing the plan in more detail.
However, I share the general sentiment behind your post: I also don’t understand why you can’t get some pivotal act by combining human intelligence with some narrow AI. I expect that Eliezer has tried to come up with such combinations and came away with some general takeaways on this being not realistic. But I haven’t done this exercise, so it seems not obvious to me. Perhaps it would be beneficial if many more people tried doing the exercise and then communicated the takeaways.
I definitely agree that (1) “what society wants” is a useful notion and that it is different from (2) “situations in which what society wants deviates from what would be good for its individuals”. I would just argue that given both the historical and SSC-inspired connotations of “Moloch”, this term should be associated with (2) rather than with (1) :-).
This captures a part of my intuitions about Moloch. But I think some conditions need to be added to make it fit properly:
IMO, an important part of Moloch is that the Moloch-preferred state is one that none of the players is happy with. But this post’s definition doesn’t have any condition like that. For example, multiplying all utilities in a Moloch Game seems to still fit the definition of another Moloch Game. (Another example: take the Prisoner’s Dilemma matrix and change the (D,D) reward to +5, +5. That would still satisfy the definition.)
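To illustrate the second example, here is a small sketch with assumed payoffs: a textbook Prisoner’s Dilemma (3/0/5/1) with the (D,D) entry changed to (5, 5). The numbers and function names are my own, and the post’s formal definition of a Moloch Game is not reproduced or checked here. The sketch only shows that mutual defection is still an equilibrium, while no longer being an outcome the players have reason to be unhappy with.

```python
# Illustrative payoffs only: a textbook Prisoner's Dilemma with (D, D) changed to (5, 5).
# The post's formal definition of a Moloch Game is NOT reproduced or checked here.
ACTIONS = ["C", "D"]
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (5, 5),  # (1, 1) in the textbook version
}


def is_nash(profile):
    """True if neither player gains by unilaterally switching actions."""
    row, col = profile
    u_row, u_col = PAYOFFS[profile]
    row_ok = all(PAYOFFS[(a, col)][0] <= u_row for a in ACTIONS)
    col_ok = all(PAYOFFS[(row, a)][1] <= u_col for a in ACTIONS)
    return row_ok and col_ok


def pareto_dominated(profile):
    """True if another outcome is at least as good for both players and differs."""
    u = PAYOFFS[profile]
    return any(
        v[0] >= u[0] and v[1] >= u[1] and v != u
        for v in PAYOFFS.values()
    )


print(is_nash(("D", "D")))           # True
print(pareto_dominated(("D", "D")))  # False
```

With the original (1, 1) entry, (D,D) would be Pareto-dominated by (C,C); with (5, 5) it is not, which is the kind of case the missing “nobody is happy with it” condition would need to exclude.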
Personally, the author believes that SPI might “add up to normality”—that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I’m a bit confused by this claim. To me it’s a bit unclear what you mean by “adding up to normality”. (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn’t change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don’t fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, …) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it’s still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem to matter substantially for game-theoretic analysis.) Some just aren’t justified at all in your post. (E.g., re A1, you’re saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don’t know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)
I definitely don’t think (C) and the “any” variant of (B). Less sure about the “most” variant of (B), but I wouldn’t bet on that either.
I do believe (D), mostly because I don’t think that humans will be able to make the necessary commitments (in the sense mentioned in the thread with Rohin). I am not super sure about (A). My bet is that to the extent that SPI can work for humans, we are already using it (or something equivalent) in most situations. But perhaps some exceptions will work, like the lawyer example? (Although I suspect that our skill at picking hawkish lawyers is stronger than we realize. Or there might be existing incentives where lawyers are being selected for hawkishness, because we are already using them for something-like-SPI? Overall, I guess that the more one-time-only an event is, the higher is the chance that the pre-existing selection pressures will be weak, and (A) might work.)
Overall I’d have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic).
That is a good point. I will try to expand on it, perhaps at least in a comment here once I have time, or so :-).
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.
Thank you for pointing that out. In all these cases, I actually know that you “stated X”, so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option—once SPI is in use—the counterfactual world where there is only demand armed just seems so different. Wouldn’t history need to go very differently? Perhaps it wouldn’t even be clear what “you” is in that world?)
But I agree that with SDA-AGIs, the second type of commitment becomes more realistic. (Although, the potential line of thinking mentioned by Caspar applies here: Perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)
Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn’t make it into the training data of future models.