EDIT: I wrote this before seeing Paul’s response; hence a significant amount of repetition.
They often seem to emit sentences that are ‘not absurd’, instead of ‘on their mainline’, because they’re mostly trying to generate sentences that pass some shallow checks instead of ‘coming from their complete mental universe.’
Why is this?
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like “in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe”, I’m obviously not claiming that this is a realistic thing that I expect to happen, so it’s not coming from my “complete mental universe”; I’m just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say “maybe X happens”, or “X is not absurd”, I’m saying that my probability distribution assigns non-trivial probability to futures in which X happens. Notably, this is absolutely “coming from my complete mental universe”—the probability distribution is all there is, there’s no extra constraints that take 5% probabilities and drive them down to 0, or whatever else you might imagine would give you a “mainline”.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that? It’s certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over. In that setting “talk about the mainline” sounds like “keep just one hypothesis and talk about what it says”; this is not going to give you good results. Maybe more charitably it’s “one hypothesis is going to stably get >50% probability and so you should think about that hypothesis a lot” but I don’t see why that should be true.
Obviously some things do in fact get > 90% probability; if you ask me questions like “what’s the probability that if it rains the sidewalk will be wet” I will totally have a mainline, and there will be edge cases like “what if the rain stopped at the boundary between the sidewalk and the road” but those will be mostly irrelevant. The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.
I presume that there is an underlying unvoiced argument that goes “Rohin, you say that you have a probability distribution over futures; that implies that you have many, many different consistent worlds in mind, and you are uncertain about which one we’re in, and when you are asked for the probability of X then you sum probabilities across each of the worlds where X holds. This seems wild; it’s such a ridiculously complicated operation for a puny human brain to implement; there’s no way you’re doing this. You’re probably just implementing some simpler heuristic where you look at some simple surface desideratum and go ‘idk, 30%’ out of modesty.”
Obviously I do not literally perform the operation described above; like any bounded agent, I have to approximate the ideal. But I do not then give up and say “okay, I’ll just think about a single consistent world and drop the rest of the distribution”. I do my best to represent the full range of uncertainty: attempting to have all of my probabilities on events ground out in specific worlds that I think are plausible, thinking about some specific worlds in greater detail to see what sorts of correlations arise between different important phenomena, carrying out some consistency checks on the probabilities I assign to events to notice cases where I’m clearly making mistakes, etc. I don’t see why “have a mainline” is obviously a better response to our boundedness than the approach I use (if anything, it seems obviously a worse response).
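A minimal toy sketch of this kind of approximation, with made-up worlds, properties, and weights rather than anyone’s actual models:

```python
# Toy sketch: a handful of concrete worlds with rough weights. P(X) is the
# total weight of worlds where X holds; a cheap consistency check catches
# obvious mistakes. All names and numbers here are invented for illustration.

worlds = [
    (0.30, {"slow_takeoff": True,  "oversight_works": True}),
    (0.25, {"slow_takeoff": True,  "oversight_works": False}),
    (0.25, {"slow_takeoff": False, "oversight_works": False}),
    (0.20, {"slow_takeoff": False, "oversight_works": True}),
]

def prob(event):
    """Probability of an event: summed weight of the worlds where it holds."""
    total = sum(w for w, _ in worlds)
    return sum(w for w, props in worlds if event(props)) / total

p_slow = prob(lambda p: p["slow_takeoff"])
p_both = prob(lambda p: p["slow_takeoff"] and p["oversight_works"])

# Consistency check: a conjunction can never be more probable than a conjunct.
assert p_both <= p_slow
print(p_slow, p_both)  # 0.55 0.3
```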
In response to your last couple paragraphs: the critique, afaict, is not “a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those”, but rather “a common method for representing lots of hypotheses at once is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd).”
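A minimal sketch of that toy model, treating “visualize this combination of properties” as “exhibit one concrete witness that has all of them at once”:

```python
# Toy model from above: the way to catch an incompatible combination of
# properties is to demand one concrete witness that has all of them at once.

def is_even(n):
    return n % 2 == 0

def is_odd(n):
    return n % 2 == 1

def find_witness(properties, candidates=range(100)):
    """Return one concrete example satisfying every property, or None."""
    for n in candidates:
        if all(prop(n) for prop in properties):
            return n
    return None

print(find_witness([is_even]))          # 0    -- even numbers exist
print(find_witness([is_odd]))           # 1    -- odd numbers exist
print(find_witness([is_even, is_odd]))  # None -- nothing is both at once
```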
On my understanding of Eliezer’s picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
Relevant Feynman quote:

I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that?
I’ll try to explain the technique and why it’s useful. I’ll start with a non-probabilistic version of the idea, since it’s a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I’m building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy—think energy conservation, or Newton’s Laws, or market efficiency, depending on what kind of systems we’re talking about. My hope/plan is to derive (i.e. prove) some predictions from these conditions, or maybe prove some of the conditions from others.
Before I go too far down the path of proving things from the conditions, I’d like to do a quick check that my conditions are consistent at all. How can I do that? Well, human brains are quite good at constrained optimization, so one useful technique is to look for one example of a system which satisfies all the conditions. If I can find one example, then I can be confident that the conditions are at least not inconsistent. And in practice, once I have that one example in hand, I can also use it for other purposes: I can usually see what (possibly unexpected) degrees of freedom the conditions leave open, or what (possibly unexpected) degrees of freedom the conditions don’t leave open. By looking at that example, I can get a feel for the “directions” along which the conditions do/don’t “lock in” the properties of the system.
(Note that in practice, we often start with an example to which we want our conditions to apply, and we choose the conditions accordingly. In that case, our one example is built in, although we do need to remember the unfortunately-often-overlooked step of actually checking what degrees of freedom the conditions do/don’t leave open to the example.)
What would a probabilistic version of this look like? Well, we have a world model with some (uncertain) constraints in it—i.e. kinds-of-things-which-tend-to-happen, and kinds-of-things-which-tend-to-not-happen. Then, we look for an example which generally matches the kinds-of-things-which-tend-to-happen. If we can find such an example, then we know that the kinds-of-things-which-tend-to-happen are mutually compatible; a high probability for some of them does not imply a low probability for others. With that example in hand, we can also usually recognize which features of the example are very-nailed-down by the things-which-tend-to-happen, and which features have lots of freedom. We may, for instance, notice that there’s some very-nailed-down property which seems unrealistic in the real world; I expect that to be the most common way for this technique to unearth problems.
That’s the role a “mainline” prediction serves. Note that it does not imply the mainline has a high probability overall, nor does it imply a high probability that all of the things-which-tend-to-happen will necessarily occur simultaneously. It’s checking whether the supposed kinds-of-things-which-tend-to-happen are mutually consistent with each other, and it provides some intuition for what degrees of freedom the kinds-of-things-which-tend-to-happen do/don’t leave open.
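A rough sketch of the probabilistic version, with invented variables and tendencies standing in for a real world model:

```python
import itertools

# Invented variables and "tendencies" purely for illustration. The check is
# whether at least one concrete scenario is compatible with every
# kind-of-thing-which-tends-to-happen at once.

variables = {
    "takeoff":    ["slow", "fast"],
    "oversight":  ["strong", "weak"],
    "deployment": ["cautious", "rushed"],
}

# Each tendency returns True if the scenario fits it.
tendencies = [
    lambda s: not (s["takeoff"] == "fast" and s["oversight"] == "strong"),
    lambda s: not (s["takeoff"] == "fast" and s["deployment"] == "cautious"),
]

def scenarios():
    names = list(variables)
    for values in itertools.product(*(variables[n] for n in names)):
        yield dict(zip(names, values))

witnesses = [s for s in scenarios() if all(t(s) for t in tendencies)]
if witnesses:
    print("mutually compatible, e.g.:", witnesses[0])
    # Which variables are fixed across all witnesses shows which features the
    # tendencies do/don't nail down.
else:
    print("the stated tendencies cannot all hold at once")
```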
Man, I would not call the technique you described “mainline prediction”. It also seems kinda inconsistent with Vaniver’s usage; his writing suggests that a person only has one mainline at a time, which seems odd for this technique.
Vaniver, is this what you meant? If so, my new answer is that I and others do in fact talk about “mainline predictions”—for me, there was that whole section talking about natural language debate as an alignment strategy. (It ended up not being about a plausible world, but that’s because (a) Eliezer wanted enough concreteness that I ended up talking about the stupidly inefficient version rather than the one I’d actually expect in the real world and (b) I was focused on demonstrating an existence proof for the technical properties, rather than also trying to include the social ones.)
To be clear, I do not mean to use the label “mainline prediction” for this whole technique. Mainline prediction tracking is one way of implementing this general technique, and I claim that the usefulness of the general technique is the main reason why mainline predictions are useful to track.
(Also, it matches up quite well with Nate’s model based on his comment here, and I expect it also matches how Eliezer wants to use the technique.)
Ah, got it. I agree that:
The technique you described is in fact very useful
If your probability distribution over futures happens to be such that it has a “mainline prediction”, you get significant benefits from that (similar to the benefits you get from the technique you described).
Man, I would not call the technique you described “mainline prediction”. It also seems kinda inconsistent with Vaniver’s usage; his writing suggests that a person only has one mainline at a time which seems odd for this technique.
Vaniver, is this what you meant?
Uh, I inherited “mainline” from Eliezer’s usage in the dialogue, and am guessing that his reasoning is following a process sort of like mine and John’s. My natural word for it is a ‘particle’, from particle filtering, as linked in various places, which I think is consistent with John’s description. I’m further guessing that Eliezer’s noticed more constraints / implied inconsistencies, and is somewhat better at figuring out which variables to drop, so that his cloud is narrower than mine / more generates ‘mainline predictions’ than ‘probability distributions’.
If so, my new answer is that I and others do in fact talk about “mainline predictions”—for me, there was that whole section talking about natural language debate as an alignment strategy.
Do you feel like you do this ‘sometimes’, or ‘basically always’? Maybe it would be productive for me to reread the dialogue (or at least part of it) and sort sections / comments by how much they feel like they’re coming from this vs. some other source.
As a specific thing that I have in mind, I think there’s a habit of thinking / discourse that philosophy trains, which is having separate senses for “views in consideration” and “what I believe”, and thinking that statements should be considered against all views in consideration, even ones that you don’t believe. This seems pretty good in some respects (if you begin by disbelieving a view incorrectly, your habits nevertheless gather you lots of evidence about it, which can cause you to then correctly believe it), and pretty questionable in other respects (conversations between Alice and Bob now have to include them shadowboxing with everyone else in the broader discourse, as Alice is asking herself “what would Carol say in response to that?” to things that Bob says to her).
When I imagine dialogues generated by people who are both sometimes doing the mainline thing and sometimes doing the ‘represent the whole discourse’ thing, they look pretty different from dialogues generated by people who are both only doing the mainline thing. [And also from dialogues generated by both people only doing the ‘represent the whole discourse’ thing, of course.]
Do you feel like you do this ‘sometimes’, or ‘basically always’?
I don’t know what “this” refers to. If the referent is “have a concrete example in mind”, then I do that frequently but not always. I do it a ton when I’m not very knowledgeable and learning about a thing; I do it less as my mastery of a subject increases. (Examples: when I was initially learning addition, I used the concrete example of holding up three fingers and then counting up two more to compute 3 + 2 = 5, which I do not do any more. When I first learned recursion, I used to explicitly run through an execution trace to ensure my program would work, now I do not.)
If the referent is “make statements that reflect my beliefs”, then it depends on context, but in the context of these dialogues, I’m always doing that. (Whereas when I’m writing for the newsletter, I’m more often trying to represent the whole discourse, though the “opinion” sections are still entirely my beliefs.)
whatever else you might imagine would give you a “mainline”.
As I understand it, when you “talk about the mainline”, you’re supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
I think this is roughly how I’m thinking about things sometimes, tho I’d describe the mainline as the particle with plurality weight (which is a weaker condition than >50%). [I don’t know how Eliezer thinks about things; maybe it’s like this? I’d be interested in hearing his description.]
I think this is also a generator of disagreements about what sort of things are worth betting on; when I imagine why I would bail with “the future is hard to predict”, it’s because the hypotheses/particles I’m considering have clearly defined X, Y, and Z variables (often discretized into bins or ranges) but not clearly defined A, B, and C variables (tho they might have distributions over those variables), because if you also conditioned on those you would have Too Many Particles. And when I imagine trying to contrast particles on features A, B, and C, as they all make weak predictions we get at most a few bits of evidence to update their weights on, whereas when we contrast them on X, Y, and Z we get many more bits, and so it feels more fruitful to reason about.
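A minimal particle-filtering sketch of this usage (toy variables and likelihoods, nothing from anyone’s actual models): the “mainline” is just the particle value carrying plurality weight, and contrasting particles on variables where they make sharp predictions is what yields the most bits.

```python
import random

# Each particle is a concrete hypothesis over a few discretized variables;
# weights are updated on evidence, and the "mainline" is the hypothesis with
# plurality weight. Variables, values, and likelihoods are invented.

random.seed(0)
particles = [
    {"takeoff": random.choice(["slow", "fast"]),
     "compute_trend": random.choice(["steep", "flat"])}
    for _ in range(200)
]
weights = [1.0] * len(particles)

def likelihood(particle, observation):
    """Toy likelihood: sharp predictions give more bits than weak ones."""
    if observation == "compute budgets keep growing":
        return 0.8 if particle["compute_trend"] == "steep" else 0.3
    return 0.5

for obs in ["compute budgets keep growing"]:
    weights = [w * likelihood(p, obs) for p, w in zip(particles, weights)]

# Mainline: the concrete hypothesis carrying the largest share of weight.
totals = {}
for p, w in zip(particles, weights):
    key = tuple(sorted(p.items()))
    totals[key] = totals.get(key, 0.0) + w
mainline = max(totals, key=totals.get)
print(dict(mainline), round(totals[mainline] / sum(totals.values()), 2))
```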
But to the extent this is right, I’m actually quite confused why anyone thinks “talk about the mainline” is an ideal to which to aspire. What makes you expect that? It’s certainly not privileged based on what we know about idealized rationality; idealized rationality tells you to keep a list of hypotheses that you perform Bayesian updating over.
I mean, the question is which direction we want to approach Bayesianism from, given that Bayesianism is impossible (as you point out later in your comment). On the one hand, you could focus on ‘updating’, and have lots of distributions that aren’t grounded in reality but which are easy to massage when new observations come in, and on the other hand, you could focus on ‘hypotheses’, and have as many models of the situation as you can ground, and then have to do something much more complicated when new observations come in.
[Like, a thing I find helpful to think about here is where the motive power from Aumann’s Agreement Theorem comes from: when I say 40% A, you know that my private info is consistent with an update of the shared prior whose posterior is 40%; when you take the shared prior, update on your private info and on my private info being consistent with 40%, and your posterior is 60% A, then I update to 48% A by further conditioning on the fact that your private info is consistent with that update; and so on. Like, we both have to be manipulating functions on the whole shared prior for every update!]
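A toy sketch of that machinery, with made-up numbers (not the 40%/60%/48% in the bracket above): the key move is that each announcement is interpreted by conditioning the shared prior on the set of private signals consistent with it.

```python
# Made-up shared prior over (world, alice_signal, bob_signal). Each agent
# privately sees only their own signal; "A" is the event world == True.
prior = {
    (True,  "a1", "b1"): 0.10,
    (True,  "a1", "b2"): 0.20,
    (True,  "a2", "b1"): 0.05,
    (True,  "a2", "b2"): 0.15,
    (False, "a1", "b1"): 0.20,
    (False, "a1", "b2"): 0.05,
    (False, "a2", "b1"): 0.15,
    (False, "a2", "b2"): 0.10,
}

def posterior(alice_signals, bob_signals):
    """P(A | Alice's signal is in alice_signals, Bob's is in bob_signals)."""
    total = sum(p for (w, sa, sb), p in prior.items()
                if sa in alice_signals and sb in bob_signals)
    hit = sum(p for (w, sa, sb), p in prior.items()
              if w and sa in alice_signals and sb in bob_signals)
    return hit / total

# Alice sees "a1" and announces her posterior for A.
alice_post = posterior({"a1"}, {"b1", "b2"})

# Bob (who saw "b1") doesn't average the two numbers; he conditions the shared
# prior on the set of Alice-signals that would have produced that announcement.
consistent = {sa for sa in ("a1", "a2")
              if abs(posterior({sa}, {"b1", "b2"}) - alice_post) < 1e-9}
bob_post = posterior(consistent, {"b1"})

print(round(alice_post, 3), round(bob_post, 3))  # 0.545 0.333
```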
For what it’s worth, I think both styles are pretty useful in the appropriate context. [I am moderately confident this is a situation where it’s worth doing the ‘grounded-in-reality’ particle-filtering approach, i.e. hitting the ‘be concrete’ and ‘be specific’ buttons over and over, and then once you’ve built out one hypothesis doing it again with new samples.]
The thing that I am confused by is the notion that you should always have a mainline, especially about something as complicated and uncertain as the future.
I don’t think I believe the ‘should always have a mainline’ thing, but I do think I want to defend the weaker claim of “it’s worth having a mainline about this.” Like, I think if you’re starting a startup, it’s really helpful to have a ‘mainline plan’ wherein the whole thing actually works, even if you ascribe basically no probability to it going ‘exactly to plan’. Plans are useless, planning is indispensable.
[Also I think it’s neat that there’s a symmetry here about complaining about the uncertainty of the future, which makes sense if we’re both trying to hold onto different pieces of Bayesianism while looking at the same problem.]
If you define “mainline” as “particle with plurality weight”, then I think I was in fact “talking on my mainline” at some points during the conversation, and basically everywhere that I was talking about worlds (instead of specific technical points or intuition pumps) I was talking about “one of my top 10 particles”.
I think I responded to every request for concreteness with a fairly concrete answer. Feel free to ask me for more concreteness in any particular story I told during the conversation.
I’m just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
Huh, I guess I don’t believe the intuition pump? Like, as the first counterexample that comes to mind, when I imagine having an AGI where I can tell everything about how it’s thinking, and yet I remain a black box to myself, I can’t really tell whether or not it’s aligned to me. (Is me-now the one that I want it to be aligned to, or me-across-time? Which side of my internal conflicts about A vs. B / which principle for resolving such conflicts?)
I can of course imagine a reasonable response to that from you—”ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment”—but I think I have some more-obviously-alignment-related counterexamples. [Tho if by ‘infinite oversight ability’ you do mean something like ‘logical omniscience’ it does become pretty difficult to find a real counterexample, in part because I can just find the future trajectory with highest expected utility and take the action I take at the start of that trajectory without having to have any sort of understanding about why that action was predictably a good idea.]
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
(This is a long response to a short section because I think the disagreement here is about something like “how should we reason and communicate about intuitions?”, and so it’s worth expanding on what I think might be the implications of otherwise minor disagreements.)
I can of course imagine a reasonable response to that from you—”ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment”
That is in fact my response. (Though one of the ways in which the intuition pump isn’t fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can’t correctly predict the consequences of running that program for a long time. Still feels like they’d do fine.)
I do agree that if you go as far as “logical omniscience” then there are “cheating” ways of solving the problem that don’t really tell us much about how hard alignment is in practice.
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
The car analogy just doesn’t seem sensible. I can tell stories of car doom even if you have infinitely good engines (e.g. the steering breaks). My point is that we struggle to tell stories of doom when imagining a very powerful oversight process that knows everything the model knows.
I’m not thinking “more oversight quality --> more alignment” and then concluding “infinite oversight quality --> alignment solved”. I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”. So I don’t think this has much to do with extrapolating tangents vs. production functions, except inasmuch as production functions encourage you to think about complements to your inputs that you can then posit don’t exist in order to tell a story of doom.
I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”.
I think some of my more alignment-flavored counterexamples look like:
The ‘reengineer it to be safe’ step breaks down / isn’t implemented thru oversight. Like, if we’re positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it’s probably not going to be competitive!
The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to, but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
Related to the previous point, the oversight is sufficient to reveal features about the plan that are terrible, but before the ‘reengineer to make it more safe’ plan is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is ‘good enough’, for them at least.
That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don’t proceed on EV of known information, but actually do the ‘due diligence’, and 3) sufficient coordination among humans that we don’t leave behind substantial swaths of current human preferences, and I don’t see how we get those thru having arbitrary transparency.
[I also would like to solve the problem of “AI has good outcomes” instead of the smaller problem of “AI isn’t out to get us”, because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]
I obviously do not think this is at all competitive, and I also wanted to ignore the “other people steal your code” case. I am confused what you think I was trying to do with that intuition pump.
I guess I said “powerful oversight would solve alignment” which could be construed to mean that powerful oversight ⇒ great future, in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems?
Maybe your point is that there are lots of things required for a good future, just as a car needs both steering and an engine, and so the intuition pump is not interesting because it doesn’t talk about all the things needed for a good future? If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
I am confused what you think I was trying to do with that intuition pump.
I think I’m confused about the intuition pump too! Like, here’s some options I thought up:
The ‘alignment problem’ is really the ‘not enough oversight’ problem. [But then if we solve the ‘enough oversight’ problem, we still have to solve the ‘what we want’ problem, the ‘coordination’ problem, the ‘construct competitively’ problem, etc.]
Bits of the alignment problem can be traded off against each other, most obviously coordination and ‘alignment tax’ (i.e. the additional amount of work you need to do to make a system aligned, or the opposite of ‘competitiveness’, which I didn’t want to use here for ease-of-understanding-by-newbies reasons.) [But it’s basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding but I think this is basically not true because you’re also optimizing for finding holes in your transparency regime.]
Like, by analogy, I could imagine someone who uses an intuition pump of “if you had sufficient money, you could solve any problem”, but I wouldn’t use that intuition pump because I don’t believe it. [Sure, ‘by definition’ if the amount of money doesn’t solve the problem, it’s not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy?]
(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)
in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems
I both 1) didn’t think it was obvious (sorry if I’m being slow on following the change in usage of ‘alignment’ here) and 2) don’t think realistically powerful oversight solves either of those two on its own (outer alignment because of “rejection sampling can get you siren worlds” problem, inner alignment because “rejection sampling isn’t competitive”, but I find that one not very compelling and suspect I’ll eventually develop a better objection).
[EDIT: I note that I also might be doing another unfavorable assumption here, where I’m assuming “unlimited oversight capacity” is something like “perfect transparency”, and so we might not choose to spend all of our oversight capacity, but you might be including things here like “actually it takes no time to understand what the model is doing” or “the oversight capacity is of humans too,” which I think weakens the outer alignment objection pretty substantially.]
If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
Cool! I’m glad we agree on that, and will try to do more “did you mean limited statement X that we more agree about?” in the future.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
It came from where we decided to look. While I think it’s possible you can have an AI out to deceive us, by putting information we want to see where we’re going to look and information we don’t want to see where we’re not going to look, I think this is going to happen by default because the human operators will have a smaller checklist than they should have: “Will the AI cure cancer? Yes? Cool, press the button.” instead of “Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let’s take a look at that.”
Like, this is the sort of normal software development story where bugs that cause the system to visibly not work get noticed and fixed, and bugs that cause the system to do things that the programmers don’t intend only get noticed if the programmers anticipated it and wrote a test for it, or a user discovered it in action and reported it to the programmers, or an adversary discovered that it was possible by reading the code / experimenting with the system and deliberately caused it to happen.
I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn’t make sense outside of that context.
(The mentality is “it doesn’t matter what oversight process you use, there’s always a malicious superintelligence that can game it, therefore everyone dies”.)