Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)
So8res (Nate Soares)
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
On my understanding of Eliezer’s picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
The way the type corresponds loosely to the "type of agency" (if you kinda squint at the arrow symbol and play fast-and-loose) is that it suggests a machine that eats a description of how actions lead to outcomes, and produces from that description an action.
Consider stating an alignment property for elements of this type. What sort of thing must it say?
Perhaps you wish to say "when the optimizer is fed the actual description of the world, it selects the best possible action". Congratulations, such a function in fact exists; it is called argmax. This does not help you.
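(For concreteness, here is a minimal sketch of that type in Haskell, with my own illustrative names, and with the "description" simplified to a map from actions straight to utilities; none of this is from the original discussion, it's just meant to pin down the picture.)

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- The "type of agency" reading: a machine that eats a description of how
-- actions lead to (the utilities of) outcomes, and spits back an action.
type Description action = action -> Double
type Optimizer action = Description action -> action

-- Over any finite, non-empty action set, this type is trivially inhabited by
-- argmax, which is exactly why "such a function exists, it is called argmax"
-- buys you nothing on its own.
argmaxOver :: [action] -> Optimizer action
argmaxOver actions desc = maximumBy (comparing desc) actions
```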
Perhaps you instead wish to say "when the optimizer is fed the actual description of the world, it selects an action that gets at least 0.5 utility, after consuming only 10^15 units of compute" or whatever. Now, set aside the fact that you won't find such a function with your theorem-prover AI before somebody else has ended the world (understanding intelligence well enough to build one that you can prove that theorem about is pro'lly harder than whatever else people are deploying AGIs towards), and set aside also the fact that you're leaving a lot of utility on the table; even if that worked, you're still screwed.
Why are you still screwed? Because the resulting function has the property “if we feed in a correct description of which actions have which utilities, then the optimizer selects an action with high utility”. But an enormous chunk of the hard work is in the creation of that description!
For one thing, while our world may have a simple mathematical description (a la "it's some quantum fields doing some quantum fielding"), we don't yet have the true name of our universe. For another thing, even if we did, the level of description that an optimizer works with likely needs to be much coarser than this. For a third thing, even if we had a good coarse-grain description of the world, calculating the consequences that follow from a given action is hard. For a fourth thing, evaluating the goodness of the resulting outcome is hard.
If you can do all those things, then congrats!, you’ve solved alignment (and a good chunk of capabilities). All that’s left is the thing that can operate your description and search through it for high-ranked actions (a remaining capabilities problem).
This isn’t intended to be an argument that there does not exist any logical sentence such that a proof of it would save our skins. I’m trying to say something more like: by the time you can write down the sorts of sentences people usually seem to hope for, you’ve probably solved alignment, and can describe how to build an aligned cognitive system directly, without needing to postulate the indirection where you train up some other system to prove your theorem.
For this reason, I have little hope in sentences of the form “here is an aligned AGI”, on account of how once you can say “aligned” in math, you’re mostly done and probably don’t need the intermediate. Maybe there’s some separate, much simpler theorem that we could prove and save our skins—I doubt we’ll find one, but maybe there’s some simple mathematical question at the heart of some pivotal action, such that a proof one way or the other would suddenly allow humans to… <??? something pivotal, I don’t know, I don’t expect such a thing, don’t ask me>. But nobody’s come up with one that I’ve heard of. And nobody seems close. And nobody even seems to be really trying all that hard. Like, you don’t hear of people talking about their compelling theory of why a given mathematical conjecture is all that stands between humans and <???>, and them banging out the details of their formalization which they expect to only take one more year. Which is, y’know, what it would sound like if they were going to succeed at banging their thing out in five years, and have the pivotal act happen in 15. So, I’m not holding my breath.
(“near-zero” is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing “reduce x-risk to near-zero” with “reduce x-risk to sub-50%”.)
I’d be happy to read it if you’re so inclined and think the prompt would help you refine your own thoughts, but yeah, my anticipation is that it would mostly be updating my (already decent) probability that IB physicalism is a reasonable guess.
A few words on the sort of thing that would update me, in hopes of making it slightly more legible sooner rather than later/never: there’s a difference between giving the correct answer to metaethics (“‘goodness’ refers to an objective (but complicated, and not objectively compelling) logical fact, which was physically shadowed by brains on account of the specifics of natural selection and the ancestral environment”), and the sort of argumentation that, like, walks someone from their confused state to the right answer (eg, Eliezer’s metaethics sequence). Like, the confused person is still in a state of “it seems to me that either morality must be objectively compelling, or nothing truly matters”, and telling them your favorite theory isn’t really engaging with their intuitions. Demonstrating that your favorite theory can give consistent answers to all their questions is something, it’s evidence that you have at least produced a plausible guess. But from their confused perspective, lots of people (including the nihilists, including the Bible-based moral realists) can confidently provide answers that seem superficially consistent.
The compelling thing, at least to me and my ilk, is the demonstration of mastery and the ability to build a path from the starting intuitions to the conclusion. In the case of a person confused about metaethics, this might correspond to the ability to deconstruct the “morality must be objectively compelling, or nothing truly matters” intuition, right in front of them, such that they can recognize all the pieces inside themselves, and with a flash of clarity see the knot they were tying themselves into. At which point you can help them untie the knot, and tug on the strings, and slowly work your way up to the answer.
(The metaethics sequence is, notably, a tad longer than the answer itself.)
(If I were to write this whole concept of solutions-vs-answers up properly, I’d attempt some dialogs that make the above more concrete and less metaphorical, but \shrug.)
In the case of IB physicalism (and IB more generally), I can see how it's providing enough consistent answers that it counts as a plausible guess. But I don't see how to operate it to resolve my pre-existing confusions. Like, we work with (infra)measures over Γ×Φ, and we say some fancy words about how Γ is our "beliefs about the computations", but as far as I've been able to make out this is just a neato formalism; I don't know how to get to that endpoint by, like, starting from my own messy intuitions about when/whether/how physical processes reflect some logical procedure. I don't know how to, like, look inside myself, and find confusions like "does logic or physics come first?" or "do I switch which algorithm I'm instantiating when I drink alcohol?", and disassemble them into their component parts, and gain new distinctions that show me how the apparent conflicts weren't true conflicts and all my previous intuitions were coming at things from slightly the wrong angle, and then shift angles and have a bunch of things click into place, and realize that the seeds of the answer were inside me all along, and that the answer is clearly that the universe isn't really just a physical arrangement of particles (or a wavefunction thereon, w/e), but one of those plus a mapping from syntax-trees to bits (here taking Γ to be that mapping). Or whatever the philosophy corresponding to "a hypothesis is an (infra)measure over Γ×Φ" is supposed to be. Like, I understand that it's a neat formalism that does cool math things, and I see how it can be operated to produce consistent answers to various philosophical questions, but that's a long shot from seeing it solve the philosophical problems at hand. Or, to say it another way, answering my confusion-handles consistently is not nearly enough to get me to take a theory philosophically seriously; like, it's not enough to convince me that the universe actually has an assignment of syntax-trees to bits in addition to the physical state, which is what it looks to me like I'd need to believe if I actually took IB physicalism seriously.
This is my view as well,
(I, in fact, lifted it off of you, a number of years ago :-p)
but you still need to handle the dependence on subjective uncertainty.
Of course. (And noting that I am, perhaps, more openly confused about how to handle the subjective uncertainty than you are, given my confusions around things like logical uncertainty and whether difficult-to-normalize arithmetical expressions meaningfully denote numbers.)
Running through your examples:
It’s unclear whether we can have an extraordinarily long-lived civilization …
I agree. Separately, I note that I doubt total Fun is linear in how much compute is available to civilization; continuity with the past & satisfactory completion of narrative arcs started in the past is worth something, from which we deduce that wiping out civilization and replacing it with another different civilization of similar flourishing and with 2x as much space to flourish in, is not 2x as good as leaving the original civilization alone. But I'm basically like "yep, whether we can get reversibly-computed Fun chugging away through the high-entropy phase of the universe seems like an empirical question with cosmically large swings in utility associated therewith."
But nearly-reversible civilizations can also have exponential returns to the resources they are able to acquire during the messy phase of the universe.
This seems fairly plausible to me! For instance, my best guess is that you can get more than 2x the Fun by computing two people interacting than by computing two individuals separately. (Although my best guess is also that this effect diminishes at scale, \shrug.)
By my lights, it sure would be nice to have more clarity on this stuff before needing to decide how much to rush our expansion. (Although, like, 1st world problems.)
But also it seems quite plausible that our universe is already even-more-exponentially spatially vast, and we merely can’t reach parts of it
Sure, this is pretty plausible, but (arguendo) it shouldn’t really be factoring into our action analysis, b/c of the part where we can’t reach it. \shrug
Perhaps rather than having a single set of physical constants, our universe runs every possible set.
Sure. And again (arguendo) this doesn’t much matter to us b/c the others are beyond our sphere of influence.
Why not all of the above? What if the universe is vast and it allows for very long lived civilization? And once we bite any of those bullets to grant 10^100 more people, then it starts to seem like even less of a further ask to assume that there were actually 10^1000 more people instead
I think this is where I get off the train (at least insofar as I entertain unbounded-utility hypotheses). Like, our ability to reversibly compute in the high-entropy regime is bounded by our error-correction capabilities, and we really start needing to upend modern physics as I understand it to make the numbers really huge. (Like, maybe 10^1000 is fine, but it’s gonna fall off a cliff at some point.)
I have a sense that I’m missing some deeper point you’re trying to make.
I also have a sense that… how to say… like, suppose someone argued “well, you don’t have 1/∞ probability that “infinite utility” makes sense, so clearly you’ve got to take infinite utilities seriously”. My response would be something like “That seems mixed up to me. Like, on my current understanding, “infinite utility” is meaningless, it’s a confusion, and I just operate day-to-day without worrying about it. It’s not so much that my operating model assigns probability 0 to the proposition “infinite utilities are meaningful”, as that infinite utilities simply don’t fit into my operating model, they don’t make sense, they don’t typecheck. And separately, I’m not yet philosophically mature, and I can give you various meta-probabilities about what sorts of things will and won’t typecheck in my operating model tomorrow. And sure, I’m not 100% certain that we’ll never find a way to rescue the idea of infinite utilities. But that meta-uncertainty doesn’t bleed over into my operating model, and I’m not supposed to ram infinities into a place where they don’t fit just b/c I might modify the type signatures tomorrow.”
When you bandy around plausible ways that the universe could be real large, it doesn't look obviously divergent to me. Some of the bullets you're handling are ones that I am just happy to bite, and others involve stuff that I'm not sure I'm even going to think will typecheck, once I understand wtf is going on. Like, just as I'm not compelled by "but you have more than 0% probability that 'infinite utility' is meaningful" (b/c it's mixing up the operating model and my philosophical immaturity), I'm not compelled by "but your operating model, which says that X, Y, and Z all typecheck, is badly divergent". Yeah, sure, and maybe the resolution is that utilities are bounded, or maybe it's that my operating model is too permissive on account of my philosophical immaturity. Philosophical immaturity can lead to an operating model that's too permissive (cf. zombie arguments) just as easily as one that's too strict.
Like… the nature of physical law keeps seeming to play games like "You have continua!! But you can't do an arithmetic encoding. There's infinite space!! But most of it is unreachable. Time goes on forever!! But most of it is high-entropy. You can do reversible computing to have Fun in a high-entropy universe!! But error accumulates." And this could totally be a hint about how things that are real can't help but avoid the truly large numbers (never mind the infinities), or something, I don't know, I'm philosophically immature. But from my state of philosophical immaturity, it looks like this could totally still resolve in a "you were thinking about it wrong; the worst enhugening assumptions fail somehow to typecheck" sort of way.
Trying to figure out the point that you’re making that I’m missing, it sounds like you’re trying to say something like “Everyday reasoning at merely-cosmic scales already diverges, even without too much weird stuff. We already need to bound our utilities, when we shift from looking at the milk in the supermarket to looking at the stars in the sky (nevermind the rest of the mathematical multiverse, if there is such a thing).” Is that about right?
If so, I indeed do not yet buy it. Perhaps spell it out in more detail, for someone who’s suspicious of any appeals to large swaths of terrain that we can’t affect (eg, variants of this universe w/ sufficiently different cosmological constants, at least in the regions where the locals aren’t thinking about us-in-particular); someone who buys reversible computing but is going to get suspicious when you try to drive the error rate to shockingly low lows?
To be clear, insofar as modern cosmic-scale reasoning diverges (without bringing in considerations that I consider suspicious and that I suspect I might later think belong in the ‘probably not meaningful (in the relevant way)’ bin), I do start to feel the vice grips on me, and I expect I’d give bounded utilities another look if I got there.
Those & others. I flailed towards a bunch of others in my thread w/ Paul. Throwing out some taglines:
“does logic or physics come first???”
“does it even make sense to think of outcomes as being mathematical universes???”
“should I even be willing to admit that the expression “3^^^3” denotes a number before taking time proportional to at least log(3^^^3) to normalize it?”
“is the thing I care about more like which-computations-physics-instantiates, or more like the-results-of-various-computations??? is there even a difference?”
“how does the fact that larger quantum amplitudes correspond to more magical happening-ness relate to the question of how much more I should care about a simulation running on a computer with wires that are twice as thick???”
Note that these aren’t supposed to be particularly well-formed questions. (They’re more like handles for my own confusions.)
Note that I’m open to the hypothesis that you can resolve some but not others. From my own state of confusion, I’m not sure which issues are interwoven, and it’s plausible to me that you, from a state of greater clarity, can see independences that I cannot.
Note that I’m not asking for you to show me how IB physicalism chooses a consistent set of answers to some formal interpretations of my confusion-handles. That’s the sort of (non-trivial and virtuous!) feat that causes me to rate IB physicalism as a “plausible guess”.
In the specific case of IB physicalism, I’m like “maaaybe? I don’t yet see how to relate this Γ that you suggestively refer to as a ‘map from programs to results’ to a philosophical stance on computation and instantiation that I understand” and “I’m still not sold on the idea of handling non-realizability with inframeasures (on account of how I still feel confused about a bunch of things that inframeasures seem like a plausible guess for how to solve)” and etc.
Maybe at some point I’ll write more about the difference, in my accounting, between plausible guesses and solutions.
I am definitely entertaining the hypothesis that the solution to naturalism/anthropics is in no way related to unbounded utilities. (From my perspective, IB physicalism looks like a guess that shows how this could be so, rather than something I know to be a solution, ofc. (And as I said to Paul, the observation that would update me in favor of it would be demonstrated mastery of, and unravelling of, my own related confusions.))
Ok, cool, I think I see where you’re coming from now.
I don’t think this is unlisted though …
Fair! To a large degree, I was just being daft. Thanks for the clarification.
It seems to me that our actual situation (i.e. my actual subjective distribution over possible worlds) is divergent in the same way as the St Petersburg lottery, at least with respect to quantities like expected # of happy people.
I think this is a good point, and I hadn’t had this thought quite this explicitly myself, and it shifts me a little. (Thanks!)
(I’m not terribly sold on this point myself, but I agree that it’s a crux of the matter, and I’m sympathetic.)
But at that point it seems much more likely that preferences just aren’t defined over probability distributions at all
This might be where we part ways? I'm not sure. A bunch of my guesses do kinda look like things you might describe as "preferences not being defined over probability distributions" (eg, "utility is a number, not a function"). But simultaneously, I feel solid in my ability to use probability distributions and utility functions in day-to-day reasoning problems after I've chunked the world into a small finite number of possible actions and corresponding outcomes, and I can see a bunch of reasons why this is a good way to reason, and whatever the better preference-formalism turns out to be, I expect it to act a lot like probability distributions and utility functions in the "local" situation after the reasoner has chunked the world.
Like, when someone comes to me and says “your small finite considerations in terms of actions and outcomes are super simplified, and everything goes nutso when we remove all the simplifications and take things to infinity, but don’t worry, sanity can be recovered so long as you (eg) care less about each individual life in a big universe than in a small universe”, then my response is “ok, well, maybe you removed the simplifications in the wrong way? or maybe you took limits in a bad way? or maybe utility is in fact bounded? or maybe this whole notion of big vs small universes was misguided?”
It looks to me like you’re arguing that one should either accept bounded utilities, or reject the probability/utility factorization in normal circumstances, whereas to me it looks like there’s still a whole lot of flex (ex: ‘outcomes’ like “I come back from the store with milk” and “I come back from the store empty-handed” shouldn’t have been treated the same way as ‘outcomes’ like “Tegmark 3 multiverse branch A, which looks like B” and “Conway’s game of life with initial conditions X, which looks like Y”, and something was going wrong in our generalization from the everyday to the metaphysical, and we shouldn’t have been identifying outcomes with universes and expecting preferences to be a function of probability distributions on those universes, but thinking of “returning with milk” as an outcome is still fine).
And maybe you’d say that this is just conceding your point? That when we pass from everyday reasoning about questions like “is there milk at the store, or not?” to metaphysical reasoning like “Conway’s Life, or Tegmark 3?”, we should either give up on unbounded utilities, or give up on thinking of preferences as defined on probability distributions on outcomes? I more-or-less buy that phrasing, with the caveat that I am open to the weak-point being this whole idea that metaphysical universes are outcomes and that probabilities on outcome-collections that large are reasonable objects (rather than the weakpoint being the probablity/utility factorization per se).
it seems odd to hold onto probability distributions as the object of preferences while restricting the space of probability distributions far enough that they appear to exclude our current situation
I agree that would be odd.
One response I have is similar to the above: I’m comfortable using probability distributions for stuff like “does the store have milk or not?” and less comfortable using them for stuff like “Conway’s Life or Tegmark 3?”, and wouldn’t be surprised if thinking of mathematical universes as “outcomes” was a Bad Plan and that this (or some other such philosophically fraught assumption) was the source of the madness.
Also, to say a bit more on why I’m not sold that the current situation is divergent in the St. Petersburg way wrt, eg, amount of Fun: if I imagine someone in Vegas offering me a St. Petersburg gamble, I imagine thinking through it and being like “nah, you’d run out of money too soon for this to be sufficiently high EV”. If you’re like “ok, but imagine that the world actually did look like it could run the gamble infinitely”, my gut sense is “wow, that seems real sus”. Maybe the source of the susness is that eventually it’s just not possible to get twice as much Fun. Or maybe it’s that nobody anywhere is ever in a physical position to reliably double the amount of Fun in the region that they’re able to affect. Or something.
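(A toy calculation to illustrate the "you'd run out of money" intuition, my own sketch rather than anything Paul committed to: if the counterparty's bankroll caps the payout, the expected value stops diverging and collapses to roughly log2 of the cap.)

```haskell
-- Toy St. Petersburg gamble: payout 2^k with probability 2^-k, except the
-- counterparty can pay out at most `cap`. Uncapped, the EV diverges; capped,
-- it is only about log2(cap) + 1.
evCapped :: Double -> Double
evCapped cap = sum [ min (2 ** k) cap * 2 ** (-k) | k <- [1 .. 200 :: Double] ]

-- evCapped 1e9  ~ 31
-- evCapped 1e18 ~ 61
-- Even an absurdly rich casino leaves the gamble with quite modest EV.
```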
And, I’m sympathetic to the objection “well, you surely shouldn’t assign probability less than <some vanishingly small but nonzero number> that you’re in such a situation!”. And maybe that’s true; it’s definitely on my list of guesses. But I don’t by any means feel forced into that corner. Like, maybe it turns out that the lightspeed limit in our universe is a hint about what sort of universes can be real at all (whatever the heck that turns out to mean), and an agent can’t ever face a St. Petersburgish choice in some suitably general way. Or something. I’m more trying to gesture at how wide the space of possibilities seems to me from my state of confusion, than to make specific counterproposals that I think are competitive.
(And again, I note that the reason I’m not updating (more) towards your apparently-narrower stance, is that I’m uncertain about whether you see a narrower space of possible resolutions on account of being less confused than I am, vs because you are making premature philosophical commitments.)
To be clear, I agree that you need to do something weirder than “outcomes are mathematical universes, preferences are defined on (probability distributions over) those” if you’re going to use unbounded utilities. And again, I note that “utility is bounded” is reasonably high on my list of guesses. But I’m just not all that enthusiastic about “outcomes are mathematical universes” in the first place, so \shrug.
The fact that B can never come about in reality doesn’t really change the situation, you still would have expected consistently-correct intuitions to yield consistent answers.
I think I understand what you’re saying about thought experiments, now. In my own tongue: even if you’ve convinced yourself that you can’t face a St. Petersburg gamble in real life, it still seems like St. Petersburg gambles form a perfectly lawful thought experiment, and it’s at least suspicious if your reasoning procedures would break down facing a perfectly lawful scenario (regardless of whether you happen to face it in fact).
I basically agree with this, and note that, insofar as my confusions resolve in the “unbounded utilities” direction, I expect some sort of account of metaphysical/anthropic/whatever reasoning that reveals St. Petersburg gambles (and suchlike) to be somehow ill-conceived or ill-typed. Like, in that world, what’s supposed to happen when someone is like “but imagine you’re offered a St. Petersburg bet” is roughly the same as what’s supposed to happen when someone’s like “but imagine a physically identical copy of you that lacks qualia”—you’re supposed to say “no”, and then be able to explain why.
(Or, well, you’re always supposed to say “no” to the gamble and be able to explain why, but what’s up for grabs is whether the “why” is “because utility is bounded”, or some other thing, where I at least am confused enough to still have some of my chips on “some other thing”.)
To be explicit, the way that my story continues to shift in response to what you’re saying, is an indication of continued updating & refinement of my position. Yay; thanks.
My point was that this doesn’t seem consistent with anything like a leverage penalty.
I'm not particularly enthusiastic about "artificial leverage penalties" that manually penalize the hypothesis that you can get 3^^^3 happy people by a factor of 1/3^^^3 (and so insofar as that's what you're saying, I agree).
From my end, the core of my objection feels more like “you have an extra implicit assumption that lotteries are closed under countable combination, and I’m not sold on that.” The part where I go “and maybe some sufficiently naturalistic prior ends up thinking long St. Petersburg games are ultimately less likely than they are simple???” feels to me more like a parenthetical, and a wild guess about how the weakpoint in your argument could resolve.
(My guess is that you mean something more narrow and specific by “leverage penalty” than I did, and that me using those words caused confusion. I’m happy to retreat to a broader term, that includes things like “big gambles just turn out not to unbalance naturalistic reasoning when you’re doing it properly (eg. b/c finding-yourself-in-the-universe correctly handles this sort of thing somehow)”, if you have one.)
(My guess is that part of the difference in framing in the above paragraphs, and in my original comment, is due to me updating in response to your comments, and retreating my position a bit. Thanks for the points that caused me to update somewhat!)
My point was that we can say lots about which actions are more or less likely to generate 3^^^3 utility even without knowing how the universe got so large.
I agree.
In terms of preferences, I’m just saying...
This seems like a fine guess to me. I don't feel sold on it, but that could ofc be because you've resolved confusions that I have not. (The sort of thing that would persuade me would be you demonstrating at least as much mastery of my own confusions as I possess, and then walking me through the resolution. (Which I say for the purpose of being upfront about why I have not yet updated in favor of this view. In particular, it's not a request. I'd be happy for more thoughts on it if they're cheap and you find generating them to be fun, but don't think this is terribly high-priority.))
That means I care about each individual life less if it happens in a big universe.
I indeed find this counter-intuitive. Hooray for flatly asserting things I might find counter-intuitive!
Let me know if you want me to flail in the direction of confusions that stand between me and what I understand to be your view. The super short version is something like “man, I’m not even sure whether logic or physics comes first, so I get off that train waaay before we get to the Tegmark IV logical multiverse”.
(Also, to be clear, I don’t find UDASSA particularly compelling, mainly b/c of how confused I remain in light of it. Which I note in case you were thinking that the inferential gap you need to span stretches only to UDASSA-town.)
Do you have preferences over the possible outcomes of thought experiments? Does it feel intuitively like they should satisfy dominance principles? If so, it seems like it’s just as troubling that there are thought experiments.
You’ve lost me somewhere. Maybe try backing up a step or two? Why are we talking about thought experiments?
One of my best explicit hypotheses for what you're saying is "it's one thing to deny closure of epistemic probabilities under countable weighted combination in real life, and another to deny them in thought experiments; are you not concerned that denying them in thought experiments is troubling?", but this doesn't seem like a very likely argument for you to be making, and so I mostly suspect I've lost the thread.
(I stress again that, from my perspective, the heart of my objection is your implicit assumption that lotteries are closed under countable combination. If you're trying to object to some other thing I said about leverage penalties, my guess is that I miscommunicated my position (perhaps due to a poor choice of words) or shifted my position in response to your previous comments, and that our arguments are now desynched.)
Backing up to check whether I’m just missing something obvious, and trying to sharpen my current objection:
It seems to me that your argument contains a fourth, unlisted assumption, which is that lotteries are closed under countable combination. Do you agree? Am I being daft and missing that, like, some basic weak dominance assumption implies closure of lotteries under countable combination? Assuming I’m not being daft, do you agree that your argument sure seems to leave the door open for people who buy antisymmetry, dominance, and unbounded utilities, but reject countable combination of lotteries?
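(For concreteness, the closure property I have in mind is the standard one; this is my formalization of the assumption, not a quote of Paul's: for any lotteries $L_1, L_2, L_3, \ldots$ and any weights $p_i \ge 0$ with $\sum_{i=1}^{\infty} p_i = 1$, the countable mixture $\sum_{i=1}^{\infty} p_i L_i$ is itself a lottery, i.e. an object that the preference ordering and the dominance principles are required to apply to, rather than only finite mixtures being admitted.)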
if it’s possible to have “pivotal” decisions that affect 3^^^3 people, then it’s also possible to have 3^^^3 people in “normal” situations all making their separate (correlated) decisions, eating 3^^^3 sandwiches, and so the stakes of everything are similarly mathematically big.
Agreed.
This seems to put you in a strange position though: you are not only saying that high-value outcomes are unlikely, but that you have no preferences about them. That is, they aren’t merely impossible-in-reality, they are impossible-in-thought-experiments.
Perhaps I’m being dense, but I don’t follow this point. If I deny that my epistemic probabilities are closed under countable weighted sums, and assert that the hypothesis “you can actually play a St. Petersburg game for n steps” is less likely than it is easy-to-describe (as n gets large), in what sense does that render me unable to consider St. Petersburg games in thought experiments?
How would you learn that there may or may not be a 10^100 future people with our choices as the fulcrum? Why would the same process not generalize? (And if it may happen in the future but not now, is that 0 probability?)
The same process generalizes.
My point was not “it’s especially hard to learn that there are 3^^^3 people with our choices as the fulcrum”. Rather, consider the person who says “but shouldn’t our choices be dominated by our current best guesses about what makes the universe seem most enormous, more or less regardless of how implausibly bad those best guesses seem?”. More concretely, perhaps they say “but shouldn’t we do whatever seems most plausibly likely to satisfy the simulator-gods, because if there are simulator gods and we do please them then we could get mathematically large amounts of utility, and this argument is bad but it’s not 1 in 3^^^3 bad, so.” One of my answers to this is “don’t worry about the 3^^^3 happy people until you believe yourself upstream of 3^^^3 happy people in the analogous fashion to how we currently think we’re upstream of 10^50 happy people”.
And for the record, I agree that “maximize option value, figure out what’s going on, stay sane” is another fine response. (As is “I think you have made an error in assessing your insane plan as having higher EV than business-as-usual”, which is perhaps one argument-step upstream of that.)
I don’t feel too confused about how to act in real life; I do feel somewhat confused about how to formally justify that sort of reasoning.
My personal answer is that infinite universes don’t seem infinitely more important than finite universes, and that 2x bigger universes generally don’t seem 2x as important. (I tentatively feel that amongst reasonably-large universes, value is almost independent of size—while thinking that within any given universe 2x more flourishing is much closer to 2x as valuable.)
That sounds like you’re asserting that the amount of possible flourishing limits to some maximum value (as, eg, the universe gets large enough to implement all possible reasonably-distinct combinations of flourishing civilizations)?
I’m sympathetic to this view. I’m not fully sold, of course. (Example confusion between me and that view: I have conflicting intuitions about whether running an extra identical copy of the same simulated happy people is ~useless or ~twice as good, and as such I’m uncertain about whether tiling copies of all combinations of flourishing civilizations is better in a way that doesn’t decay.)
While we’re listing guesses, a few of my other guesses include:
Naturalism resolves the issue somehow. Like, perhaps the fact that you need to be embedded somewhere inside the world with a long St. Petersburg game drives its probability lower than the length of the sentence “a long St. Petersburg game” in a relevant way, and this phenomenon generalizes, or something. (Presumably this would have to come hand-in-hand with some sort of finitist philosophy, that denies that epistemic probabilities are closed under countable combination, due to your argument above.)
There is a maximum utility, namely “however good the best arrangement of the entire mathematical multiverse could be”, and even if it does wind up being the case that the amount of flourishing you can get per-instantiation fails to converge as space increases, or even if it does turn out that instantiating all the flourishing n times is n times as good, there’s still some maximal number of instantiations that the multiverse is capable of supporting or something, and the maximum utility remains well-defined.
The whole utility-function story is just borked. Like, we already know the concept is philosophically fraught. There’s plausibly a utility number, which describes how good the mathematical multiverse is, but the other multiverses we intuitively want to evaluate are counterfactual, and counterfactual mathematical multiverses are dubious above and beyond the already-dubious mathematical multiverse. Maybe once we’re deconfused about this whole affair, we’ll retreat to somewhere like “utility functions are a useful abstraction on local scales” while having some global theory of a pretty different character.
Some sort of ultrafinitism wins the day, and once we figure out how to be suitably ultrafinitist, we don’t go around wanting countable combinations of epistemic probabilities or worrying too much about particularly big numbers. Like, such a resolution could have a flavor where “Nate’s utilities are unbounded” becomes the sort of thing that infinitists say about Nate, but not the sort of thing a properly operating ultrafinitist says about themselves, and things turn out to work for the ultrafinitists even if the infinitists say their utilities are unbounded or w/e.
To be clear, I haven’t thought about this stuff all that much, and it’s quite plausible to me that someone is less confused than me here. (That said, most accounts that I’ve heard, as far as I’ve managed to understand them, sound less to me like they come from a place of understanding, and more like the speaker has prematurely committed to a resolution.)
I think this argument is cool, and I appreciate how distilled it is.
Basically just repeating what Scott said but in my own tongue: this argument leaves open the option of denying that (epistemic) probabilities are closed under countable combination, and deploying some sort of “leverage penalty” that penalizes extremely high-utility outcomes as extremely unlikely a priori.
I agree with your note that the simplicity prior doesn't implement leverage penalties. I also note that I'm pretty uncertain myself about how to pull off leverage penalties correctly, assuming they're a good idea (which isn't clear to me).
I note further that the issue as I see it arises even when all utilities are finite, but some are ("mathematically", not merely cosmically) large (where numbers like 10^100 are cosmically large, and numbers like 3^^^3 are mathematically large). Like, why are our actions not dominated by situations where the universe is mathematically large? When I introspect, it doesn't quite feel like the answer is "because we're certain it isn't", nor "because utility maxes out at the cosmological scale", but rather something more like "how would you learn that there may or may not be 3^^^3 happy people with your choice as the fulcrum?" plus a sense that you should be suspicious that any given action is more likely to get 3^^^3 utility than any other (even in the presence of Pascal's muggers) until you've got some sort of compelling account of how the universe ended up so large and you ended up being the fulcrum anyway. (Which, notably, starts to feel intertwined with my confusion about naturalistic priors, and I have at least a little hope that a good naturalistic prior would resolve the issue automatically.)
Or in other words, "can utilities be unbounded?" is a proxy war for "can utilities be mathematically large?", with the "utilities must be bounded" resolution in the former corresponding (at least seemingly) to "utilities can be at most cosmically large" in the latter. And while that may be the case, I don't yet feel like I understand reasoning in the face of large utilities, and your argument does not dispel my confusion, and so I remain confused.
And, to be clear, I'm not saying that this problem seems intractable to me. There are various lines of attack that seem plausible from here. But I haven't seen anyone providing the "cognitive receipts" from mapping out those lines of reasoning and deconfusing themselves about big utilities. For all I know, "utilities should be bounded (and furthermore, max utility should be at most cosmically large)" is the right answer. But I don't confuse this guess for understanding.
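(An aside on the notation, for readers who haven't met the up-arrows; these are just the standard Knuth definitions, nothing specific to this argument. $3\uparrow n = 3^n$; $3\uparrow\uparrow n$ is a power tower of $n$ threes, so $3\uparrow\uparrow 3 = 3^{3^3} = 3^{27} = 7{,}625{,}597{,}484{,}987$; and 3^^^3 is $3\uparrow\uparrow\uparrow 3 = 3\uparrow\uparrow(3\uparrow\uparrow 3)$, a power tower of threes roughly 7.6 trillion levels tall. That's the sense in which 10^100, a 101-digit number, is merely "cosmically large" while 3^^^3 is "mathematically large": you can't even write the latter down in ordinary notation, let alone normalize it.)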
I agree that it's plausible chess-plans can be compressed without invoking full reasoners (and with a more general point that there are degrees of compression you can do short of full-on 'reasoner', and with the more specific point that I was oversimplifying in my comment). My intent with my comment was to highlight how "but my AI only generates plans" is sorta orthogonal to the alignment question, which is pushed, in the oracle framework, over to "how did that plan get compressed, and what sort of cognition is involved in the plan, and why does running that cognition yield good outcomes".
I have not yet found a pivotal act that seems to me to require only shallow realtime/reactive cognition, but I endorse the exercise of searching for highly specific and implausibly concrete pivotal acts with that property.
This dialog was much less painful for me to read than I expected, and I think it manages to capture at least a little of the version-of-this-concept that I possess and struggle to articulate!
(...that sentence is shorter, and more obviously praise, in my native tongue.)
A few things I’d add (epistemic status: some simplification in attempt to get a gist across):
If there were a list of all the possible plans that cure cancer, ranked by "likely to work", most of the plans that might work route through "consequentialism", and "acquire resources."
Part of what’s going on here is that reality is large and chaotic. When you’re dealing with a large and chaotic reality, you don’t get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to “unroll” that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like “if the experiments come up this way, then I’ll follow it up with this experiment, and if instead it comes up that way, then I’ll follow it up with that experiment”, and etc. This decision tree quickly explodes in size. And even if we didn’t have a memory problem, we’d have a time problem—the thing to do in response to surprising experimental evidence is often “conceptually digest the results” and “reorganize my ontology accordingly”. If you’re trying to unroll that reasoner into a decision-tree that you can write down in advance, you’ve got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.
Reasoners are a way of compressing plans, so that you can say “do some science and digest the actual results”, instead of actually calculating in advance how you’d digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)
Like, you can't make an "oracle chess AI" that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You've gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.
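(A toy rendering of the compression point, with my own made-up types rather than anything from the dialog: an advance plan that covers every contingency is a literal tree with a branch per possible observation, whereas a reasoner is a program you run as observations actually arrive.)

```haskell
-- A fully-unrolled advance plan: commit now to an action, plus a sub-plan for
-- every observation that could come back at each step.
data UnrolledPlan obs act = Done | Step act [(obs, UnrolledPlan obs act)]

-- With b possible observations per step and depth d, such a tree has on the
-- order of b^d nodes:
planNodes :: Integer -> Integer -> Integer
planNodes b d = sum [ b ^ k | k <- [0 .. d] ]
-- planNodes 30 40 is already ~10^59, which is the sense in which even a chess
-- game can't be handed over up front as one big contingency table.

-- The compressed alternative is a reasoner/policy: code that gets *run* on the
-- observations actually received, instead of a tree enumerated in advance.
type Reasoner obs act = [obs] -> act
```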
Like, the outputs you can get out of an oracle AI are "no plan found", "memory and time exhausted", "here's a plan that involves running a reasoner in real-time" or "feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action". In the first two cases, your oracle is about as useful as a rock; in the third, it's the realtime reasoner that you need to align; in the fourth, all the word "oracle" is doing is mollifying you unduly, and it's this "oracle" that you need to align.
(NB: It’s not obvious to me that cancer cures require routing through enough reality-chaos that plans fully named in advance need to route through reasoners; eg it’s plausible that you can cure cancer with a stupid amount of high-speed trial and error. I know of no pivotal act, though, that looks so easy to me that nonrealtime-plans can avoid the above quadlemma.)
And it’s not obvious to me whether this problem gets better or worse if you’ve tried to train the oracle to only output “reasonable seeming plans”
My point above addresses this somewhat, but I’m going to tack another point on for good measure. Suppose you build an oracle and take the “the plan involves a realtime reasoner” fork of the above quadlemma. How does that plan look? Does the oracle say “build the reasoner using this simple and cleanly-factored mind architecture, which is clearly optimizing for thus-and-such objectives?” If that’s so easy, why aren’t we building our minds that way? How did it solve these alignment challenges that we find so difficult, and why do you believe it solved them correctly? Also, AIs that understand clean mind-architectures seem deeper in the tech tree than AIs that can do some crazy stuff; why didn’t the world end five years before reaching this hypothetical?
Like, specifying a working mind is hard. (Effable, transparent, and cleanly-factored minds are harder still, apparently.) You probably aren't going to get your first sufficiently-good-reasoner from "project oracle" that's training a non-interactive system to generate plans so hard that it invents its own mind architectures and describes their deployment, you're going to get it from some much more active system that is itself a capable mind before it knows how to design a capable mind, like (implausible detail for the purpose of concrete visualization) the "lifelong learner" that's been chewing through loads and loads of toy environments while it slowly accretes the deep structures of cognition.
Maybe once you have that, you can go to your oracle and be like “ok, you’re now allowed to propose plans that involve deploying this here lifelong learner”, but of course your lifelong learner doesn’t have to be a particularly alignable architecture; its goals don’t have to be easily identifiable and cleanly separable from the rest of its mind.
Which is mostly just providing more implausible detail that makes the “if your oracle emits plans that involve reasoners, then it’s the reasoners you need to align” point more concrete. But… well, I’m also trying to gesture at why the “what if we train the oracle to only output reasonable plans?” thought seems, to me, to come at it from a wrong angle, in a manner that I still haven’t managed to precisely articulate.
(I’m also hoping this conveys at least a little more of why the “just build an oracle that does alignment research” looks harder than doing the alignment research our own damn selves, and I’m frustrated by how people give me a pitying look when I suggest that humanity should be looking for more alignable paradigms, and then turn around and suggest that oracles can do that no-problem. But I digress.)
I never told Jessica not to talk to someone (or at the very least, I don't recall it and highly doubt it). IIRC, in that time period, Jessica and one other researcher were regularly inviting Michael to the offices and talking to him at length during normal business hours. IIRC, the closest I came to "telling Jessica not to talk to someone" was expressing dissatisfaction with this state of affairs. The surrounding context was that Jessica had suffered performance (or at least Nate-legible-performance) degradation in the previous months, and we were meeting more regularly in attempts to see if we could work something out, and (if memory serves) I expressed skepticism about whether lengthy talks with Michael (in the office, during normal business hours) would result in improvement along that axis. Even then, I am fairly confident that I hedged my skepticism with caveats of the form "I don't think it's a good idea, but it's not my decision".
Status: writing-while-frustrated. As with the last post, many of Jessica’s claims seem to me to be rooted in truth, but weirdly distorted. (Ever since the end of Jessica’s tenure at MIRI, I have perceived a communication barrier between us that has the weird-distortion nature.)
Meta: I continue to be somewhat hesitant to post stuff like this, on the grounds that it sucks to air dirty laundry about your old employer and then have your old employer drop by and criticize everything you said. I’ve asked Jessica whether she objects to me giving a critical reply, and she said she has no objections, so at least we have that. I remain open to suggestions for better ways to navigate these sorts of situations.
Jessica, I continue to be sad about the tough times you had during the end of your tenure at MIRI, and in the times following. I continue to appreciate your research contributions, and to wish you well.
My own recollections follow. Note that these are limited to cases where Jessica cites me personally, in the interest of time. Note also that I’m not entirely sure I’ve correctly identified the conversations she’s referring to, due to the blurring effects of the perceived distortion, and of time. And it goes almost-without-saying that my own recollections are fallible.
As a MIRI employee I was coerced into a frame where I was extremely powerful and likely to by-default cause immense damage with this power, and therefore potentially responsible for astronomical amounts of harm. I was discouraged from engaging with people who had criticisms of this frame, and had reason to fear for my life if I published some criticisms of it.
My own frame indeed says that the present is the hinge of history, and that humans alive today have extreme ability to affect the course of the future, and that this is especially true of humans working in the AI space, broadly construed. I wouldn't personally call this "power"—I don't think anyone in the space has all that much power in the present. I think a bunch of people in the AI-o-sphere have a lowish likelihood of a large amount of future power, and thus high future expected power, which is kinda like power. From my perspective, MIRI researchers have less of this than researchers from current top AI labs, but do have a decent amount. My own model does not predict that MIRI researchers are likely to cause astronomical harm by default. I do not personally adhere to the Copenhagen interpretation of ethics, and in the event that humanity destroys itself, I would not be assigning extra blame to alignment researchers on the grounds that they were closer to the action. I'm also not personally very interested in the game of pre-assigning blame, favoring object-level alignment research instead.
Insofar as I was influencing Jessica with my own frame, my best guess is that she misunderstood my frame, as evidenced by these differences between the frame she describes feeling coerced into, and my own picture.
I don’t recall ever discouraging Jessica from engaging with people who had criticisms of my frame. I readily admit that she was talking to folks I had little intellectual respect for, and I vaguely remember some of these people coming up in conversation and me noting that I lacked intellectual respect for them. To the best of my recollection, in all such instances, I added caveats of the form “but, just because I wouldn’t doesn’t mean you shouldn’t”. I readily admit that my openness about my lack of intellectual respect may have been taken as discouragement, especially given my position as her employer. The aforementioned caveats were intended to counteract such forces, at least insofar as I recall.
I was aware at the time that Jessica and I didn’t see eye-to-eye on various issues. I remember at least two occasions where I attempted to explicitly convey that I knew we didn’t see eye-to-eye, that it was OK for her to have views that didn’t match mine, and that I encouraged her to think independently and develop her own views.
Jessica said she felt coerced into a frame she found uncomfortable, and I believe her. My notes here are not intended to cast doubt on the honesty of her reports. My intent in saying all this is merely to express that (1) the frame she reports feeling coerced into is not one that I recognize, nevermind one that I intentionally coerced her into; and (2) I was aware of the pressures and actively tried to counteract them. Clearly, I failed at this. (And I have a decent chunk of probability mass that Jessica would clarify that she's not accusing me of intentional coercion.) From my own perspective, she was misreading my own frame and feeling pressured into it despite significant efforts on my part to ameliorate the pressure. I happily solicit advice for what to do better next time, but do not consider my conduct to have been a mistake.
talked about hellscapes
I don’t recall ever “talking about hellscapes” per se. I recall mentioning them in passing, rarely. In my recollection, that mainly happened in response to someone else broaching the topic of fates worse than death. (Maybe there were other occasional throwaway references? But I don’t recall them.) My cached reply to others raising the idea of fates worse than death went something like:
“Goal-space is high dimensional, and almost all directions of optimization seem likely to be comparably bad to death from our perspective. To get something that is even vaguely recognizable to human values you have to be hitting a very narrow target in this high-dimensional space. Now, most of that target is plausibly dystopias as opposed to eutopias, because once you’re in the neighborhood, there are a lot of nearby things that are bad rather than good, and value is fragile. As such, it’s reasonable in principle to worry about civilization getting good enough at aiming AIs that they can hit the target but not the bullseye, and so you might worry that that civilization is more likely to create a hellscape than a eutopia. I personally don’t worry about this myself, because it seems to me that the space is so freaking high dimensional and the target so freaking small, that I find it implausible that a civilization successfully able to point an AI in a human-relevant direction, isn’t also able to hit the bullseye. Like, if you’re already hitting a quarter with an arrowhead on the backside of the moon, I expect you can also hit a dime.”
Two reasons I’d defend mentioning hellscapes in such situations: firstly, to demonstrate that I at least plausibly understood the concern my conversation partner had raised (as a matter of course before making a counterpoint), and secondly, so as to not undermine our friends working on S-risk reduction (a program I support).
My reason for not hesitating to use terms like “hellscapes” rather than more banal and less evocative terms was (to the best of my current recollection) out of a desire to speak freely and frankly, at least behind closed doors (eg, in the privacy of the MIRI research space). At the time, there was a bunch of social pressure around to stop thought experiments that end with the AI escaping and eating the galaxy, and instead use thought experiments about AIs that are trying to vacuum the house and accidentally break a lamp or whatever, and this used to rub me the wrong way. The motivations as I previously understood them were that, if you talk about star-eating rather than lamp-breaking, then none of the old guard AI researchers are ever going to take your field seriously. I thought (and still think) this is basically a bad reason. However, I have since learned a new reason, which is that mentioning large-scale disasters freely and frankly, might trigger psychotic episodes in people predisposed to them. I find this a much more compelling reason to elide the high-valence examples.
(Also, the more banal term “S-risk” hadn’t propagated yet, IIRC.)
Regardless, I have never thought in detail about fates worse than death, never mind discussed them in any depth. I have no reason to, and I recommend against it. My occasionally mentioning something in passing, and Jessica glossing it as “Nate talked about it” (with an implication of depth and regularity), is a fine example of the “weird distortion” I perceive in Jessica’s accounts.
I was told, by Nate Soares, that the pieces to make AGI are likely already out there and someone just has to put them together.
I contest this. According to my best recollection of the conversation that I think Jessica is referring to, she was arguing that AGI would not arrive in our lifetimes, and seemed unresponsive to my attempts to argue that a confident claim of long timelines requires positive knowledge. At that point I exasperatedly remarked that, for all we knew, the allegedly missing AGI insights had not only already been had, but published in the literature, and all that remained was someone figuring out how to assemble them. (cf. no one knows what science doesn’t know.) I do not assign particularly high credence to this claim myself, and (IIRC) I was using it rhetorically, to test for acknowledgement of the idea that confident long timelines require positive knowledge that we seem to lack.
(This seems like another central example of my throwaway lines becoming weirdly distorted and heavily highlighted in Jessica’s recounting.)
He did not tell me anything about how to make such an AGI, on the basis that this would be dangerous.
Here Jessica seems to be implying that, not only did I positively claim that the pieces of AGI were already out there in the literature, but also that I had personally identified them? I deny that, and I’m not sure what claim I made that Jessica misunderstood in that way. Given the surrounding context where Jessica made this claim, my guess is that it was in the same conversation as the exasperated remark described above, and that the conversation past that point became so desynched that Jessica’s recounting is no longer recognizable to me.
To be clear, I have claimed that AI alignment work is sometimes intertwined with AI capabilities work, and I have claimed that capabilities insights shouldn’t be publicized (as a strong default) on account of the negative externalities. Perhaps I said something along those lines that got distorted into Jessica’s claim?
Instead, he encouraged me to figure it out for myself, saying it was within my abilities to do so.
I occasionally recommend that our researchers periodically (every 6mo or so) open a text file and see if they can write pseudocode for an FAI (ignoring computational constraints, at least for now), to help focus their attention on exactly where they’re confused and ground out their alignment research in things that are literally blocking them from actually writing a flippin’ FAI. I don’t recall ever telling Jessica that I thought she could figure out how to build an AGI herself. I do recall telling her I expected she could benefit from the exercise of attempting to write the pseudocode for an FAI.
If memory serves, this is an exercise I’d been advocating for a couple years before the time period that Jessica’s discussing (and IIRC, I’ve seen Jessica advocate it, or more general variants like “what could you do with a hypercomputer on a thumb drive”, as an exercise to potential hires). One guess as to what’s going on is that I tried to advocate the exercise of pseudocoding an FAI as I had many times before, but used some shorthand for it that I thought would be transparent, in some new conversational context (eg, shortly after MIRI switched to non-disclosure by default), and while Jessica was in some new mental state, and Jessica misunderstood me as advocating figuring out how to build an AGI all on her own while insinuating that I thought she could?
[From the comments, in answer to the query “How did you conclude from Nate Soares saying that the tools to create AGI likely already exist that he wanted people to believe he knew how to construct one?”] Because he asked me to figure it out in a way that implied he already had a solution; the assignment wouldn’t make sense if it were to locate a non-workable AGI design (as many AI researchers have done throughout the history of the field); that wouldn’t at all prove that the pieces to make AGI are already out there. Also, there wouldn’t be much reason to think that his sharing a non-workable AGI design with me would be dangerous.
In light of this, my guess is that Jessica flatly misread my implications here.
To be very explicit: Jessica, I never believed you capable of creating a workable AGI design (using, say, your 2017 mind, unaugmented, in any reasonable amount of time). I also don’t assign particularly high credence to the claim that the insights are already out in the literature waiting to be found (or that they were in 2017). Furthermore, I never intentionally implied that I have myself succeeded at the “pseudocode an FAI” exercise so hard as to have an AGI design. Sorry for the miscommunication.
Researchers were told not to talk to each other about research, on the basis that some people were working on secret projects and would have to say so if they were asked what they were working on.
This suggests a picture of MIRI’s nondisclosure-by-default policies that’s much more top-down than reality, similar to a correction I made on a post by Evan Hubinger a few years ago.
The sequence of events as I recall them was: various researchers wanted to do some closed research. There was much discussion about how much information should be private: Research results? Yes, if the project lead wants privacy. Research directions? Yes, if the project lead wants privacy. What about the participant list for each project? Can each project determine their own secrecy bounds individually, or is revealing who’s working with you defecting against (possibly-hypothetical) projects that don’t want to disclose who they’re working with? Etc., etc. I recall at least one convo with a bunch of researchers where, in an effort to get everyone to stop circling privacy questions like moths to a flame and get back to the object-level research, I said something to the effect of “come to me if you’re having trouble”.
I separately recall Jessica coming to me afterwards and asking a bunch more questions about who she could ask about what. I recall trying to convey something like “just work on what you want to work on, with whatever privacy level you want; if someone working on something closed wants you working with them, they’ll let you know (perhaps through me, if they want to), and you can bang out details with them as need be”.
The point that people shouldn’t have to reveal whether they are in fact working on closed research, if they don’t want to, sounds like the sort of thing that came up in one or both of those conversations, and my guess is that that’s what Jessica’s referring to here. From my perspective, that wasn’t a particularly central point; the point I recall attempting to drive home was more like “let’s just work on the object-level research and not get all wound up about privacy (especially because all we’ve changed are the defaults, and you’re still completely welcome to publicize your own research, with my full support, as much as you’d like)”.
Nate Soares also wrote a post discouraging people from talking about the ways they believe others to be acting in bad faith.
According to me, I was not trying to say “you shouldn’t talk about ways you believe others to be acting in bad faith”. I was trying to say “I think y’all are usually mistaken when you’re accusing certain types of other people of acting in bad faith”, plus “accusing people of acting in bad faith [in confrontational and adversarial ways, instead of gently clarifying and confirming first] runs a risk of being self-fulfilling, and also burns a commons, and I’m annoyed by the burned commons”. I think those people are wrong and causing negative externalities, not that they’re bad for reporting what they believe.
Note that the sort of talking the post explicitly pushes back against is arguments of the form “person X is gaining [status|power|prestige] through their actions, therefore they are untrustworthy and have bad intentions”, which I believe to be invalid. Had I predicted Jessica’s particular misread in advance, I would have explicitly noted that I’m completely ok with arguments of the form “given observations X and Y, I have non-trivial probability on the hypothesis that you’re acting in bad faith, which I know is a serious allegation. Are you acting in bad faith? If not, how do you explain observations X and Y?”.
In other words, the thing I object to is not the flat statement of credence on the hypothesis “thus-and-such is acting in bad faith”, it’s the part where the author socially rallies people to war on flimsy pretenses.
In other other words, I both believe that human wonkiness makes many people particularly bad at calculating P(bad-faith|the-evidence), and recommend being extra charitable and cautious when it feels like that probability is spiking. Separately but relatedly: in the moment that you move from stating your own credence that someone is acting in bad faith, to socially accusing them of acting in bad faith, my preferred norms require a high degree of explicit justification.
And, to be explicit: I think that most of the people who are acting in bad faith will either say “yes I’m acting in bad faith” when you ask them, or will sneer at you or laugh at you or make fun of you instead, which is just as good. I think a handful of other people are harmful to have around regardless of their intentions, and my guess is that most community decisions about harmful people should revolve around harm rather than intent.
(Indeed, I suspect the community needs to lower, rather than raise, the costs of shunning someone who’s doing a lot of harm. But I reiterate that I think such decisions should center on the harm, not the intent, and as such I continue to support the norm that combative/adversarial accusations of ill intent require a high degree of justification.)
Nate Soares expressed discontent that Michael Vassar was talking with “his” employees, distracting them from work.
I don’t actually know what conversation this is referring to. I recall a separate instance, not involving Jessica, of a non-researcher spending lots of time in the office hanging out and talking with one of our researchers, and me pulling the researcher aside and asking whether they reflectively endorsed having those conversations or whether they kept getting dragged into them and then found themselves unable to politely leave. (In that case, the researcher said they reflectively endorsed them, and thereafter I left them alone.)
There might have been a time when Michael Arc (né Vassar) was spending a lot of time talking to Jessica and one other employee, and I said something about how I don’t have much intellectual respect for Michael? I doubt I said this unsolicited, but I definitely would have said it if anyone asked, and I at least vaguely remember something like that happening once or twice. It’s also possible that, towards the end of Jessica’s tenure, when we were trying to have scheduled meetings to see if we could bridge the communications barrier, it came up naturally in conversation? But I’m not sure, as (unlike with most of the other claims) I don’t concretely recognize this reference.
It should be noted that, as I was nominally Nate’s employee, it is consistent with standard business practices for him to prevent me from talking with people who might distract me from my work during office hours.
I’m confident I did not prevent anyone from talking to anyone. I occasionally pulled people aside and asked them if they felt trapped in a given conversation when someone was loitering in the office having lots of conversations, so that I could rescue them if need be. I occasionally answered honestly, when asked, what I thought about people’s conversation partners. I leave almost all researchers to their own devices (conversational or otherwise) almost all of the time.
In Jessica’s particular case, she was having a lot of difficulty at the workplace, and so I stepped deeper into the management role than I usually do and we spent more time together seeing whether we could iron out our difficulties or whether we would need to part ways. It’s quite plausible that, during one of those conversations, I noted of my own accord that she was spending lots of her office-time deep in conversation with Michael, and that I didn’t personally expect this to help Jessica get back to producing alignment research that passed my research quality bar. But I am confident that, insofar as I did express my concerns, it was limited to an expression of skepticism. I… might have asked Michael to stop frequenting the offices quite so much? But I doubt it, and I have no recollection of such a thing.
I am confident I didn’t ever tell anyone not to talk to someone else; that feels way out-of-line to me. I may well have said things along the lines of “I predict that that conversation will prove fruitless”, which Jessica interpreted as a guess-culture-style command? I tend to guard against that interpretation by adding hedges of the form “but I’m not you” or whatever, but perhaps I neglected to, or perhaps it fell on deaf ears?
Or perhaps Jessica’s just saying something along the lines of “I feared that if I kept talking to Michael all day, I’d be fired, and Nate expressing that he didn’t expect those conversations to be productive was tantamount to him saying that if I continued he’d fire me, which was tantamount to him saying that I can’t talk to Michael”? In which case, my prediction is indeed that if she hadn’t left MIRI of her own accord, and her research performance didn’t rebound, at some point I would have fired her on the grounds of poor performance. And in worlds where Jessica kept talking to Michael all the time, I would have guessed that a rebound was somewhat less likely, because I didn’t expect him to provide useful meta-level or object-level insights that would lead to downstream alignment progress. But I’m an empiricist, and I would have happily tested my “talking to Michael doesn’t result in Nate-legible research output” hypothesis, after noting my skepticism in advance.
(Also, for the record, “Nate-legible research output” does not mean “research that is useful according to Nate’s own models”. Plenty of MIRI researchers disagree with me and my frames about all sorts of stuff, and I’m happy to have them at MIRI regardless, given that they’ve demonstrated the ability to seriously engage with the problem. I’m looking for something more like a cohesive vision that the researcher themself believes in, not research that necessarily strikes me personally as directly useful.)
MIRI certainly had a substantially conflict-theoretic view of the broad situation, even if not the local situation. I brought up the possibility of convincing DeepMind people to care about AI alignment. MIRI leaders including Eliezer Yudkowsky and Nate Soares told me that this was overly naive, that DeepMind would not stop dangerous research even if good reasons for this could be given.
I contest this. I endorse talking with leadership at leading AI labs, and have done so in the past, and expect to continue doing so in the future.
It’s true that I don’t expect any of the leading labs to slow down or stop soon enough, and it’s true that I think converging beliefs takes a huge time investment. On the mainline, I predict that the required investment won’t in fact be paid in the relevant cases. But, as I told Jessica at the time (IIRC), I expect folks at leading AGI labs to be much more responsive to solutions to the alignment problem than to arguments alone, despite the fact that I don’t think you can talk them into giving up public capabilities research in practice. (This might be what she misunderstood as me saying we’d have better luck “competing”? I don’t recall saying any such thing, but I do recall saying that we’d have better luck solving alignment first and persuading second.)
(And for the record, while I think these big labs are making a mistake, it’s a very easy mistake to make: knowing that you’re in the bad Nash equilibrium doesn’t let you teleport to a better one, and it’s at least understandable that each individual capabilities lab thinks that they’re better than the next guy, or that they can’t get the actual top researchers if they implement privacy protocols right out the gate. It’s an important mistake, but not a weird one that requires positing unusual levels of bad faith.)
In case it’s not clear from the above, I don’t have much sympathy for conflict theory in this, and I definitely don’t think in broadly us-vs-them terms about the AGI landscape. And (as I think I said at the time) I endorse learning how to rapidly converge with people. I recommend figuring out how to more rapidly converge with friends before burning the commons of time-spent-converging-with-busy-people-who-have-limited-attention-for-you, but I still endorse figuring it out. I don’t expect it to work, and I think solving the dang alignment problem on the object-level is probably a better way to convince people to do things differently, but also I will cheer on the sidelines as people try to figure out how to get better and faster at converging their beliefs.
There’s no law saying that, when someone’s making a mistake, there’s some way to explain it to them such that suddenly it’s fixed. I think existing capabilities orgs are making mistakes, at the very least in publishing capabilities advances. (Credit where credit is due: various labs are doing better than they used to at keeping their cutting-edge results private, at least until somebody else replicates or nearly-replicates them. But to be clear, I think we have a long way to go before I stop saying that I believe I see a big mistake.) And I deny the implicit inference from “you can’t quickly convince someone with words that they’re making a mistake” to “you must be using conflict theory”.
I was concerned about the linked misleading statement in 2017 and told Nate Soares and others about it, although Nate Soares insisted that it was not a lie, because technically the word “excited” could indicate the magnitude of a feeling rather than the positiveness of it.
That doesn’t seem to me like a good characterization of my views.
My recollection is that, in my conversation about this topic with Jessica, I was trying to convey something more like “Yeah, I’m pretty worried that they’re going to screw lots of things up. And the overt plan to give AGI to everyone is dumb. But also there are a bunch of sane people trying to redirect OpenAI in a saner direction, and I don’t want to immediately sic our entire community on OpenAI and thereby ruin their chances. This whole thing looks real high-variance, and at the very least this is “exciting” in the sense that watching an adventure movie is exciting, even in the parts where the plot is probably about to take a downturn. That said, there’s definitely a sense in which I’m saying things with more positive connotations than I actually feel—like, I do feel some real positive hope here, but I’m writing lopsidedly from those hopes. This is because the blog post is an official MIRI statement about a new AI org on the block, and my sense of politics says that if a new org appears on the block and you think they’re doing some things wrong, then the politic thing to do initially is talk about their good attributes out loud, while trying to help redirect them in private.”
For the record, I think I was not completely crazy to have some hope about OpenAI at the time. As things played out, they wound up pretty friendly to folks from our community, and their new charter is much saner than their original plan. That doesn’t undo the damage of adding a new capabilities shop at that particular moment in that particular way; but there were people trying behind the scenes who did, in real life, manage to do something, and so having some advance hope before they played their hands out was a plausible mistake to make, before seeing the actual underwhelming history unfold.
All that said, I do now consider this a mistake, both in terms of my “don’t rock the boat” communications strategy and in terms of how well I thought things might realistically go at OpenAI if things went well there. I have since updated, and appreciate Jessica for being early in pointing out that mistake. I specifically think I was mistaken in making public MIRI blog posts with anything less than full candor.
While someone bullshitting on the public Internet doesn’t automatically imply they lie to their coworkers in-person, I did not and still don’t know where Nate is drawing the line here.
As I said to Jessica at the time (IIRC), one reason I felt (at the time) that the blog post was fine, is that it was an official MIRI-organization announcement. When speaking as an organization, I was (at the time) significantly more Polite and significantly more Politically Correct and significantly less Dour (and less opinionated and more uncertain, etc).
Furthermore, I (wrongly) expected that my post would not be misleading, because I (wrongly) expected my statements made as MIRI-the-org to be transparently statements made as MIRI-the-org, and for such statements to be transparently Polite and Politically Correct, and thus not very informative one way or another. (In case it wasn’t clear, I now think this was a mistake.)
That said, as I told Jessica at the time (IIRC), you can always just ask me whether I’m speaking as MIRI-the-organization or whether I’m speaking as Nate. Similarly, when I’m speaking as Nate-the-person, you can always just ask me about my honesty protocols.
I have since updated against the idea that I should ever speak as MIRI-the-organization, and towards speaking uniformly with full candor as Nate-the-person. I’m not sure I’ll follow this perfectly (I’d at least slip back into politic-speak if I found myself cornered by a journalist), but again, you can always just ask.
My take on the exercise:
Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one’s probability distribution over AGI timelines, thereby moving its median further away in time?
Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don’t think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong in your calculations of the quantities you did consider, correction will regress you towards your priors, which are simplicity-based rather than maxent.
Long version: Let’s set aside for the moment the question of what the “correct” maxent distribution on AGI timelines is (which, as others have noted, depends a bit on how you dice up the space of possible years). I don’t think this is where the action is, anyway.
Let’s suppose that we’re aspiring Bayesians considering that we may have made some mistakes in our calculations. Where might those mistakes have been? Perhaps:
We were mistaken about what we saw (and erroneously updated on observations that we did not make)?
We were wrong in our calculations of quantities of the form P(e|H) (the likelihoods) or P(H) (the priors), or the multiplications thereof?
We failed to consider a sufficiently wide space of hypotheses, in our efforts to complete our updating before the stars burn out?
Set aside for now that the correct answer is “it’s #3, like we might stumble over #1 and #2 every so often but bounded reasoners are making mistake #3 day in and day out, it’s obviously mostly #3”, and take these one at a time:
Insofar as we were mistaken about what we saw, correcting our mistake should involve reverting an update (and then probably making a different update, because we did see something, even if we mistook what it was, but set that aside). Reverting an update pushes us back towards our prior. This will often increase entropy, but not necessarily! (For example, if we thought we saw a counter-example to gravitation, that update might dramatically increase our posterior entropy, and reverting it might take us back to confident, narrow predictions about phones falling; see the toy numeric sketch after this enumeration.) Our prior is not a maxent prior but a simplicity prior (which is important if we ever want to learn anything at all).
Insofar as we were wrong in our calculations of various quantities, correcting our mistake depends on which direction we were wrong, and for which hypotheses. In practice, a reflectively stable reasoner shouldn’t be able to predict the (magnitude-weighted) direction of their error in calculating P(e|H): if we know that we tend to overestimate that value when e is floobish, we can just bump down our estimate whenever e is floobish, until we stop believing such a thing (or, more intelligently, trace down the source of the systematic error and correct it, but I digress). I suppose we could imagine humbly acknowledging that we’re imperfect at estimating quantities of the form P(e|H), and then driving all such estimates towards 1/n, where n is the number of possible observations? This doesn’t seem like a very healthy way to think, but its effect is to again regress us towards our prior. Which, again, is a simplicity prior and not a maxent prior. (If instead we start what-iffing about whether we’re wrong in our intuitive calculations that vaguely correspond to the P(H) quantities, and decide to try to make all our P(H) estimates more similar to each other regardless of H as a symbol of our virtuous self-doubt, then we start regressing towards maximum entropy. We correspondingly lose our ability to learn. And of course, if you’re actually worried that you’re wrong in your estimates of the prior probabilities, I recommend checking whether you think your P(H)-style estimates are too high or too low in specific instances, rather than driving all such estimates to uniformity. But also ¯\_(ツ)_/¯, I can’t argue good priors into a rock.)
Insofar as we were wrong because we were failing to consider a sufficiently wide array of hypotheses, correcting our mistake depends on which hypotheses we’re missing. Indeed, much of Eliezer’s dialog seems to me like Eliezer trying to say “it’s mistake #3, guys, it’s always #3”, plus “just as the hypothesis that we’ll get AGI at 20 watts doesn’t seem relevant, because the ways computers consume watts and the ways brains consume watts are radically different, so too can we predict that whatever the correct specific hypothesis is for how the first AGIs humanity attains will consume compute, it will make the amount of compute that human brains consume seem basically irrelevant.” Like, if we don’t get AGI till 2050 then we probably can’t currently consider the correct specific research path, a la #3, but we can predict various properties of all the plausible unvisualized paths, and adjust our current probabilities accordingly, in acknowledgement of our current #3-style errors.
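To make the #1 and #2 cases above concrete, here is a minimal numeric sketch. The helpers `entropy` and `posterior` and all the toy numbers are mine, chosen purely for illustration: it checks that reverting a mistaken update can decrease entropy rather than increase it, and that flattening your likelihood estimates regresses you towards your prior, whereas flattening your prior regresses you towards maximum entropy.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def posterior(prior, likelihood):
    p = prior * likelihood
    return p / p.sum()

# Mistake #1: reverting a mistaken update need not increase entropy.
prior_g = np.array([0.99, 0.01])       # H0 = "gravity as usual", H1 = "gravity sometimes fails"
lik_mistaken = np.array([0.001, 0.5])  # likelihoods for the observation we only *thought* we made
post_g = posterior(prior_g, lik_mistaken)
print(entropy(post_g), "->", entropy(prior_g))  # reverting the update *lowers* entropy here

# Mistake #2: two different "humble" corrections push in different directions.
prior = np.array([0.9, 0.09, 0.01])    # simplicity-flavored prior over three hypotheses
lik = np.array([0.8, 0.3, 0.05])       # P(e | H_i) for some observation e
flat_lik = 0.5 * lik + 0.5 / 3         # drive likelihoods towards a common constant (the 1/n move)
flat_prior = 0.5 * prior + 0.5 / 3     # drive the prior towards uniform (maxent)

print(posterior(prior, lik))           # the original posterior
print(posterior(prior, flat_lik))      # regressed back towards `prior`
print(posterior(flat_prior, lik))      # higher-entropy; the simplicity prior's work is undone
```

(The numbers are picked only so the directions of the effects are visible at a glance; nothing hinges on them.)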
In sum: accounting for wrongness should look less like saying “I’d better inject more entropy into my distributions”, and more like asking “are my estimates of P(e|H) off in a predictable direction when e looks like this and H looks like that?”. The former is more like sacrificing some of your hard-won information on the altar of the gods of modesty; the latter is more like considering the actual calculations you did and where the errors might reside in them. And even if you insist on sacrificing some of your information because maybe you did the calculations wrong, you should regress towards a simplicity prior rather than towards maximum entropy (which in practice looks like reaching for fewer and simpler-seeming deep regularities in the world, rather than pushing median AGI timelines out to the year 52,021), which is also how things will look if you think you’re missing most of the relevant information. Though of course, your real mistake was #3; you’re ~always committing mistake #3. And accounting for #3 in practice does tend to involve increasing your error bars until they are wide enough to include the sorts of curveballs that reality tends to throw at you. But the reason for widening your error bars there is to include more curveballs, not just to add entropy for modesty’s sake. And you’re allowed to think about all the predictable-in-advance properties of likely ballcurves even if you know you can’t visualize-in-advance the specific curve that the ball will take.
In fact, Eliezer’s argument reads to me like it’s basically “look at these few and simple-seeming deep regularities in the world” plus a side-order of “the way reality will actually go is hard to visualize in advance, but we can still predict some likely properties of all the concrete hypotheses we’re failing to visualize (which in this case invalidate biological anchors, and pull my timelines closer than 2051)”, both of which seem to me like hallmarks of accounting for wrongness.
Yep!
For sure. It’s tricky to wipe out humanity entirely without optimizing for that in particular: nuclear war, climate change, and extremely bad natural pandemics look to me like they’re at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that’s specifically optimized for this task (than it is to develop AGI), but we don’t see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there’s a large, competent, state-funded, successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as it can, pouring billions of dollars into it per year, and by default what happens when it succeeds is that it accidentally unleashes an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it’s optimizing).