Cool. I hadn’t thought to frame those problems in predictor terms, and I agree now that “only matters in multi-agent dilemmas” is incorrect.
That said, it still seems to me like policy selection only matters in situations where, conceptually, winning requires something like multiple agents who run the same decision algorithm meeting and doing a bit of logically-prior coordination, and something kind of like this separates things like transparent Newcomb’s problem (where policy selection is not necessary) from the more coordination-shaped cases. The way the problems are classified in my head still involves me asking myself the question “well, do I need to get together and coordinate with all of the instances of me that appear in the problem logically-beforehand, or can we each individually wing it once we see our observations?”
If anyone has examples where this classification is broken, I remain curious to hear them. Or, similar question: is there any disagreement on the weakened claim, “policy selection only matters in situations that can be transformed into multi-agent problems, where a problem is said to be ‘multi-agent’ if the winning strategy requires the agents to coordinate logically-before making their observations”?
I think Eliezer’s goal was mainly to illustrate the kind of difficulty FAI is, rather than the size of the difficulty. But they aren’t totally unrelated; basic conceptual progress and coming up with new formal approaches often require a fair amount of serial time (especially where one insight is needed before you can even start working toward a second insight), and progress is often sporadic compared to more applied/well-understood technical goals.
It would usually be extremely tough to estimate how much work was left if you were actually in the “rocket alignment” hypothetical—e.g., to tell with confidence whether you were 4 years or 20 years away from solving “logical undiscreteness”. In the real world, similarly, I don’t think anyone knows how hard the AI alignment problem is. If we can change the character of the problem from “we’re confused about how to do this in principle” to “we fundamentally get how one could align an AGI in the real world, but we haven’t found code solutions for all the snags that come with implementation”, then it would be much less weird to me if you could predict how much work was still left.
Nate says: “You may have a scenario in mind that I overlooked (and I’d be interested to hear about it if so), but I’m not currently aware of a situation where the 1.1 patch is needed that doesn’t involve some sort of multi-agent coordination. I’ll note that a lot of the work that I (and various others) used to think was done by policy selection is in fact done by not-updating-on-your-observations instead. (E.g., FDT agents refuse blackmail because of the effects this has in the world where they weren’t blackmailed, despite how their observations say that that world is impossible.)”
Nate says: “The main datapoint that Rob left out: one reason we don’t call it UDT (or cite Wei Dai much) is that Wei Dai doesn’t endorse FDT’s focus on causal-graph-style counterpossible reasoning; IIRC he’s holding out for an approach to counterpossible reasoning that falls out of evidential-style conditioning on a logically uncertain distribution. (FWIW I tried to make the formalization we chose in the paper general enough to technically include that possibility, though Wei and I disagree here and that’s definitely not where the paper put its emphasis. I don’t want to put words in Wei Dai’s mouth, but IIRC, this is also a reason Wei Dai declined to be listed as a co-author.)”
My model is that ‘FDT’ is used in the paper instead of ‘UDT’ because:
1. The name ‘UDT’ seemed less likely to catch on.
2. The term ‘UDT’ (and ‘modifier+UDT’) had come to refer to a bunch of very different things over the years. ‘UDT 1.1’ is a lot less ambiguous, since people are less likely to think that you’re talking about an umbrella category encompassing all the ‘modifier+UDT’ terms; but it’s a bit of a mouthful.
3. I’ve heard someone describe ‘UDT’ as “FDT + a theory of anthropics” -- i.e., it builds in the core idea of what we’re calling “FDT” (“choose by imagining that your (fixed) decision function takes on different logical outputs”), plus a view to the effect that decisions+probutilities are what matter, and subjective expectations don’t make sense. Having a name for the FDT part of the view seems useful for evaluating the subclaims separately.
4. The FDT paper introduces the FDT/UDT concept in more CDT-ish terms (for ease of exposition), so I think some people have also started using ‘FDT’ to mean something like ‘variants of UDT that are more CDT-ish’, which is confusing given that FDT was originally meant to refer to the superset/family of UDT-ish views. Maybe that suggests that researchers feel more of a need for new narrow terms to fill gaps, since it’s less often necessary in the trenches to crisply refer to the superset.
Your comment here makes it sound like the FDT paper said “the difference between UDT 1.1 and UDT 1.0 isn’t important, so we’ll just endorse UDT 1.0”, whereas what the paper actually says is:
In the authors’ preferred formalization of FDT, agents actually iterate over policies (mappings from observations to actions) rather than actions. This makes a difference in certain multi-agent dilemmas, but will not make a difference in this paper. [...]
As mentioned earlier, the authors’ preferred formulation of FDT actually intervenes on the node FDT(−) to choose not an action but a policy which maps inputs to actions, to which the agent then applies her inputs in order to select an action. The difference only matters in multi-agent dilemmas so far as we can tell, so we have set that distinction aside in this paper for ease of exposition.
I don’t know why it claims the difference only crops up in multi-agent dilemmas, if that’s wrong.
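To make the action-vs.-policy distinction concrete, here is a minimal sketch in Python. The toy payoff table and the copy/observation setup are my own invention for illustration, not an example from the paper:

```python
from itertools import product

# Toy problem (my own construction, not an example from the FDT paper):
# two copies of the same agent are created; one is shown "A", the other "B",
# and each must pick action 1 or 2. The payoffs reward an asymmetric pair of
# choices, so winning requires settling on a joint policy before either copy
# looks at its own observation.
OBSERVATIONS = ["A", "B"]
ACTIONS = [1, 2]

def joint_payoff(action_if_A, action_if_B):
    # Assumed payoff table for the toy problem.
    return {(1, 2): 10, (1, 1): 5, (2, 2): 5, (2, 1): 0}[(action_if_A, action_if_B)]

# Policy selection (the "1.1" move): iterate over whole mappings from
# observations to actions, score each mapping by what happens when every
# copy runs it, and commit to the best mapping.
best_policy = max(
    (dict(zip(OBSERVATIONS, acts)) for acts in product(ACTIONS, repeat=len(OBSERVATIONS))),
    key=lambda policy: joint_payoff(policy["A"], policy["B"]),
)
print(best_policy)  # {'A': 1, 'B': 2} -- the $10 outcome

# Action selection (the "1.0" move), by contrast, has each copy separately ask
# "what should I output, given my observation?", and the two separate
# optimizations aren't guaranteed to land on the same global mapping.
```

The point of the sketch is only that the winning move here is a property of the whole mapping from observations to actions, which is the coordination-shaped structure discussed above.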
The opening’s updated now to try to better hint at this, with: “Somewhere in a not-very-near neighboring world, where science took a very different course…”
Yeah, that article was originally an attempt to “essay-ify” an earlier draft of this very dialogue. But I don’t think the essay version succeeded at communicating the idea very well.
The dialogue is at least better, I think, if you have the relevant context (“MIRI is a math research group that works on AI safety and likes silly analogies”) and know what the dialogue is trying to do (“better pinpoint the way MIRI thinks of our current understanding of AGI alignment, and the way MIRI thinks of its research as relevant to improving our understanding, without trying to argue for those models”).
I hate to say this but I’m taking the side of the Spaceplane designers. Perhaps it’s because it’s what I know.
Three things I think it’s important to note explicitly here:
1. Eliezer’s essay above is just trying to state where he thinks humanity’s understanding of AI alignment is, and where he thinks it ultimately needs to be. The point of the fictional example is to make this view more concrete by explaining it in terms of concepts that we already understand well (rockets, calculus, etc.). None of this is an argument for Eliezer’s view “our understanding of AI alignment is relevantly analogous to the fictional rocket example”, just an attempt to be clearer about what the view even is.
2. “Don’t worry about developing calculus, questioning the geocentric model of the solar system, etc.” is the wrong decision in the fictional example Eliezer provided. You suggest, “once you start getting spaceplanes into orbit and notice that heading right for the moon isn’t making progress, you could probably get together some mathematicians and scrum together a rough model of orbital mechanics in time for the next launch”. I don’t think this is a realistic model of how basic research works. Possibly this is a crux between our models?
3. The value of the rocket analogy is that it describes a concrete “way the world could be” with respect to AI. Once this is added to the set of hypotheses under consideration, the important thing is to try to assess the evidence for which possible world we’re in. “I choose to act as though this other hypothesis is true because it’s what I know” should set off alarm bells in that context, as should any impulse to take the side of Team Don’t-Try-To-Understand-Calculus in the contrived fictional example, because this suggests that your models and choices might be insensitive to whether you’re actually in the kind of world where you’re missing an important tool like calculus.
It’s 100% fine to disagree about whether we are in fact in that world, but any indication that we should unconditionally act as though we’re not in that world—e.g., for reasons other than Bayesian evidence about our environment, or for reasons so strong they’re insensitive even to things as important as “we’re trying to get to the Moon and we haven’t figured out calculus yet” -- should set off major alarms.
And making a spaceplane so powerful it wrecks the planet if it crashes into it, when you don’t know what you are doing...seems implausible to me.
Eliezer means the rocket analogy to illustrate his views on ‘how well do we understand AI alignment, and what kind of understanding is missing?‘, not ‘how big a deal is it if we mess up?’ AI systems aren’t rockets, so there’s no reason to extend the analogy further. (If we do want to compare flying machines and scientific-reasoning machines on this dimension, I’d call it relevant that flying organs have evolved many times in Nature, and never become globally dominant; whereas scientific-reasoning organs evolved just once, and took over the world very quickly.)
A relevant argument that’s nearby in conceptspace is ‘technologies are rarely that impactful, full stop; so we should have a strong prior that AGI won’t be that impactful either’.
I agree we can make an AI that powerful but I think we would need to know what we are doing. Nobody made fission bombs work by slamming radioactive rocks together, it took a set of millions of deliberate actions in a row, by an army of people, to get to the first nuclear weapon.
Eliezer doesn’t mean to argue that we’ll get to AGI by pure brute force, just more brute force than is needed for safety / robustness / precise targeting. “Build a system that’s really good at scientific reasoning, and only solves the kinds of problems we want it to” is a much more constrained problem than “Build a system that’s really good at scientific reasoning”, and it’s generally hard to achieve much robustness / predictability / deep understanding of very novel software, even when that software isn’t as complex or opaque as a deep net.
It sounds to me like key disagreements might include “how much better at science are the first AGI systems built for science likely to be, compared to humans (who weren’t evolved to do science at all, but stumbled into being capable of it)?” and “how many developers are likely to have the insights and other resources needed to design/train/deploy AGI in the first few years?” Your view makes more sense in my head when I imagine a world where AGI yields smaller capability gains, and where there aren’t a bunch of major players who can all deploy AGI within a few years of each other.
Uncontrolled argues along similar lines—that the physics/chemistry model of science, where we get to generalize a compact universal theory from a number of small experiments, is simply not applicable to biology/psychology/sociology/economics and that policy-makers should instead rely more on widespread, continuous experiments in real environments to generate many localized partial theories.
I’ll note that (non-extreme) versions of this position are consistent with ideas like “it’s possible to build non-opaque AGI systems.” The full answer to “how do birds work?” is incredibly complex, hard to formalize, and dependent on surprisingly detailed local conditions that need to be discovered empirically. But you don’t need to understand much of that complexity at all to build flying machines with superavian speed or carrying capacity, or to come up with useful theory and metrics for evaluating “goodness of flying” for various practical purposes; and the resultant machines can be a lot simpler and more reliable than a bird, rather than being “different from birds but equally opaque in their own alien way”.
This isn’t meant to be a response to the entire “rationality non-realism” suite of ideas, or a strong argument that AGI developers can steer toward less opaque systems than AlphaZero; it’s just me noting a particular distinction that I particularly care about.
The relevant realism-v.-antirealism disagreement won’t be about “can machines serve particular functions more transparently than biological organs that happen to serve a similar function (alongside many other functions)?”. In terms of the airplane analogy, I expect disagreements like “how much can marginal effort today increase transparency once we learn how to build airplanes?”, “how much useful understanding are we currently missing about how airplanes work?”, and “how much of that understanding will we develop by default on the path toward building airplanes?”.
It may be worth emphasizing that “plausible ranges of moral weight” are likely to get a lot wider when we move from classical utilitarianism to other reasonably-plausible moral theories (even before we try to take moral uncertainty into account).
This is an interesting point I plausibly haven’t noticed / thought about enough!
I agree with this, and I agree with Luke that non-human animals could plausibly have much higher (or much lower) moral weight than humans, if they turned out to be moral patients at all.
People have been using CEV to refer to both “Personal CEV” and “Global CEV” for a long time—e.g., in the 2013 MIRI paper “Ideal Advisor Theories and Personal CEV.”
I don’t know of any cases of Eliezer using “CEV” in a way that’s clearly inclusive of “Personal” CEV; he generally seems to be building into the notion of “coherence” the idea of coherence between different people. On the other hand, it seems a bit arbitrary to say that something should count as CEV if two human beings are involved, but shouldn’t count as CEV if one human being is involved, given that human individuals aren’t perfectly rational, integrated, unitary agents. (And if two humans is too few, it’s hard to say how many humans should be required before it’s “really” CEV.)
Eliezer’s original CEV paper did on one occasion use “coherence” to refer to intra-agent conflicts:
When people know enough, are smart enough, experienced enough, wise enough, that their volitions are not so incoherent with their decisions, their direct vote could determine their volition. If you look closely at the reason why direct voting is a bad idea, it’s that people’s decisions are incoherent with their volitions.
See also Eliezer’s CEV Arbital article:
Helping people with incoherent preferences
What if somebody believes themselves to prefer onions to pineapple on their pizza, prefer pineapple to mushrooms, and prefer mushrooms to onions? In the sense that, offered any two slices from this set, they would pick according to the given ordering?
(This isn’t an unrealistic example. Numerous experiments in behavioral economics demonstrate exactly this sort of circular preference. For instance, you can arrange 3 items such that each pair of them brings a different salient quality into focus for comparison.)
One may worry that we couldn’t ‘coherently extrapolate the volition’ of somebody with these pizza preferences, since these local choices obviously aren’t consistent with any coherent utility function. But how could we help somebody with a pizza preference like this?
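As a small aside, the “aren’t consistent with any coherent utility function” point in the quoted passage is easy to verify mechanically. Here is a tiny sketch (my own illustration, not from the Arbital article) that checks every strict ranking of the three toppings against the circular choices:

```python
from itertools import permutations

# The assumed circular preference from the quoted example:
# onion beats pineapple, pineapple beats mushroom, mushroom beats onion.
pairwise_choices = [("onion", "pineapple"),
                    ("pineapple", "mushroom"),
                    ("mushroom", "onion")]  # (preferred, dispreferred)

toppings = ["onion", "pineapple", "mushroom"]

# Check every strict ranking of the toppings against the pairwise choices.
consistent = [order for order in permutations(toppings)
              if all(order.index(a) < order.index(b) for a, b in pairwise_choices)]

print(consistent)  # [] -- no ranking (hence no utility function) fits the cycle
```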
I think that absent more arguing about why this is a bad idea, I’ll probably go on using “CEV” to refer to several different things, mostly relying on context to make it clear which version of “CEV” I’m talking about, and using “Personal CEV” or “Global CEV” when it’s really essential to disambiguate.
“Evolution wasn’t trying to solve the robustness problem at all.”—Agreed that this makes the analogy weaker. And, to state the obvious, everyone doing safety work at MIRI and OpenAI agrees that there’s some way to do neglected-by-evolution engineering work that gets you safe+useful AGI, though they disagree about the kind and amount of work.
The docility analogy seems to be closely connected to important underlying disagreements.
Conversation also continues here.
I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what’s going on? Overall I’m pretty confused on this point.
I read Optimization Amplifies as Scott’s attempt to more explicitly articulate the core claim of Eliezer’s Security Mindset dialogues (1, 2). On this view, making software robust/secure to ordinary human optimization does demand the same kind of approach as making it robust/secure to superhuman optimization. The central disanalogy isn’t “robustness-to-humans requires X while robustness-to-superintelligence requires Y”, but rather “the costs of robustness/security failures tend to be much smaller in the human case than the superintelligence case”.
What Dagon said. Your advice makes sense if the main signal people receive is “this received one −5 vote, two −4 votes, one −1 vote, three +1 votes, and five +2 votes”, but not if people are just receiving a “net upvotes” summary number. By default, the aggregate effect of everyone trying to “vote according to what’s really in their heart” and disregard current vote totals is that either (a) lots of content gets absurdly, unwarrantedly high/low karma totals because people’s opinions are correlated, or (b) lots of content gets no upvotes or downvotes at all because people are trying to correct for the possibility that things will be over-voted (even though they can see with their own eyes whether a vote total is currently too high or too low).
Perhaps this is a reason to replace the “net upvotes” system with one that lists the number of votes (at different levels).
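For illustration only (using the made-up vote tally from the comment above, and not describing any existing LessWrong feature), the difference between the two displays is roughly:

```python
from collections import Counter

# The example votes from the comment above.
votes = [-5, -4, -4, -1, +1, +1, +1, +2, +2, +2, +2, +2]

net_total = sum(votes)        # the single "net upvotes" number readers see today
per_level = Counter(votes)    # the itemized, per-level display proposed here

print(net_total)              # -1
print(sorted(per_level.items()))
```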
If there’s nothing particularly bizarre or inconsistent-seeming about a situation, then I don’t think we should call that situation a “paradox”. E.g., “How did human language evolve?” is an interesting scientific question, but I wouldn’t label it “the language paradox” just because there’s lots of uncertainty spread over many different hypotheses.
I think it’s fine to say that the “Fermi paradox,” in the sense SDO mean, is a less interesting question than “why is the Fermi observation true in our world?”. Maybe some other term should be reserved for the latter problem, like “Great Filter problem”, “Fermi’s question” or “Great Silence problem”. (“Great Filter problem” seems like maybe the best candidate, except it might be too linked to the subquestion of how much the Filter lies in our past vs. our future.)
My instinct is often to upvote or downvote comments/posts based on how much karma I think they should display. E.g., maybe I think two comments by new users both deserve about 10 karma, but one is currently at 10 while the other is currently at 18. I might then strong-downvote the latter comment to bring it to 10, while ignoring the former comment. This is all well and good, except that under your system it would lead to two equally good comments conferring +9 karma on one new user and somewhere between −7 and −15 karma on another.
The ideal solution to this might be for me to try to retrain my voting habits rather than modify the system to accommodate them. This is harder if my voting habits are shared by others, though.
One option might be to weight downvotes more heavily the lower the post/comment’s karma was when the downvote occurred? I’m a lot more willing to downvote (and strong-downvote) something that currently has +70 karma than something that currently has +10 karma, because I’m likelier to think that the +70 is an overestimate and that lowering that total a bit is harmless. But that greater willingness means that my average downvote of a +70 post means a lot less than my average downvote of a +10 post.
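Here is a rough sketch of what such a weighting could look like. The discount curve, the baseline, and the floor are arbitrary choices of mine for illustration; nothing below describes how LessWrong karma actually works.

```python
def downvote_weight(current_karma: float, baseline: float = 10.0,
                    floor: float = 0.25) -> float:
    """Hypothetical rule: a downvote cast while the comment sits at or below
    `baseline` karma counts at full strength; downvotes cast at higher karma
    are discounted, since voters are more willing to shave points off a
    comment that already looks over-rewarded."""
    if current_karma <= baseline:
        return 1.0
    # Discount smoothly as the displayed karma climbs past the baseline,
    # never dropping below `floor`.
    return max(floor, baseline / current_karma)

# The same nominal -8 strong-downvote cast at +70 vs. at +10:
for karma in (70, 10):
    print(f"downvote cast at {karma:+d} karma counts as {-8 * downvote_weight(karma):.1f}")
```

Whether the discounted value should apply to the comment’s displayed total, to the author’s karma, or to both is a separate design question.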