People have been using CEV to refer to both “Personal CEV” and “Global CEV” for a long time—e.g., in the 2013 MIRI paper “Ideal Advisor Theories and Personal CEV.”
I don’t know of any cases of Eliezer using “CEV” in a way that’s clearly inclusive of “Personal” CEV; he generally seems to be building into the notion of “coherence” the idea of coherence between different people. On the other hand, it seems a bit arbitrary to say that something should count as CEV if two human beings are involved, but shouldn’t count as CEV if one human being is involved, given that human individuals aren’t perfectly rational, integrated, unitary agents. (And if two humans is too few, it’s hard to say how many humans should be required before it’s “really” CEV.)
Eliezer’s original CEV paper did on one occasion use “coherence” to refer to intra-agent conflicts:
When people know enough, are smart enough, experienced enough, wise enough, that their volitions are not so incoherent with their decisions, their direct vote could determine their volition. If you look closely at the reason why direct voting is a bad idea, it’s that people’s decisions are incoherent with their volitions.
See also Eliezer’s CEV Arbital article:
Helping people with incoherent preferences
What if somebody believes themselves to prefer onions to pineapple on their pizza, prefer pineapple to mushrooms, and prefer mushrooms to onions? In the sense that, offered any two slices from this set, they would pick according to the given ordering?
(This isn’t an unrealistic example. Numerous experiments in behavioral economics demonstrate exactly this sort of circular preference. For instance, you can arrange 3 items such that each pair of them brings a different salient quality into focus for comparison.)
One may worry that we couldn’t ‘coherently extrapolate the volition’ of somebody with these pizza preferences, since these local choices obviously aren’t consistent with any coherent utility function. But how could we help somebody with a pizza preference like this?
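(A quick illustration of the “no coherent utility function” point: the toy sketch below, my own rather than anything from the Arbital article, brute-forces every strict ranking of the three toppings and finds that none reproduces all three pairwise choices, so no utility assignment can either.)

```python
from itertools import permutations

# Pairwise choices from the example: offered any two slices, the person
# picks the first topping of each pair.
choices = [("onion", "pineapple"), ("pineapple", "mushroom"), ("mushroom", "onion")]
toppings = ["onion", "pineapple", "mushroom"]

def consistent_with(ranking):
    """True if every observed pairwise choice agrees with this strict ranking."""
    rank = {t: i for i, t in enumerate(ranking)}  # index 0 = most preferred
    return all(rank[a] < rank[b] for a, b in choices)

# No ordering of the toppings (and hence no utility function over them)
# fits all three choices at once.
print(any(consistent_with(r) for r in permutations(toppings)))  # -> False
```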
I think that absent more arguing about why this is a bad idea, I’ll probably go on using “CEV” to refer to several different things, mostly relying on context to make it clear which version of “CEV” I’m talking about, and using “Personal CEV” or “Global CEV” when it’s really essential to disambiguate.
“Evolution wasn’t trying to solve the robustness problem at all.”—Agreed that this makes the analogy weaker. And, to state the obvious, everyone doing safety work at MIRI and OpenAI agrees that there’s some way to do neglected-by-evolution engineering work that gets you safe+useful AGI, though they disagree about the kind and amount of work.
The docility analogy seems to be closely connected to important underlying disagreements.
Conversation also continues here.
I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what’s going on? Overall I’m pretty confused on this point.
I read Optimization Amplifies as Scott’s attempt to more explicitly articulate the core claim of Eliezer’s Security Mindset dialogues (1, 2). On this view, making software robust/secure to ordinary human optimization does demand the same kind of approach as making it robust/secure to superhuman optimization. The central disanalogy isn’t “robustness-to-humans requires X while robustness-to-superintelligence requires Y”, but rather “the costs of robustness/security failures tend to be much smaller in the human case than the superintelligence case”.
What Dagon said. Your advice makes sense if the main signal people receive is “this received one −5 vote, two −4 votes, one −1 vote, three +1 votes, and five +2 votes”, but not if people are just receiving a “net upvotes” summary number. By default, the aggregate effect of everyone trying to “vote according to what’s really in their heart” and disregard current vote totals is that either (a) lots of content gets absurdly, unwarrantedly high/low karma totals because people’s opinions are correlated, or (b) lots of content gets no upvotes or downvotes at all because people are trying to correct for the possibility that things will be over-voted (even though they can see with their own eyes whether a vote total is currently too high or too low).
Perhaps this is a reason to replace the “net upvotes” system with one that lists the number of votes (at different levels).
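(For concreteness, here’s a toy sketch, mine rather than anything implemented on LW, of the two displays for the vote pattern in the example above. Only the per-level display reveals that the votes were sharply contested.)

```python
from collections import Counter

# The vote pattern from the example above:
# one -5, two -4s, one -1, three +1s, five +2s.
votes = [-5, -4, -4, -1, +1, +1, +1, +2, +2, +2, +2, +2]

print(sum(votes))                       # "net upvotes" display: -1
print(sorted(Counter(votes).items()))   # per-level display:
# [(-5, 1), (-4, 2), (-1, 1), (1, 3), (2, 5)]
```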
If there’s nothing particularly bizarre or inconsistent-seeming about a situation, then I don’t think we should call that situation a “paradox”. E.g., “How did human language evolve?” is an interesting scientific question, but I wouldn’t label it “the language paradox” just because there’s lots of uncertainty spread over many different hypotheses.
I think it’s fine to say that the “Fermi paradox,” in the sense SDO mean, is a less interesting question than “why is the Fermi observation true in our world?”. Maybe some other term should be reserved for the latter problem, like “Great Filter problem”, “Fermi’s question” or “Great Silence problem”. (“Great Filter problem” seems like maybe the best candidate, except it might be too linked to the subquestion of how much the Filter lies in our past vs. our future.)
My instinct is often to upvote or downvote comments/posts based on how much karma I think they should display. E.g., maybe I think two comments by new users both deserve about 10 karma, but one is currently at 10 while the other is currently at 18. I might then strong-downvote the latter comment to bring it to 10, while ignoring the former comment. This is all well and good, except that under your system, it would lead to two equally good comments conferring +9 karma on one new user and somewhere between −7 and −15 karma on another.
The ideal solution to this might be for me to try to retrain my voting habits rather than modify the system to accommodate them. This is harder if my voting habits are shared by others, though.
One option might be to weight downvotes more heavily the lower the post/comment’s karma was when the downvote occurred? I’m a lot more willing to downvote (and strong-downvote) something that currently has +70 karma than something that currently has +10 karma, because I’m likelier to think that the +70 is an overestimate and that lowering that total a bit is harmless. But that greater willingness means that my average downvote of a +70 post means a lot less than my average downvote of a +10 post.
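(A minimal sketch of one way such a weighting could work; this is hypothetical toy code of my own, not a feature of any existing karma system, and the pivot constant is arbitrary.)

```python
def downvote_weight(current_karma, base_weight=1.0, pivot=30.0):
    """Scale a downvote's effect down as the target's displayed karma grows.

    A downvote on a +10 post counts at near-full weight; a downvote on a
    +70 post counts for much less, matching the intuition that votes on
    high-karma posts are partly corrections of an inflated total rather
    than judgments of underlying quality.
    """
    return base_weight * pivot / (pivot + max(current_karma, 0.0))

print(round(downvote_weight(10), 2))  # 0.75: close to full weight
print(round(downvote_weight(70), 2))  # 0.3: heavily discounted
```

The hyperbolic falloff here is just one choice; the only property that matters for the argument is that a downvote cast against a +10 total moves the score more than one cast against a +70 total.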
I generally like the hand-written style and would like to see more of it. I’m guessing that style was net-positive for me here (and made me a lot more likely to read the whole thing), though I did experience some reading fatigue about two-thirds of the way through this post.
When Scott says “mathematician mindset can be useful for AI alignment”, I take it that your interpretation is “we should try to make sure that when we build AGI, we can prove that our system is safe/robust/secure”, whereas I think the intended interpretation is “we should try to make sure that when we build AGI, we have a deep formal understanding of how this kind of system works at all, so that we’re not flying blind”. This is similar to how we understand the mathematics of how rockets work in principle; if we somehow found a way to build a rocket without that understanding, it’s very unlikely we’d be able to achieve much confidence in the system’s behavior.
I think the end of this excerpt from a 2000 Bruce Schneier piece is assuming something like this, though I don’t know that Schneier would agree with Eliezer and Scott fully:
Complexity is the worst enemy of security. [...]
The first reason is the number of security bugs. All software contains bugs. And as the complexity of the software goes up, the number of bugs goes up. And a percentage of these bugs will affect security.
The second reason is the modularity of complex systems. [...I]ncreased modularity means increased security flaws, because security often fails where two modules interact. [...]
The third reason is the increased testing requirements for complex systems. [...]
The fourth reason is that the more complex a system is, the harder it is to understand. There are all sorts of vulnerability points — human-computer interface, system interactions — that become much larger when you can’t keep the entire system in your head.
The fifth (and final) reason is the difficulty of analysis. The more complex a system is, the harder it is to do this kind of analysis. Everything is more complicated: the specification, the design, the implementation, the use. And as we’ve seen again and again, everything is relevant to security analysis.
Cf. this thing I said a few months ago:
“Adding conceptual clarity” is a key motivation, but formal verification isn’t.
The point of things like logical induction isn’t “we can use the logical induction criterion to verify that the system isn’t making reasoning errors”; as I understand it, it’s more “logical induction helps move us toward a better understanding of what good reasoning is, with a goal of ensuring developers aren’t flying blind when they’re actually building good reasoners”.
Daniel Dewey’s summary of the motivation behind HRAD is:
“2) If we fundamentally ‘don’t know what we’re doing’ because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
“3) Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.”
To which Nate replied at the time:
“I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as ‘There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.’ I’d compare these mistakes to the ‘small’ decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
“I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running.”
The position of the AI community is something like the position researchers would be in if they wanted to build a space rocket, but hadn’t developed calculus or orbital mechanics yet. Maybe with enough trial and error (and explosives) you’ll eventually be able to get a payload off the planet that way, but if you want things to actually work correctly on the first go, you’ll need to do some basic research to cover core gaps in what you know.
To say that calculus or orbital mechanics help you “formally verify” that the system’s parts are going to work correctly is missing where the main benefit lies, which is in knowing what you’re doing at all, not in being able to machine-verify everything you’d like to.
Scott can correct me if I’m misunderstanding his post (e.g., rounding it off too much to what’s already in my head).
See Carl Shulman’s Risk-neutral donors should plan to make bets at the margin at least as well as giga-donors in expectation for a more thorough discussion of a few of these points (though the examples Carl cites to support his conclusion look more like “provide very early funding to new organizations” than like Ben’s particular description).
I’m not sure I understand exactly what Ben’s proposing, and I posted Ben’s view here as a discussion-starter (because I want to see it evaluated), rather than as an endorsement.
(I should also note explicitly that I’m not writing this on MIRI’s behalf or trying to make any statement about MIRI’s current room for more funding; and I should mention that Open Phil is MIRI’s largest contributor.)
But if I had said something like what Ben said, the version of the claim I’d be making is:
The primary goal is still to maximize long-term, large-scale welfare, not to improve your friends’ lives as an end in itself. But if your friends are in the EA community, or in some other community that tends to do really important high-value things, then personal financial constraints will overlap a lot with “constraints on my ability to start a new high-altruistic-value project”, “constraints on my ability to take 3 months off work to think about what new high-value projects I could start in the future”, etc.
These personal constraints are often tougher to evaluate for bigger donors like Jaan Tallinn or Dustin Moskovitz (and the organizations they use to disburse funds, like BERI and the Open Philanthropy Project), awkward for those unusually heavily scrutinized donors to justify to onlookers, or demanding of too much evaluation time given opportunity costs. The funding gaps tend to be too small to be worth the time of bigger donors, while smaller donors are in a great position to cover these gaps, particularly if they’re gaps affecting high-impact individuals the donor already knows really well.
Larger donors are in a great position to help provide large, stable long-term support to well-established projects; I take Ben to be arguing that the role of smaller donors should largely be to add enough slack to the system that high-altruistic-impact people can afford to do the early-stage work (brainstorming, experimenting with uncertain new ideas, taking time off to skill-build or retrain for a new kind of work, etc.) that will then sometimes spit out a well-established project later in the pipeline.
I take Paul Christiano’s recent experiments with impact purchases, prizes, and researcher funding to be a special case of this approach to giving: rather than trying to find a well-established project to support, try to address value that’s being lost early in the pipeline, by paying individuals to start new projects or by just giving no-strings donations to people who have a proven track record of doing really valuable things.
One effect of this is that you’re incentivizing the good accomplishments/behaviors you’re basing your donation decision on. A separate effect can be that you’re removing constraints from people who find high-value projects inherently motivating and would spend time on them by default if they could; someone who’s already sufficiently motivated by altruistic impact and doesn’t need extra financial incentive may still be cash-constrained in what useful things they can spend their time on (or pay others to do, etc.).
This approach does introduce risk of bias. In principle, though, you can try to mitigate bias for this category of decision in the same way you’d try to mitigate bias for a direct donation to a philanthropic organization. E.g., ask third parties to check your reasoning, deliberately ignore opportunities where you’re wary of your own motivations, or simply give the money to someone you trust a lot to do the donating on your behalf.
Quoting the specific definitions in the Arbital article for orthogonality, in case people haven’t seen that page (bold added):
The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.
The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal. [...]
This contrasts to inevitablist theses which might assert, for example:
“It doesn’t matter what kind of AI you build, it will turn out to only pursue its own survival as a final end.”
“Even if you tried to make an AI optimize for paperclips, it would reflect on those goals, reject them as being stupid, and embrace a goal of valuing all sapient life.” [...]
Orthogonality does not require that all agent designs be equally compatible with all goals. E.g., the agent architecture AIXI-tl can only be formulated to care about direct functions of its sensory data, like a reward signal; it would not be easy to rejigger the AIXI architecture to care about creating massive diamonds in the environment (let alone any more complicated environmental goals). The Orthogonality Thesis states “there exists at least one possible agent such that...” over the whole design space; it’s not meant to be true of every particular agent architecture and every way of constructing agents. [...]
The weak form of the Orthogonality Thesis says, “Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal.”
The strong form of Orthogonality says, “And this agent doesn’t need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal.” [...]
This could be restated as, “To whatever extent you (or a superintelligent version of you) could figure out how to get a high-U outcome if aliens offered to pay you a huge amount of resources to do it, the corresponding agent that terminally prefers high-U outcomes can be at least that good at achieving U.” This assertion would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn’t have any such defects. [...]
For purposes of stating Orthogonality’s precondition, the “tractability” of the computational problem of U-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing “How can I make sure that I go on pursuing U?” or “What kind of successor agent would want to pursue U?” whenever U is something other than “be nice to all sapient life”, then these new difficulties contradict the intuitive claim of Orthogonality. Orthogonality is meant to be empirically-true-in-practice, not true-by-definition because of how we sneakily defined “optimization problem” in the setup.
Orthogonality is not literally, absolutely universal because theoretically ‘goals’ can include such weird constructions as “Make paperclips for some terminal reason other than valuing paperclips” and similar such statements that require cognitive algorithms and not just results. To the extent that goals don’t single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.
Personally, I like not being able to tell how many downvotes things have gotten. On the old LW, I frequently checked the percent up/down that comments and posts got, and it primed me a lot more to feel defensive or like I was in an adversarial environment. The triggered emotion is something like ‘Oh, this awesome thing inexplicably got 20% downvotes; I need to be on the lookout for bad people to push/strike back against.’
Eliezer mostly talks about the idea that ‘No literal lies’ isn’t morally necessary, but I take it from the “your sentences never provided Bayesian evidence in the wrong direction” goal that he also wouldn’t consider this morally sufficient.
I think that’s wrong.
Yeah, I agree with that; my above suggestion is taking into account that this is a likely case of overconcern.
1) Necessary: it’s unclear that a technical solution to alignment would be sufficient, since our current social institutions are not designed for superintelligent actors, and we might not develop effective new ones quickly enough
This sounds weaker to me than what I usually think of as a “necessary and sufficient” condition.
My view is more or less the one Eliezer points to here:
The big big problem is, “Nobody knows how to make the nice AI.” You ask people how to do it, they either don’t give you any answers or they give you answers that I can shoot down in 30 seconds as a result of having worked in this field for longer than five minutes.
It doesn’t matter how good their intentions are. It doesn’t matter if they don’t want to enact a Hollywood movie plot. They don’t know how to do it. Nobody knows how to do it. There’s no point in even talking about the arms race if the arms race is between a set of unfriendly AIs with no friendly AI in the mix.
And the one in the background when he says a competitive AGI project can’t deal with large slowdowns:
Because I don’t think you can get the latter degree of advantage over other AGI projects elsewhere in the world. Unless you are postulating massive global perfect surveillance schemes that don’t wreck humanity’s future, carried out by hyper-competent, hyper-trustworthy great powers with a deep commitment to cosmopolitan value — very unlike the observed characteristics of present great powers, and going unopposed by any other major government.
I would say that actually solving the technical problem clearly is necessary for good outcomes, whereas strong pre-AGI global coordination is helpful but not necessary. And the scenario where a leading AI company just builds sufficiently aligned AGI, runs it, and saves the world doesn’t strike me as particularly implausible, relative to other ‘things turn out alright’ outcomes; whereas the scenario where world leaders like Trump, Putin, and Xi Jinping usher in a permanent otherwise-utopian AGI-free world government does strike me as much crazier than the ten or hundred likeliest ‘things turn out alright’ scenarios.
In general, better coordination reduces the difficulty of the relevant technical challenges, and technical progress reduces the difficulty of the relevant coordination challenges; so both are worth pursuing. I do think that (e.g.) reducing x-risk by 5% with coordination work is likely to be much more difficult than reducing it by 5% with technical work, and I think the necessity and sufficiency arguments are much weaker for ‘just try to get everyone to be friends’ approaches than for ‘just try to figure out how to build this kind of machine’ approaches.
I would have said that strong global coordination before we get to AGI isn’t necessary. I’d also have said that strong global coordination without an alignment solution is insufficient, given that it’s not realistic to shoot for levels of coordination like “let’s just never build AGI”. (My model of Nate would also add here that never building AGI would mean losing an incredible amount of cosmopolitan value, enough to count as an existential catastrophe in its own right.)
Maybe we could start with you saying why you think it’s necessary and sufficient? That might give me a better understanding of what you have in mind by “institution-oriented work”.
I also agree that one should consider tradeoffs, sometimes. But every time someone has raised this concern to me (I think it’s been 3x?) I think it’s been a clear cut case of “why are you even worrying about that”, which leads me to believe that there are a lot of people who are overconcerned about this.
I wouldn’t be at all surprised if lots of people are overconcerned about this. Many people are also underconcerned, though. I feel better about public advice that encourages people to test their models of the size of relevant drops and relevant buckets, rather than just trying to correct for a bias some people have in a particular direction (which makes overcorrection easy).
If you’re able to contribute equally to technical safety work and institution-oriented work, my own advice would generally be to prioritize technical work. I agree with capybarelet, though, that safety researchers should be willing to do work that might synergize with capabilities research, where the tradeoff looks worth it.
On the other hand, I think “don’t worry about how your research (or other actions) will impact AGI timelines or development trajectories, because whatever you’re doing is probably a drop in the bucket” is a bad meme to propagate. Some of the buckets that matter aren’t that large, and the drops may be much larger for some of the researchers who are particularly adept at making safety breakthroughs. (And public advice should plausibly be skewed toward those people, since most of the expected impact of advice may come from its influence on large-drop people.)
I agree that motte-and-bailey would be a much better meme in a world where it was applied to individuals rather than groups 90+% of the time. I think it would still be a pretty bad meme in that world, for reasons related to Hazard’s comment and my pro-hypocrisy stance. I also think it’s plausibly a lot harder to stop people from applying “motte and bailey” to groups than to discourage the “motte and bailey” framing altogether.
I agree with this. (My view of ‘motte and bailey’: 1, 2, 3.)
If I’m understanding you correctly, and your point is “Toolbox thinking and lawful thinking are metatools in metatoolboxes, and should be used accordingly”, then you actually are arguing that toolbox reasoning is the universally best context-insensitive metaway to think.
Eliezer’s argument in this post is that “toolbox reasoning is the best way to think” is ambiguous between at least three interpretations:
(a) Humans shouldn’t try to base all their daily decisions on a single simple explicit algorithm.
(b) Humans should never try to think in terms of simple, all-encompassing, unconditional, exceptionless rules and patterns, or should only do so when there’s minimal risk of mistaking that rule for a simple-algorithm-you-can-base-every-decision-on.
(c) Humans should rarely try to think in terms of such rules. It’s useful sometimes, but only in weird exceptional cases.
Your point is that (a) is true, and that toolbox thinking therefore “wins”. But this depends on which interpretation we use for “toolbox thinking” — which is a question that doesn’t matter and has no right answer anyway, because “toolbox thinking” is just a phrase Eliezer made up to gesture at a possible miscommunication/confusion, and doesn’t have an established meaning.
Eliezer’s claim, if I understand him right, is that (a) is clearly true, (b) is clearly false, and (c) is very probably false. (c) is the more interesting version of the claim, and the hardest to quickly resolve, since terms like “rarely” are themselves vague and need more operationalization. But a fair number of people do reject something like (a), and a fair number of people do endorse something like (b), so we need to address those views in some way, while being careful not to weak-man people who have more credible and nuanced positions.