Also known as Raelifin: https://www.lesswrong.com/users/raelifin
Max Harms
Here are some thoughts about the recent back-and-forth where Will MacAskill reviewed IABI and Rob Bensinger wrote a reply and Will replied back. I’m making this a quick take instead of a full post because it gets kinda inside baseball/navelgazy and I want to be more chill about that than I would be in a full writeup.
First of all, I want to say thank you to Will for reviewing IABI. I got a lot out of the mini-review, and broadly like it, even if I disagree on the bottom-line and some of the arguments. It helped me think deep thoughts.
The evolution analogy
I agree with Will that the evolution analogy is useful and informative in some ways, but of limited value. It’s imperfect, and thinking hard about the differences is good.
The most basic disanalogy is that evolution wasn’t trying, in any meaningful sense, to produce beings that maximise inclusive genetic fitness in off-distribution environments.
I agree with this, and I appreciate MacAskill talking about it.
But we will be doing the equivalent of that!
One of the big things that I think distinguishes more doomy people like me from less doomy people like Will is our priors on how incompetent people are. Like, I agree that it’s possible to carefully train an ML system to be (somewhat) robust to distributional shifts. But will we actually do that?
I think, at minimum, any plan to build AGI (to say nothing of ASI) should involve:
Trying to give the AI a clear, (relatively) simple goal that is well understood by humans and which we have strong reason to expect won’t cause a catastrophe if faithfully pursued, even if it gets wildly more power than expected and goes a little off-the-rails.
Training that AI in a wide variety of environments, trying to hit the real goal, rather than proxies. This should involve adversarial environments and a lot of paranoia that the AI has failed to robustly internalize the true goal.
Testing the AI on entirely new environments that were a-priori suspected of being difficult and weird, but where we also think there is a true answer that can be checked, restarting from scratch if the AI fails to generalize to the test set, or otherwise clearly demonstrates that it has not internalized the desired goal in a robust way.
And I personally think pure corrigibility has a nonzero chance of being a good choice of goal, and that a sufficiently paranoid training regime has a nonzero chance of being able to make a semi-safe AGI this way, even with current techniques. (That said, I don’t actually advocate for plans that have a significant chance of killing everyone, and I think “try to build corrigible AGI” does have a significant chance of killing everyone; I just notice that it seems better than what the research community currently seems to be doing, even at Anthropic.)
I predict the frontier lab that builds the first AGI will not be heavily focused on ensuring robustness to distributional shifts. We could bet, maybe.
Types of misalignment
I really benefited from this! Will changed my mind! My initial reaction to Will’s mini-review was like, “Will is wrong that these are distinct concepts; any machine sufficiently powerful to have a genuine opportunity to disempower people but which is also imperfectly aligned will produce a catastrophe.”
And then I realized that I was wrong. I think. Like, what if Will is (secretly?) gesturing at the corrigibility attractor basin or perhaps the abstracted/generalized pattern of which corrigibility is an instance? (I don’t know of other goals which have the same dynamic, but maybe it’s not just corrigibility?)
An agent which is pseudo-corrigible, and lives inside the attractor basin, is imperfectly aligned (hence the pseudo) but if it’s sufficiently close to corrigible it seems reasonable to me that it won’t disempower humanity, even if given the opportunity (at least, not in every instance it gets the opportunity). So at the very least, corrigibility (one of my primary areas of research!) is (probably) an instance of Will being right (and my past self being wrong), and the distinction between his “types of misalignment” is indeed a vital one.
I feel pretty embarrassed by this, so I guess I just wanna say oops/sorry/thanks.
If I set aside my actual beliefs and imagine that we’re going to naturally land in the corrigibility attractor basin by default, I feel like I have a better sense of some of the gradualism hope. Like, my sense is that going from pseudo-corrigible to perfectly corrigible is fraught, but can be done with slow, careful iteration. Maybe Clara Collier and other gradualists think we’re going to naturally land in the corrigibility attractor basin, and that the gradual work is the analogue of the paranoid iteration that I conceive as being the obvious next-step?
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
(Also on the topic of gradualism and the notion of having “only one try” I want to gesture at the part of IABI where it says (paraphrased from memory, sorry): if you have a clever scheme for getting multiple tries, you still only get one try at getting that scheme to work.)
appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)
I agree that “value space is big” is not a good argument, in isolation, for how likely it is for our creations to be aligned. The other half of the pincer is “our optimization pressure towards aligned goals is weak,” and without that the argument falls apart.
(Maybe we won’t be able to make deals with AIs? I agree that’s a worry; but then the right response is to make sure that we can. Won’t the superintelligence have essentially a 100% chance of taking over, if it wanted to? But that’s again invoking the “discontinuous jump to godlike capabilities” idea, which I don’t think is what we’ll get).
Here’s a plan for getting a good future:
Build ASI slowly, such that there’s some hope of being able to understand the first AI capable of a pivotal act.
The AI will want weird, alien stuff, but we’ll make sure that it’s the kind of mind that would prefer getting 50% of the stars with 99% probability to getting 100% of the stars with 70% probability.
Since we’re going slowly, we still have a 30% chance to stop it if we wanted.
We tell the AI “we’ll let you do a pivotal act and escape our clutches if you agree to only eat 50% of the stars, and help us get the other 50% as though you were aligned”
Our interpretability techniques are so good that we know whether it’s lying or whether it’s honorable and will actually give us half the stars.
Because we’re so good at reading its advanced, alien mind, it knows it needs to be honorable with us, and so is actually honorable.
The AI says “Yep, will do.”
We see it’s telling the truth when we check.
We set it free.
It gives us a good future.
I think this plan is bad because it fails the heuristic of “don’t summon demons and try to cleverly bargain with them,” but perhaps I’m being unfair.
My main criticism with “make deals with the AIs” is that it seems complex and brittle and like it depends heavily on a level of being able to read the machine’s mind that we definitely don’t currently have and might never have.
That said, I do think there’s a lot of value in being the sorts of people/groups that can make deals and be credible trade partners. Efforts to be more trustworthy and honorable and so on seem great.
suppose that all the first superintelligence terminally values is paperclips. But it’s risk-averse, in the sense that it prefers a guarantee of N resources over a 50/50 chance of 0 or 2N resources; let’s say it’s more risk-averse than the typical human being.
On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
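As a minimal illustration of that distinction (the square-root utility function and the amount N below are my own made-up assumptions, not anything from Will’s text): a plain expected-utility maximizer with diminishing returns already prefers the guaranteed N, with no VNM-violating risk premium required.

```python
# Sketch: diminishing returns alone produce the preference described above,
# with no intrinsic aversion to risk. The sqrt utility and N = 100 are
# illustrative assumptions.

def utility(resources: float) -> float:
    return resources ** 0.5  # concave => diminishing returns on resources

N = 100.0
sure_thing = utility(N)                              # u(N) = 10.0
gamble = 0.5 * utility(0.0) + 0.5 * utility(2 * N)   # ~7.07

print(f"guaranteed N:        {sure_thing:.2f}")
print(f"50/50 gamble 0 / 2N: {gamble:.2f}")
# Expected *resources* are identical (N in both cases); the preference for
# the guarantee comes purely from the concavity of the utility function.
```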
it would strongly prefer to cooperate with humans in exchange for, say, a guaranteed salary, rather than to take a risky gamble of either taking over the world or getting caught and shut off.
Rob wrote some counterpoints to this, but I just want to harp on it a little. Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
I can imagine a misaligned AI maybe making a deal with humans who let it out of the box in exchange for some small fraction of the cosmos (and honoring the deal; again, the hard part is that it has to know we can tell if it’s lying, and we probably can’t).
I can’t really imagine an AI that has a clear shot at taking over the world making a deal to be a meek little salary worker, even if there are risks in trying to take over. Taking over the world means, in addition to other things, being sure you won’t get shut off or replaced by some other AI or whatever.
(Though I can certainly imagine a misaligned AI convincing people (and possibly parts of itself) that it is willing to make a deal like that, even as it quietly accumulates more power.)
Their proposal
Now we’re getting into the source of the infighting, I think (just plain fighting? I think of Will as being part of my ingroup, but idk if he feels the same; Rob definitely is part of my ingroup; are they part of each other’s ingroups? Where is the line between infighting and just plain fighting?). Will seems very keen on criticizing MIRI’s “SHUT IT DOWN YOU FOOLS” strategy — mostly, it seems to me, because he sees this approach as insufficiently supportive of strategies besides shutting things down.
When Rob shared his draft of his reply to Will, I definitely noticed that it seemed like he was not responding accurately to the position that I saw in Will’s tweet. Unfortunately, I was aware that there is something of a history between Will and MIRI and I incorrectly assumed that Rob was importing true knowledge of Will’s position that I simply wasn’t aware of. I warned him that I thought he was being too aggressive, writing “I expect that some readers will be like ‘whoa why is MIRI acting like this guy is this extremist—I don’t see evidence of that and bet they’re strawmanning him’.” But I didn’t actually push back hard, and that’s on me. Apologies to Will.
(Rob reviewed a draft of this post and adds his own apologies for misunderstanding Will’s view. He adds: “My thanks to Max and multiple other MIRI people for pushing back on that part of my draft. I made some revisions in response, though they obviously weren’t sufficient!”)
I’m very glad to see in Will’s follow-up:
“I definitely think it will be extremely valuable to have the option to slow down AI development in the future,” as well as “the current situation is f-ing crazy”
I wish this had been more prominent in his mini-review, but :shrug:
I think Will and I probably agree that funding a bunch of efforts to research alignment, interpretability, etc. would be good. I’m an AI safety/alignment researcher, and I obviously do my day-to-day work with a sense that it’s valuable and a sense that more effort would also be valuable. I’ve heard multiple people (whom I respect and think are doing good work) complain that Eliezer is critical/dismissive of their work, and I wish Eliezer was more supportive of that work (while also still saying “this won’t be sufficient” if that’s what he believes, and somehow threading that needle).
I am pretty worried about false hope, though. I’m worried that people will take “there are a bunch of optimistic researchers working hard on this problem” as a sign that we don’t need to take drastic action. I think we see a bunch of this already and researchers like myself have a duty to shout “PLEASE DON’T RISK EVERYTHING! I DON’T GOT THIS!”[1] even while pursuing the least-doomed alignment strategies they know of. (I tried to thread this needle in my corrigibility research.)
Anyway, I think I basically agree with Will’s clarified position that a “kitchen-sink approach” is best, including a lot of research, as long as actually shutting down advanced training runs and pure capabilities research is in the kitchen sink. I feel worried that Will isn’t actually pushing for that in a way that I think is important (not building “It” is the safest intervention I’m aware of), but I’m also worried about my allies (people who basically agree that AI is unacceptably dangerous and that we need to take action) being unable to put forward a collective effort without devolving into squabbling about tone and strawmanning each other. :(
Anyway. Thank you again to Will and Rob. I thought both pieces were worth reading.
[1] (Not to say that we should necessarily risk everything if alignment researchers do feel like they’ve “got this.” That’s a question worth debating in its own right. Also, it’s obviously worth noting that work that is incrementally useful but clearly insufficient to solve the whole problem can still be valuable, and the researcher is still allowed to say “I got this” on their little, local problems. (And they’re definitely allowed to speak up if they actually do solve the whole damn problem, of course. But they better have actually solved it!))
I think VNM is important and underrated and CAST is compatible with it. Not sure exactly what you’re asking, but hopefully that answers it. Search “VNM” on the post where I respond to existing work for more of my thoughts on the topic.
My read on what @PeterMcCluskey is trying to say: “Max’s work seems important and relevant to the question of how hard corrigibility is to get. He outlined a vision of corrigibility that, in the absence of other top-level goals, may be possible to truly instill in agents via prosaic methods, thanks to the notion of an attractor basin in goal space. That sense of possibility stands in stark opposition to the normal MIRI party-line of anti-naturality making things doomed. He also pointed out that corrigibility is likely to be a natural concept, and made significant progress in describing it. Why is this being ignored?”
If I’m right about what Peter is saying, then I basically agree. I would not characterize it as “an engineering problem” (which is too reductive) but I would agree there are reasons to believe that it may be possible to achieve a corrigible agent without a major theoretical breakthrough. (If (1) I’m broadly right, (2) anti-naturality isn’t as strong as the attractor basin in practice, and (3) I’m not missing any big complications, which is a big set of ifs that I would not bet my career on, much less the world.)
I think Nate and Eliezer don’t talk about my work out of a combination of having been very busy with the book and not finding my writing/argumentation compelling enough to update them away from their beliefs about how doomed things are because of the anti-naturality property.
I think @StanislavKrym and @Lucius Bushnaq are pointing out that I think building corrigible agents is hard and risky, and that we have a lot to learn and probably shouldn’t be taking huge risks of building powerful AIs. This is indeed my position, and does not feel contrary to or solidly addressing Peter’s points.
Lucius and @Mikhail Samin bring up anti-naturality. I wrote about this at length in CAST and basically haven’t significantly updated, so I encourage people to follow Lucius’ link if they want to read my full breakdown there. But in short, I do not feel like I have a handle on whether the anti-naturality property is a stronger repulsor than the corrigibility basin is an attractor in practice. There are theoretical arguments that pseudo-corrigible agents will become fully corrigible and arguments that they will become incorrigible and I think we basically just have to test it and (if it favors attraction) hope that this generalizes to superintelligence. (Again, this is so risky that I would much rather we not be building ASI in general.) I do not see why Nate and Eliezer are so sure that anti-naturality will dominate, and this is, I think, the central issue of confidence that Peter is trying to point at.
(Aside: As I wrote in CAST, “anti-natural” is a godawful way of saying opposed-to-the-instrumentally-convergent-drives, since it doesn’t preclude anti-natural things being natural in various ways.)
Anyone who I mischaracterized is encouraged to correct me. :)
(Minor point: I agree we’re not on track, but I was trying to include in my statement the possibility that we change track.)
Agreed. Thanks for pointing out my failing, here. I think this is one of the places in my rebuttal where my anger turned into snark, and I regret that. Not sure if I should go back and edit...
Thank you for this response. I think it really helped me understand where you’re coming from, and it makes me happy. :)
I really like the line “their case is maybe plausible without it, but I just can’t see the argument that it’s certain.” I actually agree that IABIED fails to provide an argument that it’s certain that we’ll die if we build superintelligence. Predictions are hard, and even though I agree that some predictions are easier, there’s a lot of complexity and path-dependence and so on! My hope is that the book persuades people that ASI is extremely dangerous and worth taking action on, but I’d definitely raise an eyebrow at someone who did not have Eliezer-level confidence going in, but then did have that level of confidence after reading the book.
There’s a motte argument that says “Um actually the book just says we’ll die if we build ASI given the alignment techniques we currently have” but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there’s a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
(This is why it’s important that the world invests a whole bunch more in alignment research! (...in addition to trying to slow down capabilities research.))
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying “here is the argument for why it’s obvious that ASI will kill us all” and I hear them as saying “here is the argument for why ASI will kill us all” and so you’re docking them points when they fail to reach the high standard of “this is a watertight and irrefutable proof” and I’m not?
On a different subtopic, it seems clear to me that we think about the possibility of a misaligned ASI taking over the world pretty differently. My guess is that if we wanted to focus on syncing up our worldviews, that is where the juicy double-cruxes are. I’m not suggesting that we spend the time to actually do that—just noting the gap.
Thanks again for the response!
@Max H may have a different take than mine, and I’m curious for his input, but I find myself still thinking about serial operations versus parallel operations. Like, I don’t think it’s particularly important to the question of whether AIs will think faster to ask how many transistors operating in parallel will be needed to capture the equivalent information processing of a single neuron, but rather how many serial computations are needed. I see no reason it would take that many serial operations to capture a single spike, especially in the limit of e.g. specialized chips.
Contra Collier on IABIED
Yeah, sorry. I should’ve been more clear. I totally agree that there are ways in which brains are super inefficient and weak. I also agree that on restricted domains it’s possible for current AIs to sometimes reach comparable data efficiency.
Ah, I hadn’t thought about that misreading being a source of confusion. Thanks!
Sweet. Thanks for the thoughtful reply! Seems like we mostly agree.
I don’t have a good source on data efficiency, and it’s tagged in my brain as a combination of “a commonly believed thing” and “somewhat apparent in how many epochs of training on a statement it takes to internalize it combined with how weak LLMs are at in-context learning for things like novel board games” but neither of those is very solid and I would not be that surprised to learn that humans are not more data efficient than large transformers that can do similar levels of transfer learning or something. idk.
So it sounds like your issue is not with any of the facts (transistor speeds, neuron speeds, AIs being faster than humans), but rather with the idea that comparing clock speeds to how many times a neuron can spike in a second is a valid way to reason about whether AI will think faster than humans?
I’m curious what sort of argument you would make to a general audience to convey the idea that AIs will be able to think much faster than humans. Like, what do you think the valid version of the argument looks like?
IABI says: “Transistors, a basic building block of all computers, can switch on and off billions of times per second; unusually fast neurons, by contrast, spike only a hundred times per second. Even if it took 1,000 transistor operations to do the work of a single neural spike, and even if artificial intelligence was limited to modern hardware, that implies human-quality thinking could be emulated 10,000 times faster on a machine— to say nothing of what an AI could do with improved algorithms and improved hardware.”
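(For reference, here is the arithmetic the quoted passage is doing, spelled out with the book’s own round numbers; these are stipulated assumptions, not measurements.)

```python
# The book's back-of-the-envelope arithmetic, using its own round numbers.
transistor_switches_per_sec = 1e9   # "billions of times per second"
neuron_spikes_per_sec = 100         # "unusually fast neurons"
transistor_ops_per_spike = 1_000    # the book's generous overhead factor

spike_equivalents_per_sec = transistor_switches_per_sec / transistor_ops_per_spike
speedup = spike_equivalents_per_sec / neuron_spikes_per_sec
print(speedup)  # 10000.0 -- the "10,000 times faster" figure in the quote
```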
@EigenGender says “aahhhhh this is not how any of this works” and calls it an “egregious error”. Another poster says it’s “utterly false.”
(Relevant online resources text.)
(Potentially relevant LessWrong post.)
I am confused what the issue is, and it would be awesome if someone can explain it to me.
Where I’m coming from, for context:
We don’t know exactly what the relevant logical operations in the human brain are. The model of the brain that says there are binary spiking neurons with direct axon-to-dendrite connections (synapses), and that those connections are akin to floating-point numerical weights, is clearly a simplification, albeit a powerful one. (IIUC “neural nets” in computers discard the binary spikes and suggest another model where the spike rate is akin to a numerical value, which is the basic story behind “neuron activation” in a modern system. This simplification also seems powerful, though it is surely an oversimplification in some ways.)
My main issue with the source text is that it ignores what is possibly the greater bottleneck in processing speed, which is the time it takes to move information from one area to another. (If my model is right, one of the big advantages of a MoE architecture is that it reduces how much weight data has to be shuttled across the bus to and from the GPU, which can be a major bottleneck.) However, on this front I think nerves are still clearly inferior to wires? Even myelinated neurons have a typical conduction speed of only about 100 m/s, while information flows across wires at >50% the speed of light.
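To put rough numbers on that last comparison (both figures are commonly cited ballpark values I’m assuming here, not numbers from the book):

```python
# Ballpark signal-propagation comparison; treat both speeds as
# order-of-magnitude assumptions.
myelinated_axon_m_per_s = 100.0                # fast myelinated neurons, ~100 m/s
wire_m_per_s = 0.5 * 3.0e8                     # ">50% the speed of light"

print(wire_m_per_s / myelinated_axon_m_per_s)  # ~1.5e6x faster propagation in wires
```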
My read of the critics is that they aren’t objecting to the notion that clock speeds are significantly faster than neurons, but rather that comparing the two is a bad way of thinking about things. @EigenGender says “I don’t think serial computation is a sane metric here. I expect that the total computation represented in a neuron spike is much much higher than the footnote would indicate.” The “Processor clock speeds are not how fast AIs think” post says “In general, I think it’s more sensible for discussion of cognitive capabilities to focus on throughput metrics such as training compute (units of FLOP) and inference compute.”
I certainly agree that if we’re trying to evaluate power we need to consider throughput and total computation. Suppose that a synapse is not a simple numerical weight, and instead we needed to consider each dendritic neurotransmitter gate as a computational unit. This would force us to use many more FLOPs to model a synapse. But would it change the maximum speed? I agree that on a machine of a given size, if you have twice as many floating point operations to do, it will take twice as much time to get through them all. But if we consider the limit where we are not forced to do parallelizable computations in serial, I expect most of the arguments about computational richness are irrelevant?
Perhaps the critics are saying that it takes more serial computations to capture the logic of a single firing? But the source text allows for this, budgeting as many as 1,000 transistor operations per spike. Is it really so obvious that it takes more than 1,000 serial operations to capture a single neuron spike?
More context: I do think that the human brain is way more powerful (and WAY more efficient) than any current AI system. The extremely crude BOTEC of comparing weights and neocortex synapses says there’s something like a 100x difference, and my guess is that the brain is doing significantly fancier things than a modern transformer, algorithmically.
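Spelling out the kind of BOTEC I mean (the specific counts below are rough order-of-magnitude placeholders I’m assuming for illustration; estimates vary a lot):

```python
# Extremely crude BOTEC; both numbers are rough order-of-magnitude
# assumptions, not established figures.
frontier_model_weights = 1e12   # ballpark for a large frontier LLM
neocortex_synapses = 1e14       # commonly cited order of magnitude

print(neocortex_synapses / frontier_model_weights)  # ~100x
```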
And of course, training/learning speed may be much more relevant than processing speed, and AFAIK humans are just wildly more data efficient.
And of course, being able to approximate the logical action of a single neuron faster doesn’t imply that the AI will take less time to have each thought. It seems straightforward that machine systems will make decisions in high-speed contexts using quick pathways, and that they will use any extra thinking speed to think more deeply in contexts where speed isn’t important (much like humans do!).
Anyway, like I said, I’m confused. I respect IABI’s critics and am hoping to learn where my model is wrong.
I appreciate your point about this being a particularly bad place to exaggerate, given that it’s a cruxy point of divergence with our closest allies. This makes me update harder towards the need for a rewrite.
I’m not really sure how to respond to the body of your comment, though. Like, I think we basically agree on most major points. We agree that the failure mode the relevant text of The Problem is highlighting is real and important. We agree that doing Control research is important, and that if things are slow/gradual, this gives it a better chance of working. And I think we agree that it might end up being too fast and sloppy to actually save us. I’m more pessimistic about the plan of “use the critical window of opportunity to make scientific breakthroughs that save the day” but I’m not sure that matters? Like, does “we’ll have a 3-year window of working on near-human AGIs before they’re obviously superintelligent” change the take-away?
I’m also worried that we’re diverging from the question of whether the relevant bit of source text is false. Not sure what to do about that, but I thought I’d flag it.
Yep. I agree with this. As I wrote, I think it’s a key skill to manage to hold the heart of the issue in a way that is clear and raw, while also not going overboard. There’s a milquetoast failure mode and an exaggeration failure mode and it’s important to dodge both. I think the quoted text fails to thread the needle, and was agreeing with Ryan (and you) on that.
Upvoted! You’ve identified a bit of text that is decidedly hyperbolic, and is not how I would’ve written things.
Backing up, there is a basic point that I think The Problem is making, that I think is solid, and I’m curious if you agree with it. Paraphrasing: Many people underestimate the danger of superhuman AI because they mistakenly believe that skilled humans are close to the top of the range of mental ability in most domains. The mistake can be seen by looking at technology in general: wherever machines that can do comparable work have been built, specialized machines are approximately always better than the direct power that individual humans can bring to bear. (This is a broader pattern than with mental tasks, but it still applies to AI.)
The particular quoted section of text argues for this in a way that overstates the point. Phrases like “routinely blow humans out of the water,” “as soon as … at all,” “vastly outstrips,” and “barely [worth] mentioning” are rhetorically bombastic and unsubtle. Reality, of course, is subtle and nuanced and complicated. Hyperbole is a sin, according to my aesthetic, and I wish the text had managed not to exaggerate.
On the other hand, smart people are making an important error that they need to snap out of, and fighting words like the ones The Problem uses are helpful in foregrounding that mistake. There are, I believe, many readers who would glaze over a toned-down version of the text but who will correctly internalize the severity of the mistake when it’s presented in a bombastic way. Punchy text can also be fun to read, which matters.
On the other other hand, I think this is sort of what writing skill is all about? Like, can you make something that’s punchy and holds the important thing in your face in a way that clearly connects to the intense, raw danger while also being technically correct and precise? I think it’s possible! And we should be aspiring to that standard.
All that said, let’s dig into more of the object-level challenge. If I’m reading you right, you’re saying something like: in most domains, AI capabilities have been growing at a pace where the gap between “can do at all” and “vastly outstrips humans” takes at least years and sometimes decades, and it is importantly wrong to characterize this as “very soon afterwards.” I notice that I’m confused about whether you think this is importantly wrong in the sense of invalidating the basic point that people neglect how much room there is above humans in cognitive domains, or whether you think it’s importantly wrong because it conflicts with other aspects of the basic perspective, such as takeoff speeds and the importance of slowing down before we have AGI vs muddling through. Or maybe you’re just arguing that it’s hyperbolic, and you just wish the language were softer?
On some level you’re simply right. If we think of Go engines using MCTS as being able to play “at all” in 2009, then it took around 8 years (AlphaGo Zero) to vastly outstrip any human. Chess fits your framing even better, with human-comparable engines existing in the mid-1960s and it taking ~40 years for them to become seriously superhuman. Essays, coding, and buying random things on the internet are obviously still comparable to humans, and have arguably been around since ~2020 (less obviously with the buying random things, but w/e). Recognizing whether an image has a dog was arguably “at all” in 2012 with AlexNet, and became vastly superhuman ~2017.
On another level, I think you’re wrong. Note the use of the phrase “narrow domains” in the sentence before the one you quote. What is a “narrow domain”? Essay writing is definitely not narrow. Playing Go is a reasonable choice of “narrow domain,” but detecting dogs is an even better one. Suppose that you want to detect dogs for a specific task where you need a <10% error rate, and skilled humans achieve a ~5% error rate when trying (i.e., it’s comparable to ImageNet). If you need <10%, then AlexNet is not able to do that narrow task! It is not “at all.” Maybe GoogLeNet counts (in 2014) or maybe Microsoft’s ResNet (in 2015). At this point you have a computer system with ability comparable to a human who is skilled at the task and trying to do it. Is AI suddenly able to vastly outstrip human ability? Yes! The AI can identify images faster, more cheaply, and with no issues of motivation or fatigue. The world suddenly went from “you basically need a human to do this task” to “obviously you want to use an AI to do this task.” One could argue that Go engines instantly went from “can’t serve as good opponents to train against” to “vastly outstripping the ability of any human to serve as a training opponent” in a similar way.
(Chess is, I think, a weird outlier due to how it was simultaneously tractable to basic search, and a hard enough domain that early computers just took a while to get good.)
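To make the dog-detection threshold concrete (the error rates below are the commonly reported ImageNet top-5 figures as I remember them; treat them as approximate assumptions):

```python
# Approximate ImageNet top-5 error rates (from memory; treat as assumptions),
# checked against the <10% error threshold from the example above.
threshold = 0.10
systems = {
    "skilled human (~2014)": 0.051,
    "AlexNet (2012)":        0.153,
    "GoogLeNet (2014)":      0.067,
    "ResNet (2015)":         0.036,
}
for name, err in systems.items():
    verdict = "clears the bar" if err < threshold else "not 'at all' (for this task)"
    print(f"{name}: {err:.1%} top-5 error -> {verdict}")
```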
Suppose that I simply agree. Should we re-write the paragraph to say something like “AI systems routinely outperform humans in narrow domains. When AIs become at all competitive with human professionals on a given task, humans usually cease to be able to compete within just a handful of years. It would be unexpected if this pattern suddenly stopped applying for all the tasks that AI can’t yet compete with human professionals on.”? Do you agree that the core point would remain, if we did that rewrite? How would you feel about a simple footnote that says “Yes, we’re being hyperbolic here, but have you noticed the skulls of people who thought machines would not outstrip humans?”
Armstrong is one of the authors on the 2015 Corrigibility paper, which I address under the Yudkowsky section (sorry, Stuart!). I also have three of his old essays listed on the 0th essay in this sequence:
“The limits of corrigibility.” 2018.
“Petrov corrigibility.” 2018.
“Corrigibility doesn’t always have a good action to take.” 2018.
While I did read these as part of writing this sequence, I didn’t feel like they were central/foundational/evergreen enough to warrant a full response. If there’s something Armstrong wrote that I’m missing or a particular idea of his that you’d like my take on, please let me know! :)
You have correctly identified that giving a corrigible superintelligence to most people will result in doom. This is why I think it’s vital that power over superintelligence be kept in the hands of a benevolent governing body. And yes, since this is probably an impossible ask, I think we should basically shut down AI development until we figure out how to select for benevolence and wisdom.
Still, I think corrigibility is a better strategy than the approaches currently being taken by frontier labs (which are even more doomed).
I just encountered this, and I really appreciate you writing it! I feel like you very much got the essence of what I was hoping to communicate. :D
My reading of the text might be wrong, but it seems like bacteria count as living beings with goals? More speculatively, possible organisms that might exist somewhere in the universe also count for the consensus? Is this right?
If so, a basic disagreement is that I don’t think we should hand over the world to a “consensus” that is a rounding error away from 100% inhuman. That seems like a good way of turning the universe into ugly squiggles.
If the consensus mechanism has a notion of power, such that creatures that are disempowered have no bargaining power in the mind of the AI, then I have a different set of concerns. But I wasn’t able to quickly determine how the proposed consensus mechanism actually works, which is a bad sign from my perspective.
Oh, uh, I guess @wdmacaskill and @Rob Bensinger