Philosophy and Physics BSc, AI MSc at Edinburgh, starting a PhD at King’s College London. Interested in metaethics, anthropics/general philosophy and technical AI Safety.
Some good news for the claim that public awareness of X-risk in general should go up after coronavirus is the Economist cover story: https://www.economist.com/node/21788546?frsc=dg|e, https://www.economist.com/node/21788589?frsc=dg|e
In writing this answer I somehow completely forgot to mention Garrett Jones’ new book 10% Less Democracy, which essentially goes over every idea listed above along with many others!
The way I see it (I live in the UK) is that most western European governments are able to respond to a sufficiently unambiguous warning from experts that disaster is coming within a month if we carry on as normal, and react strongly to that warning, but not much more. That’s why we get blunt instruments like local lockdowns and a slow-to-scale contact tracing system that could easily be made to work with, say, 20 times its current budget.
That’s because the Morituri Nolumus Mori effect applies everywhere, but has to combine with some basic collective ability to perceive physical facts to work properly, as you say and as I argued in that post.
The MNM effect, rather than clever planning or reasoning, is what we credit for things not being as bad as they could be; the differences between e.g. America and Germany come down to whether there was any level of planning at all.
I think the extreme version of your ‘no ability to perceive physical facts’ claim applies to some US states, the US federal government and maybe Brazil and the various developing countries that just don’t have good enough information flow for people to stay informed, but doesn’t apply to Europe, let alone East Asia.
But I strongly suspect that when things do get New York-bad in those other states, we will see individual and state responses trying to keep it under control that will bring the R back to near 1, even if it seems hopeless right now.
After reading your summary of the difference (maybe just a difference in emphasis) between ‘Paul slow’ and ‘continuous’ takeoff, I did some further simulations. A low setting of d (highly continuous progress) doesn’t give you a Paul-slow condition on its own, but it is relatively easy to replicate a situation like this:
There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles. (Similarly, we’ll see an 8 year doubling before a 2 year doubling, etc.)
What we want is a scenario where you don’t get intermediate doubling intervals at all in the discontinuous case, but you get at least one in the continuous case. Setting s relatively high appears to do the trick.
Here is a scenario where we have very fast post-RSI growth, with s=5, c=1, I0=1 and I_AGI=3. I wrote some more code to produce plots of how long each complete interval of doubling took in each scenario. The ‘default’ doubling time, with no contribution from RSI, was 0.7. All the continuous scenarios had two complete doubling intervals of intermediate length before the doubling time collapsed to under 0.05 on the third doubling. The discontinuous model simply kept the original doubling interval until it collapsed to under 0.05 on the third doubling interval. It’s all in this graph.
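For concreteness, here is a minimal sketch of the kind of simulation I’m describing. The functional form is an illustration rather than the exact equations from my post: capability I grows exponentially at base rate c (giving the default doubling time of ln(2)/c ≈ 0.7 for c=1), plus an RSI term s·I² that is switched on by a logistic gate of sharpness d around I_AGI (a very high d approximates the discontinuous case).

```python
# Illustrative sketch only; assumed functional form, not the post's exact equations.
# dI/dt = c*I + s*I^2 * gate(I), with a logistic gate of sharpness d around I_AGI.
import numpy as np

def simulate(s=5.0, c=1.0, d=5.0, I0=1.0, I_AGI=3.0, dt=1e-4, t_max=5.0):
    """Euler-integrate capability I(t), stopping once I is very large."""
    ts, Is = [0.0], [I0]
    t, I = 0.0, I0
    while t < t_max and I < 1e6:
        z = np.clip(d * (I - I_AGI), -700.0, 700.0)  # keep exp() in range for large d
        gate = 1.0 / (1.0 + np.exp(-z))              # smooth switch-on of RSI
        I += (c * I + s * gate * I**2) * dt
        t += dt
        ts.append(t)
        Is.append(I)
    return np.array(ts), np.array(Is)

def doubling_intervals(ts, Is, n=3):
    """Length of each of the first n complete doubling intervals of I."""
    crossings = [np.interp(Is[0] * 2**k, Is, ts) for k in range(n + 1)]
    return np.diff(crossings)

print(doubling_intervals(*simulate(d=5.0)))    # fairly continuous switch-on
print(doubling_intervals(*simulate(d=100.0)))  # near-discontinuous switch-on
```

In this toy version the qualitative picture is the same: the first doubling takes roughly the default ln(2)/c ≈ 0.7 in both cases, but the continuous run starts shortening its doublings earlier, while the near-discontinuous run holds closer to the default interval until the RSI term switches on and the doubling time collapses.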
Let’s make the irresponsible assumption that this actually applies to the real economy, with the current growth mode (no contribution from RSI) given by the ‘slow/no takeoff’ s=0 condition.
The current doubling time is a bit over 23 years. In the shallow continuous progress scenario (red line), we get a 9-year doubling, a 4-year doubling and then a ~1-year doubling. In the discontinuous scenario (purple line) we get two 23-year doublings and then a ~1-year doubling out of nowhere. In other words, this fairly arbitrary setting of the parameters (it was the second set I tried) gives us a Paul-slow takeoff if you assume that all of this should be scaled to years of economic doubling. You can see that graph here.
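The rescaling itself is just arithmetic. A quick check, using the numbers quoted above (the model-time doubling intervals here are back-derived from the 9/4/1-year figures, so this is purely illustrative):

```python
# Map the model's default doubling time (0.7) onto the current ~23-year
# world-output doubling time, then convert model doubling intervals to years.
years_per_model_unit = 23 / 0.7                # ~33 calendar years per model time unit
for interval in (0.28, 0.12, 0.03):            # illustrative intervals from a continuous run
    print(f"{interval * years_per_model_unit:.1f} years")
# -> roughly 9.2, 3.9 and 1.0 years: the 9-year, 4-year and ~1-year doublings above
```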
We will eventually hit some sort of limit on growth, even with “just” exponential growth, but this limit could be quite far beyond what we have achieved so far. See also this related post.
One major intuitive finding that came out of that post was that most of the adjustments I made to the speed and continuity of the takeoff made a fairly marginal difference: I think that if you presented any one of those trajectories in isolation you would call it exceptionally fast.
I strongly suspect that as well as disagreements about discontinuities, there are very strong disagreements about ‘post-RSI speed’ - maybe over orders of magnitude.
This is what the curves look like if s (the effective ‘power’ of RSI) is set to 0.1 - the takeoff is much slower even if RSI comes about fairly abruptly.
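In terms of the sketch above, that corresponds to something like the following (again with illustrative parameters):

```python
# Reuses simulate() and doubling_intervals() from the earlier sketch:
# weak RSI (s=0.1) switched on abruptly (high d).
print(doubling_intervals(*simulate(s=0.1, d=100.0)))
# doubling times shrink gradually rather than collapsing outright
```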
Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.
One thing this post led me to consider is that when we bring together various fields, the evidence for ‘things will go insane in the next century’ is stronger than the evidence for any specific claim about (for example) AI takeoff. What is the other evidence?
We’re probably alone in the universe, and anthropic arguments tend to imply we’re living at an incredibly unusual time in history. Isn’t that what you’d expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947 - which was weird enough, but that should be your minimum expectation.
In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history:
1) Specific mechanisms around AGI—possibility of rapid capability gain, and arguments from exploratory engineering
2) Economic and technological trend-fitting predicting explosive growth in the next century
3) Anthropic and Fermi arguments suggesting that we live at some extremely unusual time
All of these are evidence for such a claim. 1) counts because a superintelligent AGI takeoff is just a specific example of how the hinge could occur. 3) argues for it directly. But how does 2) fit in with 1) and 3)?
There is something a little strange about calling a fast takeoff from AGI and whatever was driving superexponential growth throughout all of history the same trend. It would require some huge cosmic coincidence that ensures there is always superexponential growth: just as population growth plus growth in wealth per capita (or whatever was driving the trend until now) runs out in the great stagnation (visible as a tiny blip on the right-hand side of the double-log plot), AGI takes over and pushes us up the same trend line. That sort of coincidence is clearly not possible, so if AGI is what takes us up the rest of that trend line, there would have to be some factor responsible for both: a factor that was at work in the founding of Jericho but predestined that AGI would be invented and cause explosive growth in the 21st century, rather than the 19th or the 23rd.
For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step—if we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation. Is there anything that can fill it to complete the picture?
The mechanism proposed in the article seems like it could plausibly include AGI.
If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I’d be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.
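As a toy illustration of why the reinvested fraction and the returns to technology matter (my own sketch, not the article’s model): suppose output is Y = A^phi for technology level A, and a fraction f of output is reinvested in improving A. Then phi = 1 gives plain exponential growth, while phi > 1 gives superexponential growth with a finite-time singularity.

```python
# Toy reinvestment model (an illustrative assumption, not the article's):
# output Y = A**phi; a fraction f of output is reinvested, so dA/dt = f * Y.
import numpy as np

def technology(phi, f=0.05, A0=1.0, dt=0.01, steps=9000):
    A = np.empty(steps)
    A[0] = A0
    for i in range(1, steps):
        A[i] = A[i - 1] + f * A[i - 1]**phi * dt  # reinvest a fraction f of output
    return A

for phi in (0.9, 1.0, 1.2):
    print(phi, round(technology(phi)[-1], 1))
# phi < 1: subexponential; phi = 1: exponential; phi > 1: the true solution
# reaches a finite-time singularity (here at t = 100 for phi = 1.2, f = 0.05)
```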
But even though the drivers seem superficially similar (both are about technology), the claim is that one very specific technology will generate explosive growth, not that technology in general will. It seems strange that AGI would follow the same growth curve as reinvesting more GWP in improving ordinary technology, which doesn’t improve your own ability to think in the way that AGI would.
As for precise timings, the great stagnation (the last 30-ish years) seems like it would just stretch out the timeline a bit, so we shouldn’t take the 2050s too seriously: however well the last 70 years fit an exponential trend line, there’s really no way to make that fit the overall trend, as that post makes clear.
They do disagree about locality, yes, but as far as I can tell that is downstream of the assumption that there won’t be a very abrupt switch to a new growth mode. A single project pulling suddenly ahead of the rest of the world would happen if the growth curve is such that with a realistic amount (a few months) of lead time you can get ahead of everyone else.
So the obvious difference in predictions is that e.g. Paul/Robin think that takeoff will occur across many systems in the world while MIRI thinks it will occur in a single system. That is because MIRI thinks that RSI is much more of an all-or-nothing capability than the others, which in turn is because they think AGI is much more likely to depend on a few novel, key insights that produce sudden gains in capability. That was the conclusion of my post.
In the past I’ve called locality a practical discontinuity: from the outside world’s perspective, does a single project explode out of nowhere? Whether you get a practical discontinuity doesn’t just depend on whether progress is discontinuous. A discontinuity to RSI capability does give you a practical discontinuity, but that is a sufficient condition, not a necessary one. If the growth curve is steep enough you might get a practical discontinuity anyway.
Perhaps Eliezer-2008 believed that there would be a discontinuity in returns on optimization leading to a practical discontinuity/local explosion but Eliezer-2020 (since de-emphasising RSI) just thinks we will get a local explosion somehow, either from a discontinuity or sufficiently fast continuous progress.
My graphs above do seem to support that view—even most of the ‘continuous’ scenarios seem to have a fairly abrupt and steep growth curve. I strongly suspect that as well as disagreements about discontinuities, there are very strong disagreements about ‘post-RSI speed’ - maybe over orders of magnitude.
This is what the curves look like if s is set to 0.1 - the takeoff is much slower even if RSI comes about fairly abruptly.
Further to this point: there is something a little strange about calling a fast takeoff from AGI and whatever was driving superexponential growth throughout all of history the same phenomenon. If true, some huge cosmic coincidence ensures there is always superexponential growth: just as population growth plus growth in wealth per capita (or whatever was driving the trend until now) runs out in the great stagnation (visible as a tiny blip on the right-hand side of the double-log plot), AGI takes over and pushes us up the same trend line. That’s clearly not possible, so there would have to be some factor responsible for both if AGI is what takes us up the rest of that trend line.
In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history in some sense:
1) Specific mechanisms around AGI—possibility of rapid capability gain
2) Economic and technological trend-fitting predicting a singularity around 2050
3) Anthropic and Fermi arguments suggesting that we live at some extremely unusual time
All of these are also arguments for the more general claim that we live at the hinge of history. 1) counts because a superintelligent AGI takeoff is just a specific example of how the hinge occurs, and it is plausible for much more specific reasons. 3) argues for that directly. But how does 2) fit in with 1) and 3)? For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step. If we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation: what phenomenon causes the right technologies to appear, on schedule, to continue the superexponential trend all the way from 10,000 BCE to the arrival of AGI?
On the point about ‘Deterioration of collective epistemology’, and how it might interact with an impending risk, we have some recent evidence in the form of the Coronavirus response.
It’s important to note the potential role of sleepwalk bias and the Morituri Nolumus Mori effect here. The way I conceptualised it, sufficiently terrible collective epistemology can vitiate any advantage you might expect from the MNM effect (or from discounting sleepwalk bias), but it has to be so bad that current danger is somehow rendered invisible. In other words, the MNM effect says that the quality of our collective epistemology and how bad the danger is are not independent: we get slightly smarter in some relevant ways as the stakes go up. There do appear to be some levels of impaired collective epistemology that are hard to recover from even when the stakes are high; if the information about risk is effectively or actually inaccessible, we don’t respond to it.
On the other hand, the MNM effect requires leaders and individuals to have access to information about the state of the world right now (i.e. how dangerous things are at the moment). Even in countries with reasonably free flow of information this is not a given. If you accept Eliezer Yudkowsky’s thesis that clickbait has impaired our ability to understand a persistent, objective external world, then you might be more pessimistic about the MNM effect going forward. Perhaps for this reason, we should expect countries with higher social trust, and therefore more ability for individuals to agree on a consensus reality and understand the level of danger posed, to perform better. Japan and Northern European countries like Denmark and Sweden come to mind, and all of them have performed better than the mitigation measures employed by their governments would suggest.
It might seem the MNM hypothesis doesn’t fit terribly well with people voluntarily choosing to go to that rally in Oklahoma (or, to take an example from the other aisle, the uncritical support certain other outdoor gatherings received from many people who ought to know better), but I actually did say something at the end of my post that seems to explain both:
In that same interview, Stuart Russell gave his list of roadblocks, which is relevant as he may have just made a claim that was falsified by GPT3 -
The first thing is that the Go board is fully observable. You can see the entire state of the world that matters. And of course in the real world there’s lots of stuff you don’t see and don’t know. Some of it you can infer by accumulating information over time, what we call state estimation, but that turns out to be quite a difficult problem. Another thing is that we know all the rules of Go, and of course in the real world, you don’t know all the rules, you have to learn a lot as you go along. Another thing about the Go board is that despite the fact that we think of it as really complicated, it’s incredibly simple compared to the real world. At any given time on the Go board there’s a couple of hundred legal moves, and the game lasts for a couple hundred moves.
And if you said, well, what are the analogous primitive actions in the real world for a human being? Well, we have 600 muscles and we can actuate them maybe about 10 times per second each. Your brain probably isn’t able to do that, but physically that’s what could be your action space. And so you actually have then a far greater action space. And you’re also talking about… We often make plans that last for many years, which is literally trillions of primitive actions in terms of muscle actuations. Now we don’t plan those all out in detail, but we function on those kinds of timescales. Those are some of the ways that Go and the real world differ. And what we do in AI is we don’t say, okay, I’ve done Go, now I’m going to work on suicide Go, and now I’m going to work on chess with three queens.
What we try to do is extract the general lessons. Okay, we now understand fairly well how to handle that whole class of problems. Can we relax the assumptions, these basic qualitative assumptions about the nature of the problem? And if you relax all the ones that I listed, and probably a couple more that I’ve got
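As an aside, the ‘trillions’ figure checks out, assuming the 600 muscles and ~10 actuations per second from the quote and taking ‘many years’ to mean about a decade (the decade is my assumption):

```python
# Rough arithmetic check of the quoted claim.
muscles = 600
actuations_per_second = 10           # per muscle, from the quote
seconds_per_year = 60 * 60 * 24 * 365
years = 10                           # assumed horizon for "plans that last many years"
print(muscles * actuations_per_second * seconds_per_year * years)  # ~1.9e12
```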
So dealing with partial observability, discovering new action sets, managing mental activity (?) and some others. This seems close to the list in an older post I wrote:
Stuart Russell’s List
human-like language comprehension
cumulative learning
discovering new action sets
managing its own mental activity
For reference, I’ve included two capabilities we already have that I imagine would have been on a similar list in 1960:
perception and object recognition
efficient search over known facts
If AlphaStar is evidence that partial observability isn’t going to be a problem, is GPT3 similarly evidence that language comprehension isn’t going to be a problem, since GPT3 can do things like simple arithmetic? That leaves cumulative learning, discovering action sets and managing mental activity on Stuart’s list.
See also this post.
The part of my post that is relevant to AI alignment is right at the end, but I make a point similar to Rohin’s: we have actually significantly mitigated the effects of Coronavirus, but have still failed in a certain specific way -
The lesson to be learned is that there may be a phase shift in the level of danger posed by certain X-risks—if the amount of advance warning or the speed of the unfolding disaster is above some minimal threshold, even if that threshold would seem like far too little time to do anything given our previous inadequacy, then there is still a chance for the MNM effect to take over and avert the worst outcome. In other words, AI takeoff with a small amount of forewarning might go a lot better than a scenario where there is no forewarning, even if past performance suggests we would do nothing useful with that forewarning.
More speculatively, I think we can see the MNM effect’s influence in other settings where we have consistently avoided the very worst outcomes despite systematic inadequacy. Anders Sandberg referenced something like it when discussing the probability of nuclear war: there have been many near misses when nuclear war could have started, implying that we can’t just have been lucky over and over. Instead, there has been a stronger skew towards interventions that halt disaster at the last moment, compared to not-the-last-moment:
The increase is so monotonic that either the data’s wrong, we’re going to experience a major break with the past in the mid-2040s, or it’s galactic time when I’m in early middle age. One thing this post led me to consider is that when we bring everything together, the evidence for ‘things will go insane in the next century’ is stronger than the evidence for any specific scenario as to how. This isn’t the only evidence for the broad thesis of ‘things are going to go crazy over the next decades’, where crazy is defined as more rapid change than we saw over the previous century.
Treat this like a detective story: bring in disparate clues. We’re probably alone in the universe, and anthropic arguments tend to imply we’re living at an incredibly unusual time in history. Isn’t that what you’d expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947, which was weird enough, but that should be your minimum expectation.
EDIT: on the point about AI, I just checked to see if there were any recent updates and now we have Image GPT. Heck.
A possible example of the Ernest Rutherford effect (a respected scientist says a thing isn’t going to happen, and then the next day it does): Stuart Russell, speaking in a recent podcast -
Deep learning systems are needing, even for these relatively simple concepts, thousands, tens of thousands, millions of examples, and the idea within deep learning seems to be that well, the way we’re going to scale up to more complicated things like learning how to write an email to ask for a job, is that we’ll just have billions or trillions of examples, and then we’ll be able to learn really, really complicated concepts. But of course the universe just doesn’t contain enough data for the machine to learn direct mappings from perceptual inputs or really actually perceptual input history. So imagine your entire video record of your life, and that feeds into the decision about what to do next, and you have to learn that mapping as a supervised learning problem. It’s not even funny how unfeasible that is. The longer the deep learning community persists in this, the worse the pain is going to be when their heads bang into the wall.
I could be wrong, but GPT3 could probably write a passable job application letter.
My first thought was that they put some convolutional layers in to preprocess the images and then used the GPT architecture, but no, it’s literally just GPT again...
Does this maybe give us evidence that the brain isn’t anywhere near a peak of generality, since we use specialised circuits for processing image data (which convolutional layers were based on)?
Some factors that seem important for whether or not you get the MNM effect:
rate of increase of the danger (sudden, not gradual)
intuitive understanding of the danger
level of social trust and agreement over facts
historical memory of the disaster
how certain the threat is
coordination problems
how dangerous the threat is
how tractable the problem seems
From reading your post—the sleepwalk bias does seem to be the mirror-image of the Morituri Nolumus Mori effect; that we tend to systematically underweight strong, late reactions. One difference is that I was thinking of both individual and policy responses whilst your post focusses on policy, but that’s in large part because most of the low-frequency high-damage risks we commonly talk[ed] about are X-risks that can be dealt with only at the level of policy. I also note that I got at a few of the same factors as you that might affect the strength of such a reaction:
The catastrophe is arriving too fast for actors to react.
It is unclear whether the catastrophe will in fact occur, or it is at least not very observable for the relevant actors (the financial crisis, possibly AGI).
The possible disaster, though observable in some sense, is not sufficiently salient (especially to voters) to override more immediate concerns (climate change).
There are conflicts (World War I) and/or free-riding problems (climate change) which are hard to overcome.
The problem is technically harder than initially thought.
I discussed the speed issue in my conclusions, and I obliquely referred to the salience issue in talking about ‘ability to understand consensus reality’ and the pre-existing instincts around purity and disgust that would help a response to something like a pandemic. The presence of free-rider problems I didn’t discuss. How the speed and difficulty of the problem interact with the response I did mention, in talking about the hypotheticals where R0 was 2 or 8, for example.
Those differences aside, it seems like we got at the same phenomenon independently.
I’m curious about whether you made any advance predictions about likely outcomes based on your understanding of the ‘sleepwalk bias’. I made a light suggestion that things might go better than expected in mid-March, but I can’t really call it a prediction. The first time I explicitly said ‘we were wrong’ was when a lot of evidence had already come in—in April.
This is a tricky problem. The first-order answer seems to be ‘have the right people in power’, but that’s not an actionable strategy. However, it’s amazing what a difference just one or two people can make: apparently a major reason the UK didn’t delay its lockdown even further and risk ending up like the US is Dominic Cummings.
The two main angles are either making the marketplace of ideas / electoral system select for foresight and sanity more effectively, or building institutions with specific remits that can stand aside from such pressures and make the right choices anyway. The first is really hard and the second is really dangerous. However, neither is impossible.
For the first, there’s ordinary electoral reform. An interesting alternative was given in Against Democracy by Jason Brennan, who proposes a new form of epistocracy to reach higher-quality decisions; you can judge his scheme for yourself.
For the second, building competent independent institutions and then handing off power to them, the track record is pretty mixed: independent central banks come to mind as a good example, and the recent horrible Coronavirus debacle with the CDC, FDA and Public Health England as an especially bad one. For how to do that sort of thing correctly, you might also want to look at what Dominic Cummings has proposed, starting with e.g. this, or this article on Westminster dysfunction. He likes prediction markets, but not exclusively; he talks about building decentralised institutions that can operate with a large degree of independence.
On the specific angle of being more sane with respect to X-risks, I tend to favour the second approach (independent institutions) because I think it likely has a bigger effect and is easier to pull off than raising the society-wide sanity waterline. Toby Ord spoke a lot about this in ‘The Precipice’. As for why, here’s Scott Alexander:
Average national IQ correlates well with GDP per capita and other measures of development. But is average national IQ really the right number to look at? “Smart fraction theory” suggests we should instead look at the range of top IQs, since the smartest people are most likely to drive national growth by inventing things or starting businesses or governing well. Now Heiner Rindermann and James Thompson (names you may recognize!) have given the hypothesis its most complete test so far, and found that yes, IQ at the 95th percentile correlates better with national development than at the 50th percentile. But I am a little skeptical of their results...
Having elite opinion be non-crazy matters a lot in situations like the one we’re in right now, so don’t make ‘we need to improve public discourse’ your plan A for avoiding this level of chaos. As suggested here, we should hand off more and more to expert boards with limited remits, following the example of independent central banks, which didn’t turn into a French-Revolution-style rationalist tyranny over the masses, starting with everything to do with catastrophic risks. Someone in the UK government apparently took that suggestion seriously. Just don’t get Steven Pinker involved.