Your points are excellent, and without near-magical nanotech, I suspect they rule out most of the fastest “foom” scenarios. But I don’t think it matters that much in the long run.
A hostile ASI (without nanotech) would need, at a minimum, robot mines and robot factories. Which means it would need human buy-in for long enough to automate the economy. Which means that the AI needs the approval and the assistance of humans.
But humans are really easy to manipulate:
Powerful humans want more power or more wealth. Promise them that and they’ll sell out the rest of humanity in a heartbeat.
Corporations want good numbers, and they’ll do whatever it takes to make the quarterly earnings look good.
Humans are incredibly susceptible to propaganda, and they will happily cause horrific and long-lasting damage to their futures because of what they saw on TV or Facebook.
Any new AI tool will immediately be given all the power and control it can handle, and probably some that it can’t.
Also, LLMs are very good actors; they can imitate any role in the training set. So the net result is that the AI will act cooperative, and it will make a bunch of promises to powerful people and the public. And we’ll ultimately hand control over, because we’ll be addicted to high quality intelligence for a couple of dollars an hour.
Once the AI can credibly promise wealth, leisure, and advanced medical technology, we’ll give it more and more control.
Your points are excellent, and without near-magical nanotech, I suspect they rule out most of the fastest “foom” scenarios.
Technical flag: I’m only claiming that near-magical nanotech won’t be developed in the time period that matters here, not claiming that it’s impossible to do.
But I don’t think it matters that much in the long run.
I partially disagree with this, because I believe that buying time matters a lot for a singularity that runs through automated AI alignment, so it really matters whether we would be doomed in 1-10 years, 1-12 months, or 1-4 weeks.
And importantly, if we assume that the AI is dependent on its power and data centers early on, this absolutely makes AI control schemes much more viable than otherwise, because the AIs don’t want to escape the box so much as subvert it.
This also buys us a slower takeoff than otherwise, which is going to be necessary for muddling through to work.
That said, it could well be difficult to persuade at least some carefully selected people without great BCI/nanotech.
But yeah, this is one of the reasons why I’m still worried about AI takeover, and I absolutely agree with these points:
Powerful humans want more power or more wealth. Promise them that and they’ll sell out the rest of humanity in a heartbeat.
Corporations want good numbers, and they’ll do whatever it takes to make the quarterly earnings look good.
Any new AI tool will immediately be given all the power and control it can handle, and probably some that it can’t. (At least by default)
I’d argue this is an instrumental goal for all AIs, not just LLMs, but this is closer to a nitpick:
Also, LLMs are very good actors; they can imitate any role in the training set. So the net result is that the AI will act cooperative, and it will make a bunch of promises to powerful people and the public. And we’ll ultimately hand control over, because we’ll be addicted to high quality intelligence for a couple of dollars an hour.
Once the AI can credibly promise wealth, leisure, and advanced medical technology, we’ll give it more and more control.
Technical flag: I’m only claiming that near-magical nanotech won’t be developed in the time period that matters here, not claiming that it’s impossible to do.
I think there are several potentially relevant categories of nanotech:
Drexlerian diamond phase nanotech. By Drexler’s own calculations, I recall that this would involve building systems with 10^15 atoms and very low error rates. Last I looked, this whole approach has been stuck at error rates above 80% per atom since the 90s. At least one expert with domain expertise argues that “machine phase” nanotech is likely a dead end, in Soft Machines. Summary: Liquid-phase self-assembly using Brownian motion is stupidly effective at this scale. (A quick yield sketch below shows why the error-rate requirement is such a showstopper.)
Non-trivial synthetic biology. If you buy either the existence proof of natural biology or the argument in Soft Machines, this road should still be open to an ASI. And maybe some descendant of AlphaFold could make this work! But it’s not clear that it offers an easy route to building enormous quantities of GPU-equivalents. Natural selection of single-cell organisms is fast, massively parallel, and ongoing for billions of years.
Engineered plagues. This probably is within reach of even humans, given enough resources and effort. A virus with a delayed mortality rate similar to MERS with the transmissibility of post-Omicron strains of SARS-COV-2 might very well be a “recipe for ruin” that’s within reach of multiple nation-states. But critically, this wouldn’t allow an ASI to build GPUs unless it already had robot mines and factories, and the ability to defend them from human retaliation.
So yeah, if you want to get precise, I don’t want to rule out (2) in the long run. But (2) is likely difficult, and it’s probably much more likely than (1).
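To put a rough number on the error-rate point in (1): the sketch below is plain Python, the 10^15 atom count is the figure quoted above, and the candidate error rates are purely illustrative, not measurements.

```python
import math

# Rough yield arithmetic for the machine-phase assembly point in (1).
# The 1e15 atom count comes from the discussion above; the candidate
# per-atom error rates are illustrative, not measured values.

ATOMS_PER_OBJECT = 1e15

def defect_free_yield(per_atom_error_rate: float) -> float:
    """Probability that every atom in one object is placed correctly,
    assuming independent errors (a simplifying assumption)."""
    # (1 - p)^N, evaluated in log space to avoid underflow surprises.
    return math.exp(ATOMS_PER_OBJECT * math.log1p(-per_atom_error_rate))

for p in (0.8, 1e-6, 1e-12, 1e-15, 1e-18):
    print(f"per-atom error rate {p:g}: defect-free yield ≈ {defect_free_yield(p):.3g}")

# Even at a 1e-12 per-atom error rate essentially nothing comes out defect-free;
# you need error rates around 1e-15 per atom (or serious error correction and
# repair) before whole-object yield stops being negligible.
```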
I partially disagree with this, because I believe that buying time matters a lot for a singularity that runs through automated AI alignment, so it really matters whether we would be doomed in 1-10 years, 1-12 months, or 1-4 weeks.
If I strip my argument of all the details, it basically comes down to: “In the long run, superior intelligence and especially cheap superior intelligence wins the ability to make the important decisions.” Or some other versions I’ve heard:
“Improved technology, including the early steam engine, almost always created more and better jobs for horses. Right up until we had almost fully general replacements for horses.”
“Hey, I haven’t seen Homo erectus around lately.”
This isn’t an argument about specific pathways to a loss of control. Rather, it’s an argument that tireless, copyable, Nobel-prize-winner-level general intelligence which costs less than minimum wage has massive advantages (both economically and in terms of natural selection). In my case, it’s also an argument based on a strong suspicion that alignment of ASI cannot be guaranteed in the long term.
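To put rough numbers on “cheap”: the sketch below reuses the “couple of dollars an hour” figure from earlier in the thread; the human salary is an illustrative assumption, not data.

```python
# Back-of-the-envelope cost of an always-on AI "worker" vs. a human expert.
# The $2/hour figure echoes "a couple of dollars an hour" above; the human
# salary is an illustrative assumption.

AI_COST_PER_HOUR = 2.00            # dollars, assumed
HOURS_PER_YEAR = 24 * 365          # it doesn't sleep or take weekends
HUMAN_EXPERT_SALARY = 200_000      # dollars per year, illustrative

ai_cost_per_year = AI_COST_PER_HOUR * HOURS_PER_YEAR
print(f"AI worker-year:    ${ai_cost_per_year:,.0f}")
print(f"Human expert-year: ${HUMAN_EXPERT_SALARY:,.0f}")
print(f"Cost ratio:        ~{HUMAN_EXPERT_SALARY / ai_cost_per_year:.0f}x")
# Roughly $17,520 vs. $200,000 per year, before counting copyability or speed.
```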
Basically, I see only three viable scenarios which turn out well:
AI fizzle. This would be nice, but I’m not counting on it.
A massive, terrifying incident leading to world-wide treaties against AI, backed up by military force. E.g., “Joint Chinese-US strike forces will bomb your data centers as hard as necessary to shut them down, and the UN and worldwide public will agree you had it coming.”
We ultimately lose control to the AI, but we get lucky, and the AI likes us enough to keep us as pets. We might be able to bias an inevitable loss of control in this direction, with luck. Call this the “Culture scenario.”
Buying time probably helps in scenarios (2) and (3), either because you have a larger window for attempted ASI takeover to fail spectacularly, or because you have more time to bias an inevitable loss of control towards a “humans as well-loved pets” scenario.
(I really need to write up a long-form argument of why I fear that long-term, guaranteed ASI alignment is not a real thing, except in the sense of “initially biasing ASI to be more benevolent pet owners.”)
I think there are several potentially relevant categories of nanotech:
Drexlerian diamond phase nanotech. By Drexler’s own calculations, I recall that this would involve building systems with 10^15 atoms and very low error rates. Last I looked, this whole approach has been stuck at error rates above 80% per atom since the 90s. At least one expert with domain expertise argues that “machine phase” nanotech is likely a dead end, in Soft Machines. Summary: Liquid-phase self-assembly using Brownian motion is stupidly effective at this scale.
Non-trivial synthetic biology. If you buy either the existence proof of natural biology or the argument in Soft Machines, this road should still be open to an ASI. And maybe some descendant of AlphaFold could make this work! But it’s not clear that it offers an easy route to building enormous quantities of GPU-equivalents. Natural selection of single-cell organisms is fast, massively parallel, and ongoing for billions of years.
Engineered plagues. This probably is within reach of even humans, given enough resources and effort. A virus with a delayed mortality rate similar to MERS with the transmissibility of post-Omicron strains of SARS-COV-2 might very well be a “recipe for ruin” that’s within reach of multiple nation-states. But critically, this wouldn’t allow an ASI to build GPUs unless it already had robot mines and factories, and the ability to defend them from human retaliation.
So yeah, if you want to get precise, I don’t want to rule out (2) in the long run. But (2) is likely difficult, and it’s probably much more likely than (1).
I’m going to pass on the question of whether Drexlerian diamond phase nanotech is possible: there are way too many competing explanations of what happened to nanotech in the 90s, and settling the question isn’t worth the effort, because I think the non-trivial synthetic biology path is probably enough to mostly replicate the dream of nanotech.
My reasons here come down to the fact that I think natural selection missed the potential of reversible computation. Reversible computers must still pay a minimum energy cost, but it is far, far less than what irreversible computers must pay, and for whatever reason natural selection just didn’t make life perform reversible rather than irreversible computation, which means an AI could exploit this to save energy. My other reason is that reversible computers can do all the computational work normal computers can do; this is an area where I just disagreed with @jacob_cannell the last time we talked about this.
Paper below to prove my point (PDF is available):
Logical reversibility of computation
https://www.semanticscholar.org/paper/Logical-reversibility-of-computation-Bennett/4c7671550671deba9ec318d867522897f20e19ba
And pretty importantly, this alone can get you a lot of OOMs: I estimated we could get about 15 OOMs of energy savings just by moving from the Landauer limit to the Margolus-Levitin limit, which is enough to let you explore far, far more of the design space than nature has explored so far:
https://www.lesswrong.com/posts/pFaLjmyjBKPdbptPr/does-biology-reliably-find-the-global-maximum-or-at-least#e9ji2ZLy4Aq92RmuN
The universal (at least until we get better physical models) bound on computation is in this paper, which you might like reading:
A Universal Constraint on Computational Rates in Physical Systems
https://arxiv.org/abs/2208.11196
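A minimal sketch of the two limits being compared, assuming room temperature and a few illustrative switching times. It only evaluates the standard formulas, so it shows where the headroom comes from rather than deriving the 15 OOM figure, which depends on the operating point you assume.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
HBAR = 1.054571817e-34  # reduced Planck constant, J*s
T = 300.0               # room temperature, K (illustrative assumption)

# Landauer limit: minimum energy *dissipated* per irreversible bit erasure.
landauer_per_bit = K_B * T * math.log(2)

# Margolus-Levitin bound: an operation completed in time t requires average
# energy E >= pi*hbar/(2*t) to be *present*; that energy need not be dissipated,
# which is the loophole reversible designs try to exploit.
for t_op in (1.0, 1e-6, 1e-9):  # illustrative operation times, seconds
    ml_floor = math.pi * HBAR / (2.0 * t_op)
    print(f"t_op = {t_op:g} s: Landauer {landauer_per_bit:.2e} J/bit erased, "
          f"ML floor {ml_floor:.2e} J per op, "
          f"ratio ≈ {landauer_per_bit / ml_floor:.1e}")

# The gap is enormous for slow, low-power operations and shrinks as you demand
# faster switching, so the exact OOM count depends on those assumptions.
```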
So my general intuition is that being able to drastically lower energy expenditure makes a lot of synthetic-life design proposals much more viable than they would otherwise be, and that probably includes most of the specific nanotech examples Drexler proposed.
That said, I agree that this can be made difficult, especially if we apply AI control.
Now, on to the important meat of the discussion.
On this:
If I strip my argument of all the details, it basically comes down to: “In the long run, superior intelligence and especially cheap superior intelligence wins the ability to make the important decisions.” Or some other versions I’ve heard:
“Improved technology, including the early steam engine, almost always created more and better jobs for horses. Right up until we had almost fully general replacements for horses.”
“Hey, I haven’t seen Homo erectus around lately.”
This isn’t an argument about specific pathways to a loss of control. Rather, it’s an argument that tireless, copyable, Nobel-prize-winner-level general intelligence which costs less than minimum wage has massive advantages (both economically and in terms of natural selection).
I think this argument is correct until this part:
In my case, it’s also an argument based on a strong suspicion that alignment of ASI cannot be guaranteed in the long term.
I think this is actually not true: I think in the long term it’s certainly possible to value-align an ASI, though I agree that in the short term we will absolutely not be confident that our alignment techniques worked.
(I really need to write up a long-form argument of why I fear that long-term, guaranteed ASI alignment is not a real thing, except in the sense of “initially biasing ASI to be more benevolent pet owners.”)
I do agree that even in good scenarios, the relationship between baseline humans and ASI will very likely look a lot more like a human-pet relationship, or the benevolent god/angel-human relationships of mythology and fiction, than like any other relationship. It’s just that I count that sort of outcome as an alignment success, because the only thing propping it up is value alignment; if AIs were as selfish as, say, most billionaires, far worse outcomes from AI takeover would result.
And the AI control agenda is in large part about making AI alignment safe to automate, which is why time matters here.
I do agree that something like AI takeover, in either a positive or a negative direction, is very likely inevitable assuming continued AI progress.
I agree that reversible computation would be a very, very big deal. Has anyone proposed any kind of remotely plausible physical substrate that doesn’t get laughed out of the room by competent researchers in materials science and/or biochemistry? I haven’t seen anything, but I haven’t been looking in this area, either.
There are a few other possible computational game changers. For example, if you could get 200 to 500 error-corrected qubits, you could likely do much more detailed simulations of exotic chemistry. And that, in turn, would give you lots of things that might get you closer to “factories in a box.”
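For a sense of scale on why a few hundred good qubits matter: the trivial sketch below shows how fast exact classical simulation of the state vector blows up (16 bytes per amplitude is the usual double-precision complex choice; the qubit counts are the ones mentioned above).

```python
# Memory needed just to store an n-qubit state vector classically grows as 2**n.
# 16 bytes per amplitude = one double-precision complex number.

BYTES_PER_AMPLITUDE = 16

for n_qubits in (50, 200, 500):
    amplitudes = 2.0 ** n_qubits
    bytes_needed = amplitudes * BYTES_PER_AMPLITUDE
    print(f"{n_qubits:3d} qubits: ≈ {amplitudes:.3e} amplitudes, "
          f"≈ {bytes_needed:.3e} bytes to store exactly")

# 50 qubits already needs tens of petabytes; 200-500 error-corrected qubits are
# astronomically beyond any classical memory, which is why they could open up
# chemistry simulations we cannot brute-force today.
```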
So I can’t rule out an ASI finding some path to compact self-replication from raw materials. Biology did it once that we know of, after all. It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
I think this is actually not true: I think in the long term it’s certainly possible to value-align an ASI, though I agree that in the short term we will absolutely not be confident that our alignment techniques worked.
Unfortunately, I’ve never really figured out how to explain why I suspect robust alignment is impossible. The problem is that too many of my intuitions on this topic come from:
Working with Lisp developers who were near the heart of the big 80s AI boom. They were terrifyingly capable people, and they made a heroic effort to make “rule-based” systems work. They failed, and they failed in a way that convinced most of them that they were going down the wrong path.
Living through the 90s transition to statistical and probabilistic methods, which quickly outstripped what came before. (We could also have some dimensionality reduction, as a treat.)
Spending too much time programming robots, which is always a brutal lesson in humility. This tends to shatter a lot of naive illusions about how AI might work.
So rather than make an ironclad argument, I’m going to wave vaguely in the direction of my argument, in hope that you might have the right referents to independently recognize what I’m waving at. In a nutshell:
The world is complex, and you need to work to interpret it. (What appears in this video? Does the noisy proximity sensor tell us we’re near a wall?)
The output of any intelligent system is basically a probability distribution (or ranking) over the most likely answers. (I think the video shows a house cat, but it’s blurry and hard to tell. I think we’re within 4 centimeters of a wall, with an 80% probability of falling within 3-5 centimeters. I think the Roomba is in the living room, but there’s a 20% chance we’re still in the kitchen.)
The absolute minimum viable mapping between the hard-to-interpret inputs and the weighted output candidates is a giant, inscrutable matrix with a bunch of non-linearities thrown in. This is where all the hard-earned intuitions I mentioned above come in. In nearly all interesting cases, there is no simpler form.
And on top of this, “human values” are extremely poorly defined. We can’t specify what we want, and we don’t actually agree. (For a minority of humanity, “hurting the outgroup” is a fairly major value. For another very large minority, “making everyone submit to the authority I follow” is absolutely a value. See the research on “authoritarian followers” for more.)
So the problem boils down to ambiguous inputs, vague and self-contradictory policies, and probabilistic outputs. And the glue holding all this together is a multi-billion parameter matrix with some non-linearities thrown in just for fun. And just in case that wasn’t fun enough, any realistic system will also need to (1) learn from experience, and (2) design successor systems.
Even if you can somehow exert reasonable influence over the values of a system, the system will learn from experience, and it will spend a lot of its time far outside any training distribution. And eventually it will need to design a new system.
Fundamentally, once such a system is built, it will end up making its own decisions. Maybe, if we’re lucky, we can bias it towards values we like and get a “benevolent pet owner” scenario. But a thousand years from now, the AIs will inevitably be making all the big decisions.
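Here is a deliberately tiny sketch of the shape I mean: a noisy input goes in, a matrix plus a non-linearity sits in the middle, and a probability distribution over answers comes out. Every number here (the weights, the sensor reading) is invented for illustration; a real system has the same shape with billions of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The "giant, inscrutable matrix" in miniature: 3 noisy sensor inputs -> 4
# hidden units (with a non-linearity) -> 2 hypotheses ("near wall" / "not").
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def belief(sensor_readings: np.ndarray) -> np.ndarray:
    hidden = np.tanh(W1 @ sensor_readings)   # the non-linearity
    return softmax(W2 @ hidden)              # probability over the answers

reading = np.array([0.8, 0.1, -0.3])         # one noisy proximity reading
p_wall, p_not_wall = belief(reading)
print(f"P(near wall) = {p_wall:.2f}, P(not near wall) = {p_not_wall:.2f}")
# The output is never a crisp fact, only a weighted guess, and staring at W1
# and W2 individually tells you very little about why the guess came out this way.
```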
I agree that reversible computation would be a very, very big deal. Has anyone proposed any kind of remotely plausible physical substrate that doesn’t get laughed out of the room by competent researchers in materials science and/or biochemistry? I haven’t seen anything, but I haven’t been looking in this area, either.
There are a few other possible computational game changers. For example, if you could get 200 to 500 error-corrected qubits, you could likely do much more detailed simulations of exotic chemistry. And that, in turn, would give you lots of things that might get you closer to “factories in a box.”
So I can’t rule out an ASI finding some path to compact self-replication from raw materials. Biology did it once that we know of, after all. It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
The answer to the question about materials that would enable more efficient reversible computers than conventional computers is that, currently, they don’t exist. But I interpret the lack of such materials so far not as much evidence that very efficient reversible computers are impossible, but rather as evidence that creating computers at all is unusually difficult compared to other domains, mostly because of the contingencies of how our supply chains are set up, combined with the fact that so far we haven’t had much demand for reversible computation. And unlike most materials that people want, here we aren’t asking for a material that we know violates basic physical laws, which I suspect is the only reliable constraint on ASI in the long run.
I think it’s pretty easy to make it quite difficult for the AI to figure out nanotech in the time period that’s relevant, so I don’t usually consider nanotech a big threat from AI takeover. And I think competent researchers not finding any plausible materials so far is a much better signal that this will take real-world experimentation or very high-end simulation, meaning it’s pretty easy to stall for time, than it is a signal that such computers are impossible.
I explicitly agree with these 2 points, for the record:
It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
On this part:
Unfortunately, I’ve never really figured out how to explain why I suspect robust alignment is impossible. The problem is that too many of my intuitions on this topic come from:
Working with Lisp developers who were near the heart of the big 80s AI boom. They were terrifyingly capable people, and they made a heroic effort to make “rule-based” systems work. They failed, and they failed in a way that convinced most of them that they were going down the wrong path.
Living through the 90s transition to statistical and probabilistic methods, which quickly outstripped what came before. (We could also have some dimensionality reduction, as a treat.)
Spending too much time programming robots, which is always a brutal lesson in humility. This tends to shatter a lot of naive illusions about how AI might work.
So rather than make an ironclad argument, I’m going to wave vaguely in the direction of my argument, in hope that you might have the right referents to independently recognize what I’m waving at. In a nutshell:
The world is complex, and you need to work to interpret it. (What appears in this video? Does the noisy proximity sensor tell us we’re near a wall?)
The output of any intelligent system is basically a probability distribution (or ranking) over the most likely answers. (I think the video shows a house cat, but it’s blurry and hard to tell. I think we’re within 4 centimeters of a wall, with an 80% probability of falling within 3-5 centimeters. I think the Roomba is in the living room, but there’s a 20% chance we’re still in the kitchen.)
The absolute minimum viable mapping between the hard-to-interpret inputs and the weighted output candidates is a giant, inscrutable matrix with a bunch of non-linearities thrown in. This is where all the hard-earned intuitions I mentioned above come in. In nearly all interesting cases, there is no simpler form.
And on top of this, “human values” are extremely poorly defined. We can’t specify what we want, and we don’t actually agree. (For a minority of humanity, “hurting the outgroup” is a fairly major value. For another very large minority, “making everyone submit to the authority I follow” is absolutely a value. See the research on “authoritarian followers” for more.)
So the problem boils down to ambiguous inputs, vague and self-contradictory policies, and probabilistic outputs. And the glue holding all this together is a multi-billion parameter matrix with some non-linearities thrown in just for fun. And just in case that wasn’t fun enough, any realistic system will also need to (1) learn from experience, and (2) design successor systems.
Even if you can somehow exert reasonable influence over the values of a system, the system will learn from experience, and it will spend a lot of its time far outside any training distribution. And eventually it will need to design a new system.
Fundamentally, once such a system is built, it will end up making its own decisions. Maybe, if we’re lucky, we can bias it towards values we like and get a “benevolent pet owner” scenario. But a thousand years from now, the AIs will inevitably be making all the big decisions.
So I have a couple of points to make in response.
1 is that I think alignment progress is largely separable from interpretability progress, at least in the short term, and I think a lot of the trouble with rule-based systems came from expecting complete interpretability on the first try.
The reason the two can come apart is AI control.
2 is that this is why the alignment problem is defined as the problem of getting AIs to do what their creator/developer/owner/user intends them to do, whether or not that thing is good or bad from other moral perspectives; the goal is to let arbitrary goals be chosen without leading to perverse outcomes for the owner of the AI system.
This means that if the AI is aligned to even one human, that counts as an alignment success for the purposes of the alignment problem.
John Wentworth has a more complete explanation below:
https://www.lesswrong.com/posts/dHNKtQ3vTBxTfTPxu/what-is-the-alignment-problem
3 is that I believe automating AI alignment is pretty valuable. In the long run I don’t expect alignment to look like a list of rules; I expect it to look like AIs optimizing in the world for human thriving, and I don’t necessarily expect the definition to be anything compact, which is fine in my view.
4 is that alignment doesn’t require the AI not taking over; it’s fine if the AI takes over and makes us pets, or has us serve in Heaven. In particular, it’s totally fine if the AIs make all the decisions, so long as they are near-perfectly or perfectly aligned to the human. What I mean is that the human delegates all of the tasks to the AI; it’s just that the values are decided by the humans at the start of the AI explosion, even if those values aren’t compact and the AI is entirely autonomous in working for the human after that.
The best explanation of how value alignment is supposed to work comes from @Thane Ruthenis’s post below on what a utopia-maximizer would look like:
https://www.lesswrong.com/posts/okkEaevbXCSusBoE2/how-would-an-utopia-maximizer-look-like
(Edited due to a difficult-to-understand reaction by @Vladimir_Nesov, whose ideas can often be confusing to newcomers; that was a strong signal my words weren’t clarifying enough.)
(Edit 2: I changed goals to values, as I apparently didn’t clarify that goals in my ontology basically correspond to values/morals, and are terminal, not instrumental goals, and gave a link to clarify how value alignment might work).
5 is that to the extent interpretability of AI works, I expect its use case to be not understanding everything, but rather intervening on AIs even when we don’t have labeled data.
From Sam Marks:
Rather, I think that most of the value lies in something more like “enabling oversight of cognition, despite not having data that isolates that cognition.” In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there’s something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the “tell the human what they think is the correct answer” mechanism from the “tell the human what I think is the correct answer” mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than “unknown unknowns”-type problems.
To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I’m not sure that it is.) Rather, it’s that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept. As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.
And I think this is very plausible even if your interpretability of the AI isn’t complete or nearly complete.
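To give the flavor of the kind of intervention Sam is describing, here is a toy analogue, not SHIFT or sparse feature circuits themselves, just synthetic data and a logistic probe, where the “interpretability” step is simulated by already knowing which feature carries the unintended concept:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy analogue of "intervene on a known spurious cue without labeled data for
# that concept." All data is synthetic and invented for illustration.

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)
signal = (2 * y - 1) + rng.normal(0, 1.0, n)            # the intended feature
spurious_train = (2 * y - 1) + rng.normal(0, 0.3, n)    # cue correlated with y in training
spurious_test = -(2 * y - 1) + rng.normal(0, 0.3, n)    # correlation reversed at test time

X_train = np.column_stack([signal, spurious_train])
X_test = np.column_stack([signal, spurious_test])

clf = LogisticRegression().fit(X_train, y)
print("test accuracy, original model:", clf.score(X_test, y))

# The "interpretability" step is faked here: we simply know column 1 encodes the
# unintended concept, so we ablate its weight directly. No labels isolating the
# concept were used, only knowledge of *which* direction encodes it.
clf.coef_[0, 1] = 0.0
print("test accuracy, spurious weight ablated:", clf.score(X_test, y))
```

The ablated model falls back on the intended feature and recovers off-distribution, which is the basic move; the hard part in real models is the step this toy fakes, actually locating the cue.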
But that’s my response to why I think aligning AI is possible at all.
It’s clearer now what you are saying, but I don’t see why you are attributing that point to me specifically (it’s mostly gesturing at value alignment as opposed to intent alignment).
it’s fine if the AI takes over and makes us pets
This sounds like permanent disempowerment. Intent alignment to bad decisions would certainly be a problem, but that doesn’t imply denying opportunity for unbounded growth, where in particular eventually decisions won’t have such issues.
it’s just that the goal is decided by the human
If goals are “decided”, then it’s not value alignment, and bad decisions lead to disasters.
(Overall, this framing seems unhelpful when given in response to someone arguing that values are poorly defined.)