I agree that reversible computation would be a very, very big deal. Has anyone proposed any kind of remotely plausible physical substrate that doesn’t get laughed out of the room by competent researchers in materials science and/or biochemistry? I haven’t seen anything, but I haven’t been looking in this area, either.
There are a few other possible computational game changers. For example, if you could get 200 to 500 error-corrected logical qubits, you could likely do much more detailed simulations of exotic chemistry. And that, in turn, would give you lots of things that might get you closer to “factories in a box.”
So I can’t rule out an ASI finding some path to compact self-replication from raw materials. Biology did it once that we know of, after all. It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
I think this is actually not true, and I think in the long-term, it’s certainly possible to value-align an ASI, though I agree that in the short term, we will absolutely not be confident that our alignment techniques worked.
Unfortunately, I’ve never really figured out how to explain why I suspect robust alignment is impossible. The problem is that too many of my intuitions on this topic come from:
Working with Lisp developers who were near the heart of the big 80s AI boom. They were terrifyingly capable people, and they made a heroic effort to make “rule-based” systems work. They failed, and they failed in a way that convinced most of them that they were going down the wrong path.
Living through the 90s transition to statistical and probabilistic methods, which quickly outstripped what came before. (We could also have some dimensionality reduction, as a treat.)
Spending too much time programming robots, which is always a brutal lesson in humility. This tends to shatter a lot of naive illusions about how AI might work.
So rather than make an ironclad argument, I’m going to wave vaguely in the direction of my argument, in the hope that you might have the right referents to independently recognize what I’m waving at. In a nutshell:
The world is complex, and you need to work to interpret it. (What appears in this video? Does the noisy proximity sensor tell us we’re near a wall?)
The output of any intelligent system is basically a probability distribution (or ranking) over the most likely answers. (I think the video shows a house cat, but it’s blurry and hard to tell. I think we’re about 4 centimeters from a wall, with an 80% probability that the true distance is between 3 and 5 centimeters. I think the Roomba is in the living room, but there’s a 20% chance we’re still in the kitchen.)
The absolute minimum viable mapping between the hard-to-interpret inputs and the weighted output candidates is a giant, inscrutable matrix with a bunch of non-linearities thrown in. This is where all the hard-earned intuitions I mentioned above come in. In nearly all interesting cases, there is no simpler form. (See the toy sketch just after this list.)
And on top of this, “human values” are extremely poorly defined. We can’t specify what we want, and we don’t actually agree. (For a minority of humanity, “hurting the outgroup” is a fairly major value. For another very large minority, “making everyone submit to the authority I follow” is absolutely a value. See the research on “authoritarian followers” for more.)
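Here is the toy sketch referenced above. Everything in it is invented for illustration: the sensor readings and candidate hypotheses are made up, and the random weights stand in for a real system’s learned, inscrutable ones. The shape is the point: noisy input goes in, a matrix with non-linearities maps it, and a probability distribution over candidate answers comes out.

```python
# Toy pipeline: noisy sensor readings in, matrix-plus-nonlinearity mapping,
# probability distribution over candidate answers out. All numbers made up.
import numpy as np

rng = np.random.default_rng(0)

# Pretend input: a short window of noisy proximity-sensor readings (centimeters).
sensor_window = np.array([3.9, 4.2, 3.7, 4.5, 4.0])

# In a real system these weights are learned, enormous, and not human-readable;
# here they are just random numbers of the right shape.
W1 = rng.normal(size=(8, sensor_window.size))
W2 = rng.normal(size=(3, 8))

hidden = np.tanh(W1 @ sensor_window)  # the non-linearity thrown in
logits = W2 @ hidden

# Softmax: the output is a ranking of hypotheses with probabilities attached.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for hypothesis, p in zip(["near a wall", "in open space", "sensor glitch"], probs):
    print(f"{hypothesis}: {p:.2f}")
```

None of this makes the mapping interpretable; even in this toy, the individual weights mean nothing on their own.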
So the problem boils down to ambiguous inputs, vague and self-contradictory policies, and probabilistic outputs. And the glue holding all this together is a multi-billion parameter matrix with some non-linearities thrown in just for fun. And just in case that wasn’t fun enough, any realistic system will also need to (1) learn from experience, and (2) design successor systems.
Even if you can somehow exert reasonable influence over the values of a system, the system will learn from experience, and it will spend a lot of its time far outside any training distribution. And eventually it will need to design a new system.
Fundamentally, once such a system is built, it will end up making its own decisions. Maybe, if we’re lucky, we can bias it towards values we like and get a “benevolent pet owner” scenario. But a thousand years from now, the AIs will inevitably be making all the big decisions.
I agree that reversible computation would be a very, very big deal. Has anyone proposed any kind of remotely plausible physical substrate that doesn’t get laughed out of the room by competent researchers in materials science and/or biochemistry? I haven’t seen anything, but I haven’t been looking in this area, either.
There are a few other possible computational game changers. For example, if you could get 200 to 500 error-corrected logical qubits, you could likely do much more detailed simulations of exotic chemistry. And that, in turn, would give you lots of things that might get you closer to “factories in a box.”
So I can’t rule out an ASI finding some path to compact self-replication from raw materials. Biology did it once that we know of, after all. It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
The answer to the question about materials that would enable reversible computers more efficient than conventional ones is that, currently, they don’t exist. But I don’t read the lack of such materials so far as much evidence that very efficient reversible computers are impossible. Rather, it’s evidence that creating computers at all is unusually difficult compared to other domains, mostly because of the contingencies of how our supply chains are set up, combined with the fact that so far we haven’t had much demand for reversible computation. And unlike most materials that people want, here we aren’t asking for a material that we know violates basic physical laws, which I suspect is the only reliable constraint on ASI in the long run.
I also think it’s pretty easy to make it quite difficult for the AI to figure out nanotech in the time period that is relevant, so I don’t usually consider nanotech a big threat in AI takeover. And I think competent researchers not finding any plausible materials so far is a much better signal that this will take real-world experimentation or very high-end simulation, meaning it’s pretty easy to stall for time, than it is a signal that such computers are impossible.
I explicitly agree with these 2 points, for the record:
It’s more that (1) worlds in which an ASI can figure this out easily are probably doomed, and (2) I suspect that convincing humans to allow robot mines and factories is easier and quicker.
On this part:
Unfortunately, I’ve never really figured out how to explain why I suspect robust alignment is impossible. The problem is that too many of my intuitions on this topic come from:
Working with Lisp developers who were near the heart of the big 80s AI boom. They were terrifyingly capable people, and they made a heroic effort to make “rule-based” systems work. They failed, and they failed in a way that convinced most of them that they were going down the wrong path.
Living through the 90s transition to statistical and probabilistic methods, which quickly outstripped what came before. (We could also have some dimensionality reduction, as a treat.)
Spending too much time programming robots, which is always a brutal lesson in humility. This tends to shatter a lot of naive illusions about how AI might work.
So rather than make an ironclad argument, I’m going to wave vaguely in the direction of my argument, in the hope that you might have the right referents to independently recognize what I’m waving at. In a nutshell:
The world is complex, and you need to work to interpret it. (What appears in this video? Does the noisy proximity sensor tell us we’re near a wall?)
The output of any intelligent system is basically a probability distribution (or ranking) over the most likely answers. (I think the video shows a house cat, but it’s blurry and hard to tell. I think we’re about 4 centimeters from a wall, with an 80% probability that the true distance is between 3 and 5 centimeters. I think the Roomba is in the living room, but there’s a 20% chance we’re still in the kitchen.)
The absolute minimum viable mapping between the hard-to-interpret inputs and the weighted output candidates is a giant, inscrutable matrix with a bunch of non-linearities thrown in. This is where all the hard-earned intuitions I mentioned above come in. In nearly all interesting cases, there is no simpler form.
And on top of this, “human values” are extremely poorly defined. We can’t specify what we want, and we don’t actually agree. (For a minority of humanity, “hurting the outgroup” is a fairly major value. For another very large minority, “making everyone submit to the authority I follow” is absolutely a value. See the research on “authoritarian followers” for more.)
So the problem boils down to ambiguous inputs, vague and self-contradictory policies, and probabilistic outputs. And the glue holding all this together is a multi-billion parameter matrix with some non-linearities thrown in just for fun. And just in case that wasn’t fun enough, any realistic system will also need to (1) learn from experience, and (2) design successor systems.
Even if you can somehow exert reasonable influence over the values of a system, the system will learn from experience, and it will spend a lot of its time far outside any training distribution. And eventually it will need to design a new system.
Fundamentally, once such a system is built, it will end up making its own decisions. Maybe, if we’re lucky, we can bias it towards values we like and get a “benevolent pet owner” scenario. But a thousand years from now, the AIs will inevitably be making all the big decisions.
So I have several points to make in response.
The first is that I think alignment progress is largely separable from interpretability progress, at least in the short term, and I think a lot of the issues with rule-based systems came from expecting complete interpretability on the first go.
This is largely because of AI control: we can get useful work out of AIs, and limit the harm they could do, without fully understanding their internals.
The second is that this is why the alignment problem is defined as the problem of how to get AIs to do what their creator/developer/owner/user intends them to do, whether or not that thing is good or bad from other moral perspectives; the goal is for arbitrary goals to be chosen without leading to perverse outcomes for the owner of the AI system.
This means that if an AI is aligned to even one human, that counts as an alignment success for the purposes of the alignment problem.
John Wentworth has a more complete explanation below:
https://www.lesswrong.com/posts/dHNKtQ3vTBxTfTPxu/what-is-the-alignment-problem
The third is that I believe automating AI alignment is pretty valuable. In the long run, I don’t expect alignment to look like a list of rules; I expect it to look like AIs optimizing the world for human thriving, and I don’t necessarily expect the definition of that to be anything compact, which is fine in my view.
The fourth is that alignment doesn’t require that the AI not take over. It’s fine if the AI takes over and makes us pets, or if we “serve in Heaven”; in particular, it’s totally fine if the AIs make all the decisions, so long as they are perfectly or near-perfectly aligned to the human. What I mean is that the human delegates all of the tasks to the AI; it’s just that the values are decided by the humans at the start of the AI explosion, even if those values aren’t compact and the AI is entirely autonomous in working for the human after that.
The best explanation of how value alignment is supposed to work comes from @Thane Ruthenis’s post below on what a utopia-maximizer would look like:
https://www.lesswrong.com/posts/okkEaevbXCSusBoE2/how-would-an-utopia-maximizer-look-like
(Edited due to a hard-to-interpret reaction from @Vladimir_Nesov, whose ideas can often be confusing to newcomers; that reaction was a strong signal that my words weren’t clear enough.)
(Edit 2: I changed “goals” to “values”, since I apparently hadn’t made clear that goals in my ontology basically correspond to values/morals and are terminal, not instrumental, goals, and I added a link to clarify how value alignment might work.)
The fifth is that to the extent interpretability of AIs works, I expect its main use case to be not understanding everything, but rather intervening on AIs even when we don’t have labeled data.
From Sam Marks:
Rather, I think that most of the value lies in something more like “enabling oversight of cognition, despite not having data that isolates that cognition.” In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there’s something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the “tell the human what they think is the correct answer” mechanism from the “tell the human what I think is the correct answer” mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than “unknown unknowns”-type problems.
To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I’m not sure that it is.) Rather, it’s that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept. As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.
And I think this is very plausible even if your interpretability of an AI isn’t complete or nearly complete.
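To make that kind of intervention concrete, here is a minimal sketch. It is not the SHIFT method itself, just generic concept ablation: it assumes interpretability has already handed you a direction in activation space for the unwanted concept (the random vector below is a stand-in for, say, a sparse autoencoder feature), and it shows that you can then remove the concept at inference time without any labeled data isolating it.

```python
# Toy sketch of "intervene on cognition you can point at, without labeled
# data that isolates it." Not SHIFT from Sparse Feature Circuits; just
# projecting a known concept direction out of hidden activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size

# Pretend interpretability gave us a unit vector for the unwanted concept
# (e.g. the gender direction in the Bias in Bios example).
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)

def ablate_concept(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each activation vector's component along `direction`."""
    coeffs = acts @ direction                    # how much of the concept is present
    return acts - np.outer(coeffs, direction)    # subtract that component

# Pretend batch of hidden activations from some layer of the model.
acts = rng.normal(size=(4, d_model))
cleaned = ablate_concept(acts, concept_dir)

# Downstream heads can no longer read the concept off this direction.
print(np.allclose(cleaned @ concept_dir, 0.0))   # True
```

In a real model you would apply something like this at a particular layer during the forward pass; the point is only that the intervention needs a pointer to the concept, not a labeled dataset for it.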
But that’s my response, and it’s why I think aligning AI is possible at all.
It’s clearer now what you are saying, but I don’t see why you are attributing that point to me specifically (it’s mostly gesturing at value alignment as opposed to intent alignment).
it’s fine if the AI takes over and makes us pets
This sounds like permanent disempowerment. Intent alignment to bad decisions would certainly be a problem, but that doesn’t imply denying the opportunity for unbounded growth, where eventually decisions won’t have such issues.
it’s just that the goal is decided by the human
If goals are “decided”, then it’s not value alignment, and bad decisions lead to disasters.
(Overall, this framing seems unhelpful when given in response to someone arguing that values are poorly defined.)