What I eventually realized is that this line of argument is a perfect rebuttal of the whole mesa-optimization neurosis that has popped up, but it doesn’t actually give us AI safety because it completely breaks down once you apply it to e.g. law enforcement or warfare.
I agree this doesn’t perfectly solve the AI safety problem, and my guess is that it doesn’t work for law enforcement/warfare because the instrumental goals there are adversarial, such that you are not incentivized to avoid breaking other agents’ goals.
However, if something like the plan from John Wentworth’s post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.
Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I’d encourage you to write up your results on that line of argument anyway.
How?
I didn’t really get any further than John Wentworth’s post here. But also I’ve been a lot less spooked by LLMs than Eliezer Yudkowsky.
Basically, because you can safely get highly capable AIs to work on long and confusing problems without worrying that they’d eventually take over and kill everyone, and this includes all plans for automating alignment.
Also, a crux here is that I expect automating alignment research to be way less adversarial than fields like law enforcement/warfare, because you are facing way less opposition to your goals.
If you want AIs to produce a lot of text on AI alignment and moral philosophy, you can already do that now without worrying that the AIs in question will take over the world.
If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can’t really know how to adapt and improve it without actually involving it in those conflicts.
Ok, the key point I want to keep in mind is that for the purposes of AI alignment, we don’t really need to solve most human conflicts, other than internally generated ones, because the traditional alignment problem is aligning an AI to a single human, so most of the political conflicts do not actually matter here.
Most places where AI or alignment are applied are more convoluted cases where lots of people are involved. It’s generally not economically feasible to develop AGI for a single person, so it doesn’t really happen.
Agree with this, but the point here is that a single person (or at least a small set of people) has control over AI values by default, such that the AI is aligned to them personally and essentially treats other people according to the instructions/wishes of that person or group. That was my point in claiming that most conflicts don’t matter: they have a resolution procedure that is very simple to implement.
I don’t think the people who develop AGI have clear or coherent wishes for how the AGI should treat most other people.
Agree with this, but 2 things:
I expect people to develop clearer and more coherent wishes once they actually realize that they might have nation-state-level power.
Most versions of incoherent/unclear wishes for other humans do not result in existential catastrophe, relative to other failure modes for AI safety.
I don’t really understand how you expect this line of thought to play out. Are you arguing e.g. Sam Altman would start using OpenAI to enforce his own personal moral opinions, even when they are extremely unpopular?
This definitely can happen, though I’d argue in practice it wouldn’t go as far as enforcing his own opinions by force. To get back to what I wanted to argue, my point here is that instrumental goals lead to corrigibility, and that in practice we will have instruction-following AGIs/ASIs rather than value-aligned AGIs/ASIs:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
I don’t understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?
The key property is that we can reasonably trust their research not to have adversarial backdoors, so we can let our guard down quite a lot. The pivotal act I usually envision has to do with automating the R&D pipeline, which then leads to automating the alignment pipeline, which leads to existential safety.
Note this doesn’t look like a pivotal act, and that is not coincidental: real-life heroism doesn’t look like bombast or hard power. It looks like making a process more efficient, as in the Green Revolution, or preventing backfire risks where you’d make the situation worse.
I’m not interested in your key property; I’m interested in a more proper end-to-end description. Superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time, but your description is kind of too vague to say for sure.
I have to agree with this comment by Matt Levinson below: at least 3 of the specific failure modes described in the post can’t be solved by any AI safety agenda, because every agenda relies on the assumption that people will actually use it, so there’s no reason to consider them here. Having read the discourse on that post, I think the main ways I disagree with John Wentworth are that I’m much more optimistic in general about verification, and don’t find his view that verification is not easier than generation plausible at all, which leads me to be more optimistic about something like a market of ideas for AI alignment working; I also think bureaucracies in general are way better than John Wentworth seems to imply.
This is also related to the experiment John did on whether markets reliably solve hard problems instead of goodharting (the air conditioner test), and my takeaway is that markets are actually sometimes good at optimizing things, and that people just don’t appreciate the economic/computational constraints behind why something is the way it is.
Comments below:
https://www.lesswrong.com/posts/8wBN8cdNAv3c7vt6p/the-case-against-ai-control-research#FembwXfYSwnwxzWbC
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#maJBX3zAEtx5gFcBG
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP#g4N9Pdj8mQioRe43q
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#3TFECJ3urX6wLre5n
The posts I disagree with:
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP
https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general
https://www.lesswrong.com/posts/MMAK6eeMCH3JGuqeZ/everything-i-need-to-know-about-takeoff-speeds-i-learned
https://www.lesswrong.com/posts/hsqKp56whpPEQns3Z/why-large-bureaucratic-organizations
(For the bureaucratic organizations point, I think the big reason bureaucracy exists is a combo of: needing very strongly to avoid corruption/bad states, so simple, verifiable rules are best; the world giving us problems that are hard to solve but easy to verify; and humans needing to coordinate.)
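A toy illustration of “hard to solve but easy to verify” (my own example, not something from the thread): checking a proposed subset-sum certificate takes linear time, while finding one by brute force takes exponential time in general.

```python
# Toy illustration of "easy to verify, hard to solve" using subset-sum.
# The names and numbers are hypothetical; the asymmetry is the point.

from itertools import combinations
from typing import List, Optional, Tuple

def verify(nums: List[int], target: int, certificate: Tuple[int, ...]) -> bool:
    """Cheap: check that the proposed subset is drawn from nums and hits the target."""
    remaining = list(nums)
    for x in certificate:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(certificate) == target

def solve(nums: List[int], target: int) -> Optional[Tuple[int, ...]]:
    """Expensive in general: brute-force search over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify(nums, 9, (4, 5)))  # True, checked in linear time
print(solve(nums, 9))           # (4, 5), found only after searching many subsets
```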
So I’m much less worried about slop than John Wentworth is.
If you’re assuming that verification is easier than generation, you’re pretty much a non-player when it comes to alignment.
My new concept for “pivotal act that stops the world from getting to ASI, even though we get to AGI” is a soft-power act of better coordination. Get help from AGI to design and deploy decentralized governance tech that allows humanity (and AIs) to coordinate on escaping the trap of suicide-race.
Once you start getting involved with governance, you’re going to need law enforcement and defense, which is an adversarial context, and thus the whole instrumental-goal-niceness argument collapses.
I disagree. There are a great many conflicts of different scales (some kinds make it into writing, some just happen), both in history and now; I believe they span the space of human conflict almost fully. Just aggregating this information could lead to very good advice on handling everything, which the AI could act upon if it needed to.
Could you unpack both clauses of this sentence? It’s not obvious to me why they are true.
For the former I’d need to hear your favorite argument in favor of the neurosis that inner alignment is a major problem.
For the latter, in the presence of adversaries, every subgoal has to be robust against those adversaries, which is very unfriendly.
I think mesa-optimizers could be a major problem, but there are good odds we live in a world where they aren’t. Why do I think they’re plausible? Because optimization is a pretty natural capability, and a mind being or becoming an optimizer at the top level doesn’t seem like a very complex claim, so I assign decent odds to it. There’s some weak evidence in favour of this too, e.g. humans not optimizing for what the local, myopic evolutionary optimizer acting on them is optimizing for, coherence theorems, etc. But that’s not super strong, and there are other simple hypotheses for how things go, so I don’t assign more than about 10% credence to the hypothesis.
It’s still not obvious to me why adversaries are a big issue. If I’m acting against an adversary, it seems like I won’t make counter-plans that lead to lots of side-effects either, for the same reasons they won’t.
I mean, we can start by noticing that historically, optimization in the presence of adversaries has led to huge things. The world wars wrecked Europe. States and large bureaucratic organizations probably exist mainly as a consequence of farm raids. The immune system tends to stress out the body a lot when it is dealing with an infection. And while the nuclear arms race never actually triggered the destruction, it created existential risk for humanity and still made people quite afraid of e.g. nuclear power. Etc.
Now, why does trying to destroy a hostile optimizer tend to cause so much destruction? I feel like the question almost answers itself.
Or if we want to go mechanistic about it: one of the ways to fight back against the Nazis is with bombs, which deliver a sudden shockwave of energy that destroys Nazi structures and everything else. It’s almost constitutive of the alignment problem: we have a lot of ways of influencing the world a great deal, but those methods do not discriminate between good and bad.
From an abstract point of view, many coherence theorems rely on e.g. Dutch books, and thus become much more applicable in the case of adversaries. The coherence theorem “if an agent achieves its goals robustly regardless of environment, then it stops people who want to shut it down” can be trivially restated as “either an agent does not achieve its goals robustly regardless of environment, or it stops people who want to shut it down”, and here non-adversarial agents should obviously choose the former branch (to be corrigible, you need to not achieve your goals in an environment where someone is trying to shut you down).
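To spell the restatement out in symbols (my notation, not the original commenter’s): write $R$ for “the agent achieves its goals robustly regardless of environment” and $S$ for “the agent stops people who want to shut it down”. The restatement is just the standard equivalence

$$(R \implies S) \;\equiv\; (\lnot R \lor S),$$

and corrigibility amounts to picking the $\lnot R$ disjunct: accepting that your goals won’t be achieved in environments where someone is trying to shut you down.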
From a more strategic point of view, when dealing with an adversary, you tend to become a lot more constrained on resources because if the adversary can find a way to drain your resources, then it will try to do so. Ways to succeed include:
Making it harder for people to trick you into losing resources, by e.g. making it harder for people to predict you and being less trusting of what people tell you, and winning as quickly as possible
Gaining more resources by grabbing them from elsewhere
Also, in an adversarial context, a natural prior is that inconveniences are there for a reason, namely to interfere with you. This tends to make enemies.
Mesa-optimizers definitely exist to varying degrees, but they generally try not to get too involved with other things. Mechanistically, we can attribute this to imitation learning, since they’re trying to mimic humans’ tendency to stitch together strategies in a reasonable way. Abstractly, the friendliness of instrumental goals shows us why unbounded unfriendly utility maximizers are not the only or even the main attractor here.
(… Some people might say that we have a mathematical model of unbounded unfriendly utility maximizers but not of friendlier bounded instrumental optimizers. But those people are wrong because the model of utility maximizers assumes we have an epistemic oracle to handle the updating, prediction and optimization for us, and really that’s the computationally heavy part. One of the advantages of more bounded optimization like in the OP is that it ought to be more computationally tractable because different parts of the plans interfere less with each other. It’s not really fair to say that we know how utility maximizers work when they outsource the important part to the assumptions.)
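As a minimal illustration of that last point (my own sketch, not anything from the post or this thread): the textbook expected-utility maximizer is only a few lines once you take the predictor, the utility function, and the action enumeration as given; those assumed inputs are the “epistemic oracle” doing essentially all of the computational work.

```python
# Minimal sketch of the textbook expected-utility maximizer.
# All names here are illustrative. The point: `predict`, `utility`, and the
# enumeration of `actions` are assumed inputs -- the "epistemic oracle" --
# and they are where the computationally heavy part lives.

from typing import Callable, Dict, Hashable, Iterable

Action = Hashable
Outcome = Hashable

def expected_utility_maximizer(
    actions: Iterable[Action],
    predict: Callable[[Action], Dict[Outcome, float]],  # assumed: P(outcome | action)
    utility: Callable[[Outcome], float],                 # assumed: U(outcome)
) -> Action:
    """Return argmax over actions of E[U(outcome) | action]."""
    def expected_utility(a: Action) -> float:
        return sum(p * utility(o) for o, p in predict(a).items())
    return max(actions, key=expected_utility)
```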