However, if something like the plan from John Wentworth’s post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.
Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I’d encourage you to write up your results on that line of argument anyway.
How?
I didn’t really get any further than John Wentworth’s post here. But also I’ve been a lot less spooked by LLMs than Eliezer Yudkowsky.
Basically, because you can safely get highly capable AIs to work on long and confusing problems without worrying that they'd eventually take over and kill everyone, and this includes all plans for automating alignment.
Also, a crux here is that I expect automating alignment research to be way less adversarial than fields like law enforcement/warfare, because you are facing far less opposition to your goals.
If you want AIs to produce a lot of text on AI alignment and moral philosophy, you can already do that now without worrying that the AIs in question will take over the world.
If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can’t really know how to adapt and improve it without actually involving it in those conflicts.
OK, the key point I want to keep in mind is that, for the purposes of AI alignment, we don't really need to solve most human conflicts other than internally generated ones, because the traditional alignment problem is aligning an AI to a single human, so most of the political conflicts don't actually matter here.
Most places where AI or alignment are applied are more convoluted cases where lots of people are involved. It’s generally not economically feasible to develop AGI for a single person, so it doesn’t really happen.
Agree with this, but the point here is that a single person (or at least a small set of people) has control over the AI's values by default, such that the AI is aligned to them personally and essentially treats other people according to the instructions/wishes of that single person or small set of people. That was my point in claiming that most conflicts don't matter: they have a resolution procedure that is very simple to implement.
I don't think the people who develop AGI have clear or coherent wishes for how the AGI should treat most other people.
Agree with this, but two things:
1. I expect people to develop clearer and more coherent wishes once they actually realize that they might have nation-state-level power.
2. Most versions of incoherent/unclear wishes for other humans do not result in existential catastrophe, relative to other failure modes for AI safety.
I don’t really understand how you expect this line of thought to play out. Are you arguing e.g. Sam Altman would start using OpenAI to enforce his own personal moral opinions, even when they are extremely unpopular?
This definitely can happen, though I'd argue that in practice it wouldn't go as far as enforcing his own opinions by force. To get back to what I wanted to argue, my point here is that instrumental goals lead to corrigibility, and that in practice we will have instruction-following AGIs/ASIs rather than value-aligned AGIs/ASIs:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
I don't understand your whole end-to-end point. How does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?
The key property is that we can reasonably trust their research not to contain adversarial backdoors, so we can let our guard down quite a lot. The pivotal act I usually envision has to do with automating the R&D pipeline, which then leads to automating the alignment pipeline, which in turn leads to existential safety.
Note that this doesn't look like a pivotal act, and that's not coincidental: real-life heroism doesn't look like bombast or hard power. It looks like making a process more efficient, as with the Green Revolution, or preventing the backfire risk of making the situation worse.
I'm not interested in your key property; I'm interested in a more proper end-to-end description. Superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time, but your description is kind of too vague to say for sure.
I have to agree with the comment by Matt Levinson below, that at least 3 of the specific failure modes described in the post can't be solved by any AI safety agenda, because they rely on the assumption that people will use the agenda, so there's no reason to consider them. Having read the discourse on that post, I think the main ways I disagree with John Wentworth are that I'm much more optimistic about verification in general, and I don't find his view that verification is no easier than generation plausible at all, which makes me more optimistic about something like a market of ideas for AI alignment working. I also think bureaucracies in general are way better than John Wentworth seems to imply.
This is also related to the experiment John did on whether markets reliably solve hard problems instead of Goodharting, focusing on the air conditioner test. My takeaway is that markets actually are sometimes good at optimizing things, and people just don't appreciate the economic/computational constraints behind why something is the way it is.
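To gesture at why I find the verification/generation asymmetry plausible, here is a minimal toy sketch of my own (not from John's posts; the subset-sum setup and the function names are just illustrative assumptions): brute-force generation of a solution scales exponentially, while checking a proposed solution stays cheap.

```python
# Toy illustration of "verification is easier than generation" using subset-sum.
# Not a claim about alignment research itself, just the underlying asymmetry.
from itertools import combinations

def generate_solution(nums, target):
    """Brute-force search over all subsets: O(2^n) in the worst case."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

def verify_solution(nums, target, candidate):
    """Check a proposed subset: cheap compared to the search above."""
    remaining = list(nums)
    for x in candidate:
        if x not in remaining:
            return False  # candidate uses a number not available in nums
        remaining.remove(x)
    return sum(candidate) == target

nums = [3, 34, 4, 12, 5, 2]
target = 9
solution = generate_solution(nums, target)                 # expensive: exhaustive search
print(solution, verify_solution(nums, target, solution))   # cheap: just re-check the sums
```

The analogy I'm drawing is that judging finished alignment research for flaws looks more like verify_solution than like generate_solution, though of course checking real research is much murkier than checking sums.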
Comments below:
https://www.lesswrong.com/posts/8wBN8cdNAv3c7vt6p/the-case-against-ai-control-research#FembwXfYSwnwxzWbC
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#maJBX3zAEtx5gFcBG
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP#g4N9Pdj8mQioRe43q
https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#3TFECJ3urX6wLre5n
The posts I disagree with:
https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP
https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general
https://www.lesswrong.com/posts/MMAK6eeMCH3JGuqeZ/everything-i-need-to-know-about-takeoff-speeds-i-learned
https://www.lesswrong.com/posts/hsqKp56whpPEQns3Z/why-large-bureaucratic-organizations
(On the bureaucratic organizations point, I think what neatly explains bureaucracy is a combination of: needing very badly to avoid corruption/bad states, so simple, verifiable rules are best; the world giving us problems that are hard to solve but easy to verify; and humans needing to coordinate.)
So I’m much less worried about slop than John Wentworth is.
If you’re assuming that verification is easier than generation, you’re pretty much a non-player when it comes to alignment.
My new concept for “pivotal act that stops the world from getting to ASI, even though we get to AGI” is a soft-power act of better coordination. Get help from AGI to design and deploy decentralized governance tech that allows humanity (and AIs) to coordinate on escaping the trap of suicide-race.
Once you start getting involved with governance, you're going to need law enforcement and defense, which is an adversarial context and thus means the whole instrumental-goal niceness argument collapses.
I disagree. There are a huge number of conflicts of different scales (some kinds make it into writing, some just happen), both in history and now; I believe they span the human conflict space almost fully. Just aggregating this information could lead to very good advice on handling everything, which an AI could act upon if it needed to.