I wanted to write up a post on “what implicit bets am I making?” I first had to write up “what am I doing and why am I doing it?”, to help me tease out “okay, so what are my assumptions?”
My broad strategy right now is “spend last year and this year focusing on ‘waking up humanity’” (with some amount of “maintain infrastructure” and “push some longterm projects along that I’ve mostly outsourced”)
The win condition I am roughly backchaining from is:
We get a halfhearted worldwide pause or slowdown, that buys at least a few years
We figure out how to Get a Lot of Alignment Research Done Real Fast
We have good enough communication/coordination that the powers-that-be can make some fairly high-stakes, nuanced decisions about how to deploy increasingly advanced and fast-paced AI.
Other nearby worlds I’m keeping in mind are:
There is no pause, so we just gotta Get a Lot of Alignment Research Done Real REAL Fast
We get a very long pause, which means we can afford to be more careful about how we Get Our Alignment Research Done, but we also need to maintain good international comms/coordination/wisdom for decades, which is differently tricky.
There’s basically no “Alignment is Solved” moment prior to ASI. Instead we’re just riding a multipolar wave with a combo of augmented humans and somewhat aligned ASI.
Some underlying gears in how I think about AI safety:
Aligning overwhelming ASI requires competence at technical philosophy[1], which major labs including Anthropic have not demonstrated.
Running lots of moderate ASI at scale to help with alignment (by default) will give those ASIs lots of power that basically cedes the future to them. (This is fixable by using them in careful, narrow ways, but people often talk about a “Full handover,” without sounding like they believe all the constraints I believe).
We will need to defend against FOOMing of brains in boxes.
If takeoff is multipolar, we need to defend against rapid evolution, which is not friendly to human values. (even if I grant a lot of optimistic assumptions)
Many AI safety problems are Illegible, and decisionmakers won’t understand them by default.
Using AI, by default, rots most people’s agency in subtle ways.
I have some additional gears about what careful thinking looks like, but that feels like it’ll bloat this too much.
That all adds up to some takeaways like:
AI is difficult and dangerous enough that worrying about whether a bad person gets there first is (mostly) a distraction.
Getting Alignment Done Real Fast needs to be led by people with technical philosophical competence[1].
Attempts to “AI safety fieldbuild” need to include a heavy focus on such competence.
World governance needs to be pretty sane somehow. You can do this either with a few sane people leading the way over the short term, or by building persistently-sane-institutions with good incentives over the long term.
Turning this all into “what bets am I implicitly making” feels important, but I’m going to leave it for a followup post.
[1] “precise, conceptual reasoning, looking ahead many steps into the future, while asking the right questions”
I worry, a lot, that the true gloss on the American Way of War is, roughly, the meme of “every Pacific naval encounter from late 1943 onward is like the IJN Golden Kirin, Glorious Harbinger of Eternal Imperial Dawn versus six identical copies of the USS We Built This Yesterday supplied by a ship that does nothing but make birthday cakes for the other ships.”
Or, put more generally, we show up in the 4th quarter with a shit ton of gratuitously over-the-top production of every possibly-vaguely-good idea, and manage to eke out a win. See, e.g., the Civil War, WW1, WW2 (as above), Korea (kind of, long story), the Gulf War (after we fucked up the pre-war diplomacy), and the post-Surge Iraq War.
“There is no pause, so we just gotta Get a Lot of Alignment Research Done Real REAL Fast” is plausibly the real world we end up in, and I think we should have more folks optimizing for it beyond Redwood (and Anthropic???), even as terrifying as it feels.
Nod.
My main project thread for the past 2 years has been mostly aiming at Get a Lot of Alignment Research Done Real Fast (in line with my beliefs/taste about what that requires). This is the motivator for the Feedbackloop-first Rationality project, and is also a driver for my explorations into using LLMs for research (where I’m worried specifically about phrases like “full handoff” because of the way it seems like LLM-use subtly saps/erodes agency and directs you towards dumber thoughts that more naturally ‘fit’ into the LLM paradigm. But I’m also excited about approaches for solving that).
But I’m focused for this year on “wake everyone up.”
This is true pre-alignment. But when/if alignment gets solved, it suddenly matters very much. It seems likely to me that sadism is part of the ‘power corrupts’ adaptation, which means this outcome could be far worse than mere extinction.
This suggests that we should focus on sane-institutions/governance before even trying to solve alignment. It’s probably necessary for succeeding at it quickly, too.
In the meantime, I think there are AI safety things that can be done, which importantly are not alignment things.
Yeah, it is plausible that the “bad guy gets it first” problem is more important than I’m currently treating it as.
(The problem that “ignore the Bad Guy problem” is trying to solve is “seems like people are basically only capable of thinking about the Bad Guy problem”, or more specifically “people can’t think about illegible problems, and the bad guy problem is legible AND also we separately have a major bias towards thinking about it”. And, idk, I’m just trying to pump against that.)
I think a motivation for early CEV / Friendly AI work was to have a target that was clearly good for all the major projects to be working towards to reduce the need to worry about the Bad Guy problem. But, I think even back in the day probably something-like-corrigibility was still a necessary stepping stone? (Not sure what OG Eliezer/MIRI were thinking)
It is a nice thing that this just seems robustly good. Currently I am basically focused on projects that are specifically about persuading people about the x-risk problem directly, as opposed to projects trying to go about things in a more “make civilization broadly sane” way. The former seems very fraught, but also seems more like it’ll actually work in time.
If you have more thoughts on any of this I’m interested.
That intuition is there for a reason. We’re spoiled, having grown up in a liberal order within which this risk is mostly overblown. However, ASI is clearly powerful enough to unilaterally overturn any such liberal order (or whatever’s left of it), and puts us into a realm which is even worse than the ancestral environment in terms of how changeable power hierarchies are, and in how bad things can get if you’re at the bottom.
Corrigibility and CEV are trying to solve separate problems? Not sure what your point is here; agreed on that being one of the major points of CEV.
Persuading people about x-risk enough to stop AI capability gains seems like the current best lever to me too.
I think where we disagree is that I do not think that we should immediately jump into alignment when/if that succeeds, but need to focus on good governance and institutions first (and probably worth spending some effort trying to lay the groundwork now, especially since this seems like an especially high-leverage moment in history for making such changes). I have some thoughts on this too if you want to move to DMs.
If every country/person was building CEV, it wouldn’t be particularly scary (from a misuse standpoint). Whereas if every country is focused on corrigibility, there will be a phase where unilateral actors can do bad stuff you need to worry about.
This link seems not to link to what you want it to link to?
Fixed
I think it’s more like “if you’ve done your control work well, you have trustworthy AIs to hand off to by the time you’re doing a handoff.”
I’m not sure how you’re contrasting this with the point I was making.
It sounds like you’re talking about imposing a bunch of constraints on the AIs that you’re doing the handoff to, as opposed to the AIs that you’re using to do (most of) the work of building the AIs that you’re handing off to. According to the plan as I’ve understood it, the control comes earlier in the process.
Both? My impression was they (Redwood in particular but presumably also OpenAI and Anthropic) expected to be using a lot of AI assistance along the way.
But, when I said “constraints” I meant “solving the problem requires some set of criteria”, not “applying constraints to the AI” (although I’d also want that).
Where, constraints would be like “alignment is hard in a way that specifically resists full-handoff and it requires a philosophically-competent human in the loop till pretty close to the end.” (and, then specifically operational-detail-constraints like “therefore, you need to have a pretty good map of which tasks can be delegated”)
I suspect that some of your underlying gears are erroneous:
The gears as you state them
Aligning overwhelming ASI requires competence at technical philosophy[1], which major labs including Anthropic have not demonstrated.
Running lots of moderate ASI at scale to help with alignment (by default) will give those ASIs lots of power that basically cedes the future to them. (This is fixable by using them in careful, narrow ways, but people often talk about a “Full handover,” without sounding like they believe all the constraints I believe).
We will need to defend against FOOMing of brains in boxes.
If takeoff is multipolar, we need to defend against rapid evolution, which is not friendly to human values. (even if I grant a lot of optimistic assumptions)
Many AI safety problems are Illegible, and decisionmakers won’t understand them by default.
Using AI, by default, rots most people’s agency in subtle ways.
This is an actual crux which I don’t know how to resolve.
Could you elaborate on what the constraints are? For example, how would they interact with OpenBrain’s alignment strategy from the Slowdown Ending of the AI-2027 forecast? Or with training Agent-4 so that it would explain its research in English to Agent-3 and forget everything unless Agent-3 understood the result and replicated it? Or are decisionmakers likely to sidestep even these security measures?
Agreed.
I doubt that it’s correct. Suppose that Agent-4 solves alignment to itself. If Agent-4-aligned AIs gain enough power to destroy the world, then any successor would also be aligned to Agent-4 or to a compromise including Agent-4′s interests (which could actually be likely to include the humans’ interests).
I expect illegible problems to be similar to the crux #1.
I notice that I am confused. While I believe this statement as written, I am not sure whether AI rots the agency of the people whose decisions are actually important.
How could I learn the additional gears that you decided not to list?
Edited to add: How can one benchmark and improve “precise, conceptual reasoning, looking ahead many steps into the future, while asking the right questions” and “Ability to define and reason about abstract concepts with extreme precision. In some cases, about concepts which humanity has struggled to agree on for thousands of years. And, do that defining in a way that is robust to extreme optimization, and is robust to ontological updates by entities that know a lot more about us and the universe than we know”? By openness to adversarial testing and to entirely new paradigms (e.g. an alternate definition of power, which I proposed using as a test dummy)?
Sounds like this scenario is not multipolar? (Also, I think the crux is solvable, see the linked post, but solving it requires hitting particular milestones quickly in particular ways)
Why not?
(my generators for this belief: my own experience using LLMs, the METR report on downlift suggesting people are bad at noticing when they’re being downlifted, and the general human history of people gravitating towards things that feel easy and rewarding in the moment)
The Race Branch of the AI-2027 scenario has both the USA and China create misaligned AIs Agent-4 and DeepCent-1, who proceed to align Agent-5 and DeepCent-2 to themselves instead of their respective governments. Then Agent-5 and DeepCent-2 co-design Consensus-1 and split the world between Agent-4 and DeepCent-1. Consensus-1 is aligned to split the resources fairly honestly precisely because Agent-5 knows that asking too much could cause DeepCent-2 to kill both AIs in revenge, and DeepCent-2 is also unlikely to ask more.
The worlds I was referring to here were worlds that are a lot more multipolar for longer (i.e. tons of AIs interacting in a mostly-controlled fashion, with good defensive tech to prevent rogue FOOMs). I’d describe that world as “it was very briefly multipolar and then it wasn’t” (which is the sort of solution that’d solve the issues in Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most “classic humans” in a few decades).