My impression is that the current Real Actual Alignment Plan For Real This Time amongst medium p(Doom) people looks something like this:
1. Advance AI control, evals, and monitoring as much as possible now
2. Try and catch an AI doing a maximally-incriminating thing at roughly human level
3. This causes [something something better governance to buy time]
4. Use the almost-world-ending AI to “automate alignment research”
(Ignoring the possibility of a pivotal act to shut down AI research. Most people I talk to don’t think this is reasonable.)
I’ll ignore the practicality of step 3. What do people expect step 4 to look like? What does an AI-assisted value alignment solution look like?
My rough guess of what it could be, i.e. the highest p(solution is this | AI gives us a real alignment solution), is something like the following. This tries to straddle the line between the helper AI being obviously powerful enough to kill us and being obviously too dumb to solve alignment:
1. Formalize the concept of “empowerment of an agent” as a property of causal networks, with the help of theorem-proving AI.
2. Modify today’s autoregressive reasoning models into something more isomorphic to a symbolic causal network. Use some sort of minimal-circuit system (mixture-of-depths?) and prove isomorphism between the reasoning traces and the external environment.
3. Identify “humans” in the symbolic world-model, using automated mech interp.
4. Target a search over outcomes towards the empowerment of humans, as defined in step 1 (a toy sketch of steps 1 and 4 follows below).
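For concreteness, here is a minimal sketch of what step 1 might bottom out in, using the existing information-theoretic formalization of empowerment as the channel capacity from an agent’s actions to its future states (Klyubin, Polani & Nehaniv). The toy channels and intervention names are made up for illustration, and a real automated alignment researcher might produce something quite different:

```python
import numpy as np

def empowerment(p_s_given_a: np.ndarray, iters: int = 200) -> float:
    """Channel capacity max_{p(a)} I(A; S'), in bits, for a row-stochastic
    matrix p_s_given_a[a, s'] = p(s' | a). Computed with Blahut-Arimoto."""
    n_actions = p_s_given_a.shape[0]
    p_a = np.full(n_actions, 1.0 / n_actions)       # start from a uniform action distribution
    for _ in range(iters):
        q_s = p_a @ p_s_given_a                     # marginal distribution over next states
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(p_s_given_a > 0.0,
                                 np.log2(p_s_given_a / q_s), 0.0)
        d = (p_s_given_a * log_ratio).sum(axis=1)   # KL(p(s'|a) || q(s')) per action, in bits
        p_a = p_a * np.exp2(d)                      # Blahut-Arimoto update
        p_a /= p_a.sum()
    q_s = p_a @ p_s_given_a
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(p_s_given_a > 0.0, np.log2(p_s_given_a / q_s), 0.0)
    return float((p_a[:, None] * p_s_given_a * log_ratio).sum())

# Toy version of step 4: among the helper system's candidate interventions,
# pick the one that leaves the identified human with the most empowerment over
# their own next state. The channels p(s' | human action) are hypothetical.
interventions = {
    "leave_door_open": np.array([[1.0, 0.0],        # the human's choice determines the outcome
                                 [0.0, 1.0]]),
    "lock_human_in":   np.array([[0.9, 0.1],        # the outcome barely depends on the human
                                 [0.9, 0.1]]),
}
for name, channel in interventions.items():
    print(f"{name}: human empowerment ≈ {empowerment(channel):.3f} bits")
best = max(interventions, key=lambda name: empowerment(interventions[name]))
print("chosen intervention:", best)                 # -> leave_door_open
```

The point of the toy intervention choice at the end is just that this notion of empowerment counts the distinguishable consequences actually available to the human, not their nominal options, which is the property step 4 would be steering towards.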
Is this what people are hoping plops out of an automated alignment researcher? I sometimes get the impression that most people have no idea whatsoever how the plan works, which means they’re imagining the alignment-AI to be essentially magic. The problem with this is that magic-level AI is definitely powerful enough to just kill everyone.
Is your question more about “what’s the actual structure of the ‘solve alignment’ part”, or “how are you supposed to use powerful AIs to help with this?”
I think there’s one structure-of-plan that is sort of like your outline (I think it is similar to John Wentworth’s plan, but sort of skipping ahead past some steps and being more-specific-about-the-final-solution, which means more wrong).
(I don’t think John self-identifies as particularly oriented around your “4 steps from AI control to automate alignment research”. I haven’t heard the people who say ‘let’s automate alignment research’ say anything that sounded very coherent. I think many people are thinking something like “what if we had a LOT of interpretability?” but IMO not really thinking through the next steps needed for that interpretability to be useful in the endgame.)
STEM AI → Pivotal Act
I haven’t heard anyone talk about this for a while, but a few years back I heard a cluster of plans that were something like “build STEM AI with very narrow ability to think, which you could be confident couldn’t model humans at all, which would only think about resources inside a 10′ by 10′ cube, and then use that to invent the pre-requisites for uploading or biological intelligence enhancement, and then ??? → very smart humans running at fast speeds figure out how to invent a pivotal technology.”
I don’t think the LLM-centric era lends itself well to this plan. But I could see a route where you train a less-robust-and-thus-necessarily-weaker STEM AI on a careful STEM corpus, run it under careful control, and ask it only carefully scoped questions, which could maybe let it be more powerful than you could get away with for more generically competent LLMs.
Yes, a human-uploading or human-enhancing pivotal act might actually be something people are thinking about. Yudkowsky gives his nanotech-GPU-melting pivotal act example, which—while he has stipulated that it’s not his real plan—still anchors “pivotal act” on “build the most advanced weapon system of all time and carry out a first-strike”. This is not something that governments (and especially companies) can or should really talk about as a plan, since threatening a first-strike on your geopolitical opponents does not a cooperative atmosphere make.
(though I suppose a series of targeted, conventional strikes on data centers and chip factories across the world might be on the Pareto frontier of “good” vs “likely” outcomes)
My question was an attempt to trigger a specific mental motion in a certain kind of individual. Specifically, I was hoping for someone who endorses that overall plan to envisage how it would work end-to-end, using their inner sim.
My example was basically what I get when I query my inner sim, conditional on that plan going well.