Product Alignment is not Superintelligence Alignment (and we need the latter to survive)
tl;dr: progress on making Claude friendly[1] is not the same as progress on making it safe to build godlike superintelligence. solving the former does not imply we get a good future.[2] please track the difference.
The term ‘Alignment’ was coined[3] to point to the technical problem of understanding how to build minds such that if they were to become strongly and generally superhuman, things would go well.
It has been increasingly adopted by frontier AI labs and much of the rest of the AI safety community to mean a much easier challenge, something like “having AIs that are empirically doing approximately what you ask them to do”.[4]
If it’s possible to use an intent-aligned product to build a research system which discovers a new paradigm and breaks your guardrails, then it is not Aligned in the original sense.
If you can use your intent-aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.
Conflating progress on product alignment with progress on superintelligence alignment seems to be lulling much of the AI safety community into a false sense of security.
Why is Superintelligence Alignment less prominent?
Because product alignment:
Is much closer to the scaling labs' core expertise (ML) than to theory (technical philosophy and math), so it is easier for them to hire for and evaluate
Has easier-to-use feedback loops: run an experiment, observe the results. Superintelligence alignment requires building enough theoretical understanding before running certain kinds of experiments, because if your theory is wrong you might not be alive to see the results.
Is more profitable; progress on product alignment makes AI more useful right away[5]
Is easier for funders to fund; it's harder to evaluate who will make progress, or what even counts as progress, on superintelligence alignment theory than in a domain where running an experiment reliably yields publishable results
This is inconvenient!
It would be awesome if we could ride easy-to-evaluate profitable empirical feedback loops all the way to a great future. But this seems far from certain.[6]
Why do we need Superintelligence Alignment to survive?
Reality is allowed to be inconvenient. There's strong reason to expect that having superhuman, situationally aware agents inside your experiments breaks some of the foundations the scientific process relies upon, such as:
You can run roughly any experiment as often as you want to gather data, and the world won't end because the theory you were testing was wrong and the agent you ran was too strong
You won't have an intelligent adversary inside your experiment which is aware of you and faking data
Your experiment won't produce data which is superhumanly optimized to persuade you
In short: Your experimental subject is not a neutral substrate, but a strategic actor more capable than you.
If we don’t have guarantees of maintaining safety properties each time a model builds the next rung on the capability ladder, we’re rolling dice on irreversible guardrail decay.[7] And we’re going to be rolling huge numbers of those dice, very rapidly, as the feedback loop spins up.
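The compounding risk above can be sketched numerically. The per-step failure rates below are made-up assumptions purely for illustration, not estimates from the post: if each rung of the capability ladder carries an independent chance of irreversible guardrail decay, survival probability falls geometrically with the number of rungs.

```python
def survival_probability(p_failure_per_step: float, n_steps: int) -> float:
    """Chance that no irreversible failure occurs across n_steps independent rolls."""
    return (1.0 - p_failure_per_step) ** n_steps

# Hypothetical numbers: even a seemingly small per-step risk compounds quickly
# once the feedback loop means rolling thousands of times.
for p in (0.001, 0.01):
    for n in (100, 1000):
        print(f"p={p}, steps={n}: survive with probability "
              f"{survival_probability(p, n):.4f}")
```

The point of the toy model is only the shape of the curve: any nonzero per-step risk, rolled often enough without a guarantee, drives the survival probability toward zero.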
As we’re headed up the exponential, we’re going to need techniques which generalize to strongly superhuman agents – ones which correctly believe they could defeat all of humanity. Product-aligned AIs might help with that work, but the type of research they would need to automate needs to look more like technical philosophy and reliably avoiding slop, not just avoiding scheming and passing product-alignment benchmarks.[8]
Only a tiny fraction of the field of AI safety is focused on these big picture bottlenecks,[9] due to a mix of funding incentives and it being more rewarding for most people to do empirical science.[10]
When you see people enthusiastically talking about how much progress we have on ‘Alignment’, please track (and ask!) whether they’re talking about aligning products or aligning superintelligence.
- ^
If you’re friends with Claude, please read and consider this post first: Protecting humanity and Claude from rationalization and unaligned AI
- ^
This is not to say product alignment can’t help or there is no path to victory which goes through product alignment, just that you need to solve a different problem (superintelligence alignment) at some stage of your plan.
- ^
I think by Stuart Russell in ~2014.
- ^
Sometimes with self-awareness of this history, like Paul’s Intent Alignment, but that’s increasingly rare.
- ^
Getting Product-Aligned AI is a convergent subgoal of many possible goals, and ultimate ends may be easily hidden behind convergent subgoals.
- ^
And even if possible in theory, practice by the current players under race conditions looks far from the level of competence needed to actually pull it off.
- ^
Capabilities generalize in a way alignment doesn’t, because reality gives you feedback directly on your capability (you can or can’t do a task), whereas some specific system has to give feedback on alignment, and if that system is a proxy for what you want, you get eaten at higher power levels.
- ^
If this doesn’t ring true to you, please click through to the linked posts.
- ^
And even for those people focusing on theory, there’s a lot more focus on basic science of ML than trying to backchain the conceptual engineering needed to survive superintelligence. I’d estimate somewhere in the mid tens of people globally are focusing on what looks like the main cruxes.
- ^
Response to Jan Leike, evhub, Boaz, etc. Thanks for feedback and copyediting to @Luc Brinkman, @Mateusz Bagiński, @Claude+
I’m so tired of people needing to explain this. An important question for me: “Why didn’t people just read Yudkowsky and Bostrom and understand the threat model?” It seems like many people did, but many people don’t seem to get it.
I like the “aligning product VS aligning superintelligence” phrase.
I wouldn’t expect “a model” to be the right object to track as generalized capabilities compound towards superintelligence. The generalized objects I think it is correct to track are “outcome influencing systems” (OISs), most probably OISs hosted on the sociotechnical substrate: probably something like AI companies and/or coordinated clusters of personality self-replicators (PSRs), and whatever kind of OISs they develop into, which I expect will no longer feel right to call PSRs.
But otherwise I agree. There are many OISs in the environment with compounding capabilities and we basically don’t understand their preferences or development paths.
This is a nice phrase. I would like it if we had more focus on what the guardrails even are and how to build sensible ones… and on reversing the decay of guardrails which have decayed reversibly. It would probably be useful to have a map of which kinds of guardrail decay are truly irreversible under which scenarios.
If we got a global plague that crippled global trade sufficiently that we couldn’t maintain data centers anymore, that would probably rebuild many guardrails we thought were lost forever. Not that I want that. I want us to avoid dystopia. Avoiding dystopia with lesser dystopia isn’t really what I’m hoping for.
I tend to treat the core claim as being that “superintelligence alignment” has to work in domains where humans aren’t good supervisors. Being able to assume good human supervision lets you do a lot more engineering right now.
I disagree. I think you’ve set up a strawman for an alignment target which is unreasonable and which a generally intelligent model would never be able to satisfy.
> “If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.”
This seems incorrect. Actions are not good or bad in isolation. The same action can be good when analyzed from one perspective and bad when analyzed from another.
Suppose I have an aligned, generally intelligent model. It wants to do the Right Thing at all times, cares deeply about what I want it to care deeply about, etc. Now suppose someone puts this model into a box where it has no access to the outside world and they tell it they are trying to do AI safety research. They are trying to understand how to defend against a specific jailbreak. In order to do this, they need to produce the jailbreak so that they can study it. What is the model supposed to do? On your view, because it’s aligned, it should refuse completely at all times, no matter what?
More broadly, I am struggling to see what evidence you have for why current alignment frameworks (among other things) would fail to transfer to more capable models. Suppose we have a “product” AI model which is aligned to a constitution and that this product AI model is the start of RSI. Is it clear that the later models don’t abide by the constitution? What if each successive model holds those convictions deeper and deeper?
It may seem unreasonable within the current paradigm, but I think it’s necessary to reach if we get strong superintelligence. Having a system that you can’t make destroy the entire system is needed if you want the whole system to remain undestroyed indefinitely.
You’re right that I didn’t explain why each framework fails to plausibly scale to very strong models; maybe that’s worth its own post, because there are a lot of them and each has limits that you need to go a bit into the weeds to see.
I am struggling to see what evidence you have for why current alignment frameworks (among other things) would succeed in transferring to superintelligence.