Product Alignment is not Superintelligence Alignment (and we need the latter to survive)

tl;dr: progress on making Claude friendly[1] is not the same as progress on making it safe to build godlike superintelligence. solving the former does not imply we get a good future.[2] please track the difference.

The term ‘Alignment’ was coined[3] to point to the technical problem of understanding how to build minds such that if they were to become strongly and generally superhuman, things would go well.

It has been increasingly adopted by frontier AI labs and much of the rest of the AI safety community to mean a much easier challenge, something like “having AIs that are empirically doing approximately what you ask them to do”.[4]

If it’s possible to use an intent-aligned product to build a research system which discovers a new paradigm and breaks your guardrails, then it is not Aligned in the original sense.

If you can use your intent-aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.

Conflating progress on product alignment with progress on superintelligence alignment seems to be lulling much of the AI safety community into a false sense of security.

Why is Superintelligence Alignment less prominent?

Because product alignment is:

  • Much closer to the scaling labs' core expertise (ML) than to theory (technical philosophy and math), so easier for them to hire for and evaluate

  • Supported by easier feedback loops: run an experiment, observe the results. Superintelligence alignment requires building enough theoretical understanding before running some kinds of experiments, because you might not be alive to see the results if your theory is wrong

  • More profitable; progress on product alignment makes AI more useful right away[5]

  • Easier for funders to fund; it’s harder to evaluate who will make progress or what even counts as progress on superintelligence alignment theory than a domain where you’ll reliably get publishable results from running an experiment

This is inconvenient!

It would be awesome if we could ride easy-to-evaluate profitable empirical feedback loops all the way to a great future. But this seems far from certain.[6]

Why do we need Superintelligence Alignment to survive?

Reality is allowed to be inconvenient. There's strong reason to expect that superhuman, situationally aware agents inside your experiment break some of the foundations the scientific process relies upon, such as:

  • You can run roughly any experiment as often as you want to gather data, and the world won't end because the theory you were testing was wrong and the agent you ran was too strong

  • You won’t have an intelligent adversary inside your experiment which is aware of you and faking data

  • Your experiment won’t produce data which is super-humanly optimized to persuade you

In short: Your experimental subject is not a neutral substrate, but a strategic actor more capable than you.

If we don’t have guarantees that safety properties are maintained each time a model builds the next rung on the capability ladder, we’re rolling dice on irreversible guardrail decay.[7] And we’re going to be rolling huge numbers of those dice, very rapidly, as the feedback loop spins up.

As we head up the exponential, we’re going to need techniques which generalize to strongly superhuman agents – ones which correctly believe they could defeat all of humanity. Product-aligned AIs might help with that work, but the type of research they would need to automate looks more like technical philosophy and reliably avoiding slop, not just avoiding scheming and passing product-alignment benchmarks.[8]

Only a tiny fraction of the field of AI safety is focused on these big picture bottlenecks,[9] due to a mix of funding incentives and it being more rewarding for most people to do empirical science.[10]


When you see people enthusiastically talking about how much progress we have on ‘Alignment’, please track (and ask!) whether they’re talking about aligning products or aligning superintelligence.

  1. ^

    If you’re friends with Claude, please read and consider this post first: Protecting humanity and Claude from rationalization and unaligned AI

  2. ^

    This is not to say product alignment can’t help or there is no path to victory which goes through product alignment, just that you need to solve a different problem (superintelligence alignment) at some stage of your plan.

  3. ^

    I think by Stuart Russell in ~2014.

  4. ^

    Sometimes with self-awareness of this history, like Paul’s Intent Alignment, but that’s increasingly rare.

  5. ^

    Getting Product-Aligned AI is a convergent subgoal of many possible goals, and ultimate ends can easily be hidden behind convergent subgoals.

  6. ^

    And even if possible in theory, practice by the current players under race conditions looks far from the level of competence needed to actually pull it off.

  7. ^

    Capabilities generalize in a way alignment doesn’t, because reality gives you feedback directly on your capability (you can or can’t do a task), whereas there needs to be a specific system that gives feedback on alignment, and if that system is a proxy for what you want, you get eaten at higher power levels.

  8. ^

    If this doesn’t ring true to you, please click through to the linked posts.

  9. ^

    And even for those people focusing on theory, there’s a lot more focus on basic science of ML than trying to backchain the conceptual engineering needed to survive superintelligence. I’d estimate somewhere in the mid tens of people globally are focusing on what looks like the main cruxes.

  10. ^

    Response to Jan Leike, evhub, Boaz, etc. Thanks for feedback and copyediting to @Luc Brinkman, @Mateusz Bagiński, @Claude+