An Aphoristic Overview of Technical AI Alignment Proposals

Alignment Aphorisms

Most alignment overviews are too long, but what if we rewrote one as a series of aphorisms?

I like Epictetus’s confrontational style: abrasive, clarifying. See my fuller post for links and nuance.


I.

Some problems can be solved by being smarter.
Some problems can only be solved by having help.
Aligning something smarter than you is the second kind.

So many proposals collapse to:
use AI to help supervise AI.
It sounds too simple. It’s the only thing that scales.


II. On using AI to supervise AI

A teacher can grade a student slightly smarter than herself.
But not much smarter—unless she has tricks.

How does weak supervise strong?
Hard problems decompose.
Supervise each step; align the whole.
This is Iterated Amplification.
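A toy sketch of the decompose-and-delegate loop, under loud assumptions: every function below is a hypothetical stand-in, not a real API.

```python
# Toy sketch of Iterated Amplification: a weak overseer answers a hard
# question by decomposing it, delegating the pieces, then combining.
# All functions here are illustrative stand-ins.

def weak_answer(question: str) -> str:
    """A weak model that can only handle atomic questions."""
    return f"answer({question})"

def decompose(question: str) -> list:
    """Split a hard question into easier subquestions (stand-in logic)."""
    return [f"{question}/part{i}" for i in range(2)]

def amplified_answer(question: str, depth: int = 2) -> str:
    """The amplified overseer: decompose, recurse, combine.
    Supervising each small step is what aligns the whole tree."""
    if depth == 0:
        return weak_answer(question)
    subanswers = [amplified_answer(q, depth - 1) for q in decompose(question)]
    return weak_answer(f"combine {question} from {subanswers}")

print(amplified_answer("hard question"))
```

The recursion is the point: no single call is harder than what the weak overseer can check.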

If two models argue, a weaker one can judge:
judging is easier than creating.
This is Debate.
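A minimal sketch of the asymmetry Debate relies on, with both "models" as stand-ins; a real judge would weigh the arguments rather than pick at random.

```python
import random

# Toy sketch of Debate: two strong debaters argue opposite sides of a
# claim; a weaker judge only has to pick the more convincing argument.
# Both "models" here are illustrative stand-ins.

def debater(claim: str, stance: bool) -> str:
    """A strong model arguing one side of the claim (stand-in)."""
    side = "for" if stance else "against"
    return f"argument {side} {claim!r}"

def weak_judge(arg_for: str, arg_against: str) -> str:
    """A weak judge: comparing arguments is easier than creating them.
    This stand-in chooses at random; a real judge compares merits."""
    return random.choice([arg_for, arg_against])

claim = "the plan is safe"
verdict = weak_judge(debater(claim, True), debater(claim, False))
print(verdict)
```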

Give the model principles and let it judge its own actions:
principles generalize where labels cannot.
This is Constitutional AI.
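The critique-and-revise loop can be sketched like this; the checks and rewrites are keyword stand-ins, not real model calls.

```python
# Toy sketch of Constitutional AI's loop: check an output against
# written principles, and revise it where it violates one.
# The critique and revision below are keyword stand-ins.

PRINCIPLES = ["be honest", "be harmless"]

def violates(text: str, principle: str) -> bool:
    """Stand-in critique: flag a canned unsafe marker."""
    return principle == "be harmless" and "harmful" in text

def revise(text: str, principle: str) -> str:
    """Stand-in revision guided by the violated principle."""
    return text.replace("harmful", "safe")

def constitutional_pass(text: str) -> str:
    """One pass over the constitution: critique, then revise."""
    for principle in PRINCIPLES:
        if violates(text, principle):
            text = revise(text, principle)
    return text

print(constitutional_pass("a harmful plan"))  # -> a safe plan
```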

Strong models already know what’s good.
Weak teaching can unlock it.
This is Weak-to-Strong Generalization.


III. On not building one big AI

A corporation can be smarter than any employee.
Yet no employee optimizes the world.

Many bounded tools, no unified goal.
No unified agent; no unified misalignment.
This is CAIS.


IV. On making the objective less wrong

Make the AI uncertain about what we want.
Then it must ask.
Uncertainty makes cooperation optimal.
This is CIRL.
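A worked toy number for why uncertainty makes asking optimal; the beliefs, rewards, and asking cost below are made up for illustration.

```python
# Toy sketch of CIRL-style reward uncertainty: the agent holds a
# belief over which objective the human has; when the belief is
# split, asking (at a small cost) beats acting blind.

belief = {"clean": 0.5, "cook": 0.5}       # P(human's true objective)
reward = {                                  # reward of action under objective
    "clean": {"clean": 1.0, "cook": -1.0},
    "cook":  {"clean": -1.0, "cook": 1.0},
}
ASK_COST = 0.1

def ev_act(action: str) -> float:
    """Expected value of acting without asking."""
    return sum(belief[h] * reward[action][h] for h in belief)

def ev_ask() -> float:
    """Ask first, learn the objective, then act optimally."""
    return sum(belief[h] * max(reward[a][h] for a in reward)
               for h in belief) - ASK_COST

best_blind = max(ev_act(a) for a in reward)
print(best_blind, ev_ask())  # asking wins under uncertainty
```

With a 50/50 belief, acting blind has expected value 0; asking costs 0.1 but yields 0.9.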

If it can’t be satisfied, it won’t stop.
Give it a notion of “enough”, and it can rest.
A satisficer has no incentive to rewrite itself.
This is satisficing.
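The stopping rule in miniature; the candidate scores are made up, and only the rule matters.

```python
# Toy satisficer: accept the first option that clears a "good enough"
# bar instead of optimizing for the best. Scores are illustrative.

def satisfice(candidates, score, threshold=0.8):
    """Return the first candidate that clears the threshold; no
    incentive to keep searching once satisfied."""
    for candidate in candidates:
        if score(candidate) >= threshold:
            return candidate
    return None  # nothing met the bar

plans = ["plan_a", "plan_b", "plan_c"]
scores = {"plan_a": 0.5, "plan_b": 0.85, "plan_c": 0.99}

chosen = satisfice(plans, scores.get)
print(chosen)  # -> plan_b, even though plan_c scores higher
```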


V. On control tools

Align them to want what we want.
Control will catch them when they don’t.
Both are needed. The weaker model needs eyes inside.

If the model lies, look past the output.
A weaker model can supervise a stronger one—if it can read its mind.
This is ELK.

Like us, models say one thing but value another.
Prompting shapes words. Steering shapes thoughts.
This is steering.
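The difference from prompting, as a sketch: add a concept direction to a hidden state rather than words to the input. The "model" is a stand-in linear readout and the "honesty" vector is hypothetical; real steering edits transformer activations.

```python
# Toy sketch of activation steering: add a concept direction to a
# hidden state instead of changing the prompt. The "model" below is
# a stand-in linear readout, not a transformer.

def readout(hidden, weights):
    """Stand-in for the model's downstream computation."""
    return sum(h * w for h, w in zip(hidden, weights))

hidden = [0.2, -0.1, 0.4]      # activations from some prompt
direction = [0.0, 1.0, 0.0]    # hypothetical "honesty" vector
alpha = 2.0                    # steering strength

steered = [h + alpha * d for h, d in zip(hidden, direction)]
weights = [1.0, 1.0, 1.0]
print(readout(hidden, weights), readout(steered, weights))
```

The prompt is untouched; the computation changes anyway.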

When models outsmart their supervisors,
their words can deceive.
See the reasons, not the results. Even the hidden thoughts.
This is interpretability.


VI. On understanding what we build

Tricks that work now might break at scale.
Understanding doesn’t.

What is an agent? What is optimization?
If you can’t define it, you can’t align it.
This is agent foundations.


VII. On older proposals

Keep it in a box. (AI boxing)
But boxes leak.

Let it answer, not act. (Oracle AI)
But answers shape action.

Do what we’d want if we were wiser. (CEV)
But wiser toward what?


VIII. Beyond alignment (bonus)

AIs are white boxes. Brains are not.
Values are simple. They’re learned early.
This is the optimist case.


IX. On who aligns the aligners (bonus)

Suppose we solve alignment perfectly.
Aligned to whom?

A safe AI in the wrong hands is still a problem.
This is governance risk.


Rewritten from my original draft with Claude. Compressed from Shallow review of live agendas and others.