A one-sentence guide to technical AI alignment ideas
Epistemic status: excessive lossy compression applied
LessWrong has some great technical and critical overviews of alignment agendas, but for many readers they take too long to read. This is my attempt at cartoonish compression.
The shape that keeps recurring
A lot of alignment proposals boil down to: use AI to help supervise AI.
This might be the only thing that scales. It’s worth noticing how often the pattern appears.
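To make the shape concrete, here is a minimal sketch of the pattern's outer loop: a capable model generates, and a (possibly weaker, AI-assisted) supervisor gates what gets released. Everything here (`strong_model`, `weak_supervisor`, `supervised_answer`) is a hypothetical stand-in, not any specific agenda's API; proposals like debate, critiques, and recursive reward modelling differ mainly in how the supervisor's judgment is produced.

```python
# Minimal sketch of "use AI to supervise AI".
# All functions are hypothetical stand-ins, not a real proposal's implementation.

def strong_model(task: str) -> str:
    """Stand-in for the capable model we want to oversee."""
    return f"answer to: {task}"

def weak_supervisor(task: str, answer: str) -> float:
    """Stand-in for an AI-assisted judge; returns an approval score in [0, 1].
    In real agendas this step is itself elaborated with more AI assistance."""
    return 0.9 if task in answer else 0.1

def supervised_answer(task: str, threshold: float = 0.5) -> str | None:
    """Only release answers the supervisor approves of."""
    answer = strong_model(task)
    score = weak_supervisor(task, answer)
    return answer if score >= threshold else None

if __name__ == "__main__":
    print(supervised_answer("explain the recurring shape"))
```

Under this framing, most of the agendas below are attempts to make the supervisor step trustworthy rather than to change the outer loop.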
Use AI to supervise AI
Don’t build one big AI
Make the objective less wrong
Build control tools
Understand what we’re building
Older ideas (still discussed, less active)
These haven’t been abandoned because they’re bad ideas; it’s more that they don’t obviously solve the core problem: how do you verify alignment in systems smarter than you?
If you want depth
I’ve left out the many debates over the proposals. You need to dig deeper to judge which methods will work:
2025 - AI in 2025 Gestalt
2023 - Shallow review of live agendas in alignment & safety — I drew heavily from this
2023 - A Brief Overview of AI Safety/Alignment Orgs, Fields, Researchers
2023 - The Genie in the Bottle: An Introduction to AI Alignment and Risk
2022 - (My understanding of) What Everyone in Technical Alignment is Doing and Why
2022 - On how various plans miss the hard bits of the alignment challenge
2022 - A newcomer’s guide to the technical AI safety field
If anyone finds this useful, please let me know. I’ve abandoned it because no one in my test audience found it interesting or useful. That’s OK; it just means it’s better to focus on other things.
In particular, I’d be keen to know what @Stag and @technicalities think, as this was in large part inspired by the desire to further simplify and categorise the “one sentence summaries” from their excellent Shallow review of live agendas in alignment & safety.