Winding My Way Through Alignment

A sequence of my alignment distillations, written up as I work my way through understanding AI alignment theory.

My rough guiding research algorithm is to focus on the biggest hazard in my current model of alignment, try to understand and explain that hazard and the proposed solutions to it, and then recurse.

This now-finished sequence is representative of my developing alignment model before I became substantially informationally entangled, in-person, with the Berkeley alignment community. It’s what I was able to glean from just reading a lot online.

HCH and Adversarial Questions

Agency and Coherence

Deceptive Agents are a Good Way to Do Things

But What’s Your *New Alignment Insight,* out of a Future-Textbook Paragraph?

Gato as the Dawn of Early AGI

Intelligence in Commitment Races

How Deadly Will Roughly-Human-Level AGI Be?