Alignment Stream of Thought

This sequence contains posts that are lower effort than my usual posts: instead of thinking things all the way through before posting something polished about them, I post things that are rough and in-progress as I think about them. I’m trying this because I noticed that I had lots of interesting thoughts I didn’t want to share because I hadn’t totally figured them out yet, and that the process of writing things down and posting them often helps me make progress.

Anything in this sequence is at even greater risk than usual of being obsoleted or unendorsed down the road. It will also be more difficult to follow than usual, because I’m not putting as much effort into explaining background.

I’m hoping to eventually distill the important insights of this sequence into more legible posts once I’m less confused.

[ASoT] Observations about ELK

[ASoT] Some ways ELK could still be solvable in practice

[ASoT] Searching for consequentialist structure

[ASoT] Some thoughts about deceptive mesaoptimization

[ASoT] Some thoughts about LM monologue limitations and ELK

[ASoT] Some thoughts about imperfect world modeling

[ASoT] Consequentialist models as a superset of mesaoptimizers

Humans Reflecting on HRH

Towards deconfusing wireheading and reward maximization