continue working on hard alignment! don’t give up!


let’s call “hard alignment” the (“orthodox”) problem, historically worked on by MIRI, of preventing strong agentic AIs from, by default, pursuing things we don’t care about and destroying everything of value to us along the way. let’s call “easy” alignment the set of perspectives on which some of this model is wrong (some of the assumptions are relaxed), such that saving the world is easier or more likely to happen by default.

what should one be working on? as always, the calculation consists of comparing

  • p(hard) × how much value we can get in hard worlds

  • p(easy) × how much value we can get in easy worlds

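to make that concrete, here’s a minimal sketch of the expected-value comparison (the notation and numbers are mine, purely illustrative, not taken from anything above):

$$
\mathbb{E}[\text{work on hard}] \approx p(\text{hard}) \cdot V_{\text{hard}}
\qquad\qquad
\mathbb{E}[\text{work on easy}] \approx p(\text{easy}) \cdot V_{\text{easy}}
$$

where V_hard is roughly how much value hard-compatible work buys us if alignment is in fact hard, and likewise for V_easy. for example, with p(hard) = 0.3, V_hard = 10, p(easy) = 0.7, V_easy = 3, hard-compatible work wins (0.3 × 10 = 3 against 0.7 × 3 = 2.1) even though hard is the less likely world.
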
given how AI capabilities are going, it’s not unreasonable for people to start playing to their outs: that is, to start acting as if alignment is easy, because if it isn’t we’re doomed anyway. but i think, in this particular case, this is wrong.

this is the lesson of dying with dignity and bracing for the alignment tunnel: we should cooperate with our counterfactual selves and keep trying to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.

to me, p(hard) is big enough, and my hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.

let’s not give up on the assumptions that are actually true. there is still work that can be done, and dignity to be generated, under those assumptions.