I have new inferences about why I didn’t realize that much AI alignment thinking seemed doomed:
After I became more optimistic about alignment due to having a sharper understanding of the overall problem and of how human values formed to begin with, I also became more pessimistic about other approaches, like IDA/ELK/RRM/AUP/[anything else with a three-letter acronym]. But my new understanding didn’t seem to present any specific objections to those approaches. So why did I suddenly feel worse about these older ideas?
I suspect that part of the explanation is: I hadn’t wanted to admit how confused I was about alignment, and I (implicitly) clung to “but it could work”-style hopefulness. But once I had a different reason to hope, one resting on a more solid and mechanistic understanding, it apparently became emotionally safe for me to admit I didn’t have much hope at all for the older approaches.
Yikes.
If that’s what happened, I was seriously deluding myself. I will do better next time.
I think I was, in fact, deluding myself.
But also I think that, in better understanding the alignment problem, I implicitly realized the inappropriateness of much outer/inner alignment reasoning. Sentences which used to seem short and descriptive (e.g. “get the AI to care about a robust outer objective”) became long and confused in my new ontology. The outer/inner frame, unfortunately, underlies quite a lot of AI alignment thinking.
I am sad and concerned that I haven’t been able to deeply communicate this insight to many existing researchers, but I’m going to keep trying.