Planned summary for the Alignment Newsletter:

This sequence takes on the perspective of an AI system that is well-intentioned but lacks information about what humans want. The hope is to identify what good AI reasoning might look like, and to use this to derive insights for safety. The sequence considers Goodhart problems, adversarial examples, distribution shift, subagent problems, and more.
Planned opinion:
I liked this sequence. Often when presented with a potential problem in AI safety, I ask myself why the problem doesn’t also apply to humans, and how humans have managed to solve it. This sequence primarily consists of this sort of reasoning, and I think it did a good job of highlighting that, with sufficient conservatism, many of these problems plausibly aren’t that bad if the AI is well-intentioned, even if it has very little information, finds it hard to communicate with humans, or has the wrong abstractions.
Sounds good, cheers!