Speaking of claims made in 2019 review posts: Conclusion to ‘Reframing Impact’ (the final post of my nominated Reframing Impact sequence) contains the following claims and credences:

AU theory describes how people feel impacted. Darn confident (95%).
Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The power-seeking theorems only apply to optimal policies in fully observable environments, an assumption that doesn’t hold for real-world agents. However, I think they’re still informative, and there are also strong intuitive arguments for power-seeking.
AUP_conceptual prevents catastrophe, assuming the catastrophic convergence conjecture. Very confident (85%).
Some version of AUP solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents. Leaning towards yes (65%).
For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs (see the sketch after this list). Leaning towards yes (65%).
There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).
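To make the own-AU vs. other-AU distinction in the superhuman claim concrete, here is a simplified sketch of the two penalty shapes, not the exact formulas from the posts. Here $Q_{R_i}$ are action-values for auxiliary reward functions, $Q_R$ is the agent’s action-value for its own reward, $\varnothing$ is a no-op action, and $\lambda$ is an assumed penalty coefficient:

$$\text{penalize other AUs:}\quad R(s,a) \;-\; \frac{\lambda}{N}\sum_{i=1}^{N}\bigl|\,Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)\,\bigr|$$

$$\text{penalize own-AU increase:}\quad R(s,a) \;-\; \lambda\,\max\bigl(0,\; Q_{R}(s,a) - Q_{R}(s,\varnothing)\bigr)$$

The first form charges the agent for changing how well it could pursue auxiliary goals relative to doing nothing; the second charges it only for gains in its own ability to achieve its goal, which is the version the superhuman claim favors.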
Ey, awesome! I’ve updated the post to include them.