I’ve updated the post with epistemic statuses:
AU theory describes how people feel impacted. I’m darn confident (95%) that this is true.
Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking. (A sketch of the formal setup appears after this list.)
CCC is true. Fairly confident (70%). There seems to be a dichotomy between “catastrophe directly incentivized by goal” and “catastrophe indirectly incentivized by goal through power-seeking”, although Vika provides intuitions in the other direction.
AUP_conceptual prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).
Some version of AUP solves side effect problems for an extremely wide class of real-world tasks, for subhuman agents. Leaning towards yes (65%).
For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%). (The two penalty flavors are sketched after this list.)
There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).
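For concreteness on the power-seeking claim above: the theorems are phrased in terms of an optimal-value notion of POWER, and the farsightedness caveat refers to the discount-rate limit. Here is a minimal sketch of that setup; the averaging distribution $\mathcal{D}$ over reward functions and the exact normalization are my assumptions rather than anything stated in this post.

```latex
% POWER of a state s: roughly, how well an optimal agent can expect to do
% from s, averaged over a distribution D of reward functions.
\mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
        \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]

% The theorems concern optimal policies, often in the farsighted limit:
\mathrm{POWER}_{\mathcal{D}}(s, 1)
  \;=\; \lim_{\gamma \to 1} \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
```

Real agents are neither optimal nor fully farsighted, which is why the formal results transfer only as evidence, not as guarantees.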
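Likewise, for the "own AU vs. other AUs" claim: a rough sketch of the two penalty flavors, assuming the usual AUP ingredients (a no-op action $\varnothing$, auxiliary reward functions $R_i$, and a scaling term). The exact forms below are my reconstruction, not quotes from the sequence.

```latex
% Flavor 1: penalize shifts in other (auxiliary) AUs.
R_{\text{AUP}}(s,a) \;=\; R(s,a)
  \;-\; \frac{\lambda}{\text{scale}} \sum_{i=1}^{n}
        \bigl|\, Q^{*}_{R_i}(s,a) - Q^{*}_{R_i}(s,\varnothing) \,\bigr|

% Flavor 2: penalize the agent for increasing its own AU (a power proxy),
% the version this post leans towards for the superhuman case.
R_{\text{AUP}}(s,a) \;=\; R(s,a)
  \;-\; \lambda \,\max\!\bigl( Q^{*}_{R}(s,a) - Q^{*}_{R}(s,\varnothing),\; 0 \bigr)
```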