I’ve updated the post with epistemic statuses:
AU theory describes how people feel impacted. I’m darn confident (95%) that this is true.
Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking. (A sketch of the formal setup appears after this list.)
CCC is true. Fairly confident (70%). There seems to be a dichotomy between “catastrophe directly incentivized by goal” and “catastrophe indirectly incentivized by goal through power-seeking”, although Vika provides intuitions in the other direction.
AUP_conceptual prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).
Some version of AUP solves side effect problems for an extremely wide class of real-world tasks, for subhuman agents. Leaning towards yes (65%).
For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%). (The two penalty flavors are sketched after this list.)
There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).
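For concreteness on the power-seeking claim above: the theorems are phrased in terms of an optimal-value notion of POWER, and the farsightedness caveat refers to the discount-rate limit. Here is a minimal sketch of that setup; the averaging distribution $\mathcal{D}$ over reward functions and the exact normalization are my assumptions rather than anything stated in this post.

```latex
% POWER of a state s: roughly, how well an optimal agent can expect to do
% from s, averaged over a distribution D of reward functions.
\mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
        \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]

% The theorems concern optimal policies, often in the farsighted limit:
\mathrm{POWER}_{\mathcal{D}}(s, 1)
  \;=\; \lim_{\gamma \to 1} \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
```

Real agents are neither optimal nor fully farsighted, which is why the formal results transfer only as evidence, not as guarantees.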
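Likewise, for the "own AU vs. other AUs" claim: a rough sketch of the two penalty flavors, assuming the usual AUP ingredients (a no-op action $\varnothing$, auxiliary reward functions $R_i$, and a scaling term). The exact forms below are my reconstruction, not quotes from the sequence.

```latex
% Flavor 1: penalize shifts in other (auxiliary) AUs.
R_{\text{AUP}}(s,a) \;=\; R(s,a)
  \;-\; \frac{\lambda}{\text{scale}} \sum_{i=1}^{n}
        \bigl|\, Q^{*}_{R_i}(s,a) - Q^{*}_{R_i}(s,\varnothing) \,\bigr|

% Flavor 2: penalize the agent for increasing its own AU (a power proxy),
% the version this post leans towards for the superhuman case.
R_{\text{AUP}}(s,a) \;=\; R(s,a)
  \;-\; \lambda \,\max\!\bigl( Q^{*}_{R}(s,a) - Q^{*}_{R}(s,\varnothing),\; 0 \bigr)
```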