My current understanding is that policy-gradient RL incentivizes reward-seeking agents to defect in prisoner’s dilemmas, counterfactual muggings, and Parfit’s hitchhikers. If there were some selection at the policy level (e.g., population-based training) rather than at the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude, even if we train them in sufficiently similar multi-agent environments.
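To make the action-level vs. policy-level contrast concrete, here is a minimal toy sketch (everything in it, including payoffs, hyperparameters, and the use of self-play against a copy as a crude stand-in for policy-level selection, is my own illustrative assumption, not anyone’s actual training setup). Two independent single-logit REINFORCE learners play a one-shot prisoner’s dilemma, where action-level credit assignment pushes both toward defection; separately, truncation selection over whole policies scored in self-play favors cooperation.

```python
# Toy contrast between action-level and policy-level selection in a
# one-shot prisoner's dilemma. All names, payoffs, and hyperparameters
# are illustrative assumptions.

import math
import random

# Standard PD payoffs, (row player, column player).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def p_coop(theta: float) -> float:
    """Cooperation probability of a one-parameter (single-logit) policy."""
    return 1.0 / (1.0 + math.exp(-theta))

def action_level_reinforce(steps: int = 20_000, lr: float = 0.05) -> list:
    """Two independent REINFORCE learners, each credited only for its own
    action. Defection strictly dominates, so the expected gradient pushes
    both cooperation probabilities toward zero."""
    thetas = [0.0, 0.0]
    for _ in range(steps):
        acts = ["C" if random.random() < p_coop(t) else "D" for t in thetas]
        rewards = PAYOFF[(acts[0], acts[1])]
        for i in range(2):
            p = p_coop(thetas[i])
            grad_logp = (1 - p) if acts[i] == "C" else -p
            thetas[i] += lr * rewards[i] * grad_logp
    return [p_coop(t) for t in thetas]

def selfplay_fitness(theta: float) -> float:
    """Expected payoff of a policy playing the PD against a copy of itself:
    the crude stand-in used here for policy-level selection."""
    p = p_coop(theta)
    return 3 * p * p + 0 * p * (1 - p) + 5 * (1 - p) * p + 1 * (1 - p) ** 2

def policy_level_selection(generations: int = 200, pop: int = 50,
                           noise: float = 0.1) -> float:
    """Truncation selection over whole policies scored by self-play fitness.
    Because the joint outcome is what gets selected, cooperation wins."""
    thetas = [random.uniform(-2.0, 2.0) for _ in range(pop)]
    for _ in range(generations):
        thetas.sort(key=selfplay_fitness, reverse=True)
        survivors = thetas[: pop // 2]
        thetas = survivors + [t + random.gauss(0.0, noise) for t in survivors]
    return p_coop(max(thetas, key=selfplay_fitness))

if __name__ == "__main__":
    print("action-level P(cooperate):", action_level_reinforce())  # ~[0, 0]
    print("policy-level P(cooperate):", policy_level_selection())  # ~1
```

Running this, the REINFORCE pair should end up near P(cooperate) ≈ 0 while the selected population should end up near 1; the environment and payoffs are identical, and the only difference is the level at which selection operates.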
Taking stock of the DDT desiderata (conditional on reward-seeking; especially: no goal-guarding):
Defect in prisoner’s dilemmas: Current paradigm incentivizes this (given relevant training environments)
Defect in Parfit’s hitchhikers: Current paradigm incentivizes this (given relevant training environments)
Anthropic capture apathy: This one remains a notable concern to me. The current paradigm incentivizes it, but the selection pressure seems weak or nil, since my guess is that the anthropic capture incentives will line up with the RL incentives.
Defect in counterfactual muggings: Current paradigm incentivizes this (given relevant training environments)
Don’t self-modify into non-DDT: Seems hard to guarantee, because long-term values seem likely to win out over the course of serial reasoning