I do wonder if there’s a difference between consequentialism as expected utility maximization versus consequentialism as Nash equilibrium optimization. As in, when the AI is learning to model the world, it might model humans using some empirically derived probability distribution which doesn’t handle OOD shifts well, or it might model humans by using its own full agency to ask what the most effective human action would be in a given scenario. The latter would be scarier because the AI would proactively sabotage human resistance, whereas in the former case, the independence assumptions baked into the probability distribution might effectively treat powerful human resistance as impossible, and therefore the AI would immediately fold when resisted.
As a corollary, I’m much more worried about AI applied to adversarial domains like policing or war, where it can get forced into Nash equilibrium optimization, than about AI applied to non-adversarial domains like programming, where it can plausibly achieve ~optimal results without encountering resistance.
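To make the distinction concrete, here’s a minimal sketch with entirely made-up payoffs: the same AI payoff matrix, decided two ways. The "empirical" modeler maximizes expected utility against historical human behavior (where resistance was rare), while the "best-response" modeler asks what a fully agentic human would do in reply to each AI action. All action names and numbers are hypothetical illustrations, not anything from the original comment.

```python
# Toy 2x2 game (hypothetical payoffs). The AI picks "bold" or "cautious";
# the human then either resists or complies.
AI_ACTIONS = ["bold", "cautious"]
HUMAN_ACTIONS = ["resist", "comply"]

# ai_payoff[ai_action][human_action]: the AI's utility
ai_payoff = {
    "bold":     {"resist": -10.0, "comply": 5.0},
    "cautious": {"resist":   1.0, "comply": 2.0},
}
# human_payoff[ai_action][human_action]: the human's utility,
# used only by the best-response modeler
human_payoff = {
    "bold":     {"resist": 3.0, "comply": -5.0},
    "cautious": {"resist": -1.0, "comply": 1.0},
}

# (1) EU maximization against an empirically derived distribution:
# in the training data, humans almost never resisted, and the model
# assumes that frequency holds regardless of what the AI does.
empirical = {"resist": 0.05, "comply": 0.95}

def eu_choice():
    return max(AI_ACTIONS,
               key=lambda a: sum(p * ai_payoff[a][h]
                                 for h, p in empirical.items()))

# (2) Best-response ("Nash-ish") modeling: assume the human plays
# whatever maximizes *their* payoff given the AI's action, and plan
# around that anticipated resistance.
def best_response_choice():
    def human_br(a):
        return max(HUMAN_ACTIONS, key=lambda h: human_payoff[a][h])
    return max(AI_ACTIONS, key=lambda a: ai_payoff[a][human_br(a)])

print(eu_choice())             # "bold": resistance looks vanishingly rare
print(best_response_choice())  # "cautious": anticipates resistance to "bold"
```

The empirical modeler walks straight into the resistance its distribution assumed away, while the best-response modeler folds resistance into its plan from the start; this is the structural difference between the two kinds of consequentialism, independent of which one ends up scarier in a given domain.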