Thanks for the clarification!
I think that if a transient period of instability is exogenously bounded (like a snowstorm that passes regardless of our actions), fully observable (we can verify with certainty when the goal is achieved), and non-adversarial, then the modeling does become easier: it is just a temporary empirical phenomenon and doesn’t imply intractable planning problems.
I’m not familiar with all longtermist causes, but I could see something like asteroid impact mitigation, for a specific asteroid we’ve found, fitting quite cleanly into this category: the horizon is finite, the state is observable, and the physics are non-adversarial.
However, the undecidability results do apply to any cause with endogenous termination, partial observability, adversarial dynamics, and semantic verification (we need to check behavioral properties of complex systems). Something like AI alignment has all four of these features to a tee.
To return to the notion of a transient period that is empirically short (say 10 years): if we model it as a fixed horizon T, I see a problem, because the expected duration of a process is not the same thing as the termination condition of a control problem.
A finite-horizon Markov decision process executes a for loop (“run for 10 years”): termination is exogenous because T is a fixed parameter.
In comparison, longtermist causes have endogenous termination. They run a while loop (“run until alignment is guaranteed”), so T is not a parameter at all: the stopping time is determined by whether the trajectory reaches an absorbing state (say, S_utopia or S_paperclip). Expecting the period of instability to last 10 years does not change the formal problem class, because we are still facing a decision problem about which policy gets us to S_utopia rather than S_paperclip. This is the goal-reachability structure, and Madani et al. (2003) show that the decision problem “does there exist a policy reaching a goal state with probability > p?” is undecidable for POMDPs. The fact that the transient period will terminate at some point says nothing about whether we can verify which policy ensures we end in the desired state.
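To make the for-loop/while-loop contrast concrete, here is a minimal Python sketch. Everything in it (the state names, step, policy) is illustrative shorthand of my own, not a model of any actual longtermist scenario:

```python
def fixed_horizon_rollout(policy, step, state, T=10):
    """Exogenous termination (the "for loop"): run for exactly T steps and stop,
    regardless of which state we end up in."""
    for _ in range(T):
        state = step(state, policy(state))
    return state


def goal_reaching_rollout(policy, step, state,
                          absorbing=("S_utopia", "S_paperclip")):
    """Endogenous termination (the "while loop"): run until the trajectory hits an
    absorbing state. The stopping time is a property of the trajectory, not a
    parameter we choose, and nothing guarantees it is ever reached."""
    while state not in absorbing:
        state = step(state, policy(state))
    return state
```

Knowing that the first loop will finish tells us nothing about the second; the question for the second loop is exactly the goal-reachability question above.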
While the planning problem is undecidable, a reasonable objection is that we only need to verify when we have reached S_utopia, and then we can stop. Unfortunately, the verification problem runs into an undecidability barrier of its own, because verifying S_utopia means checking a semantic property. Rice’s Theorem shows that non-trivial semantic properties are undecidable for systems of sufficient computational expressiveness, and several recent works suggest AI systems fit the bill: Schuurmans (2023) proved that LLMs with external memory can “exactly simulate the execution of a universal Turing machine,” and Feng (2024) showed that for any computable function there exists a prompt that causes a finite-size Transformer to compute it. So agentic AI systems are effectively Turing-complete and fall under Rice’s Theorem.
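Here is a rough sketch of why such verification hits the barrier. Everything below is hypothetical scaffolding of my own (simulate stands in for a universal simulator, verify_benign for the verifier that Rice’s Theorem says cannot exist in general); it shows the standard reduction shape, not anyone’s actual proposal:

```python
def simulate(machine, tape):
    """Hypothetical universal simulator: returns only if `machine` halts on `tape`
    (a real one would simply run forever on non-halting inputs)."""
    raise NotImplementedError  # placeholder


def would_decide_halting(verify_benign, machine, tape):
    """Rice-style reduction sketch: if a total, correct verifier `verify_benign(program)`
    existed for the semantic property "never exhibits the bad behaviour", it could be
    used to decide the halting problem, which is impossible."""

    def wrapper(_input):
        simulate(machine, tape)   # finishes only if `machine` halts on `tape`
        return "bad behaviour"    # so the bad behaviour appears iff `machine` halts

    # `wrapper` violates the property exactly when `machine` halts on `tape`,
    # so a verifier for the property would double as a halting oracle.
    return not verify_benign(wrapper)
```

The same shape applies to “this system has reached S_utopia and will stay aligned”: once the system under verification is computationally universal, the property is semantic and non-trivial, and no general decision procedure exists.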
(I also think the specific case of AI safety, viewed from the longtermist perspective, suffers from being the hard case by default, since a misaligned AI is an optimizer actively searching for gaps in the verification procedure.)
Finally, if we decide to arbitrarily cut the horizon down, say to 10 years, estimate a terminal value function V(T), and optimize against that, we only relocate the undecidability. To compute V(T) accurately we must estimate the probability that the system is in a genuinely safe state, rather than an unsafe latent one, at time T; but making that determination is the same undecidable problem, now hidden inside the terminal value function. I think we end up with truncation bias: any solver using a heuristic for V(T) will prefer premature stabilization (a suboptimal but observable state, like totalitarianism) over the uncertainty of the while loop.
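A toy numerical illustration of that truncation bias (the numbers, state names, and actions are all invented for illustration; this is a cartoon of the worry, not a model of any real planner):

```python
# At the truncated horizon T the planner needs a terminal value V(T) for each state.
# The "keep going" branch ends in a state whose safety we cannot verify (the latent
# state is unresolved), so a cautious heuristic assigns it a pessimistic value.
TERMINAL_VALUE = {
    "locked_in_stable": 0.6,   # observable and verifiable, but mediocre (e.g. lock-in)
    "unresolved_latent": 0.3,  # heuristic guess: could be utopia (1.0) or catastrophe (0.0)
}

Q = {
    "stabilize_now": TERMINAL_VALUE["locked_in_stable"],
    "keep_solving_alignment": TERMINAL_VALUE["unresolved_latent"],
}

best_action = max(Q, key=Q.get)
print(best_action)  # -> "stabilize_now": the heuristic V(T) steers toward premature
                    # stabilization even if the true value of continuing were higher.
```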
So overall, I think that even if we grant the empirical hypothesis about the transient period of instability, the problem remains undecidable. In general, any risk that requires verifying a semantic property of a Turing-complete (or effectively universal) system faces the Rice’s Theorem barrier.
I believe longtermists are left with the same options I mentioned in the paper: prove that the POMDPs they care about fall into a decidable subclass, or concede that they are using heuristics that don’t grant normative authority.
While I think this is a broadly reasonable response, I’m curious what you think could provide better public justification than longtermism. These results seem to apply fairly broadly to any realistic EV-based justification for action, given that partial observability is very much the rule.
I genuinely don’t know. It’s out of my depth to try to sensibly answer that. I think it’s sometimes easier to see the error in something than the solution.
All the same, I have a niggling fear that longtermist reasoning as practiced by MacAskill and others rests on a foundation with very serious problems. That’s not a minor worry when the future of the universe is being decided.
In contrast, I fully believe that EA efforts like distributing malaria nets are wonderful and sensible.
So in summary, not sure.
@David Johnston I wrote up some of my own philosophical thoughts in a piece (it doesn’t answer the longtermist question, though). I suspect you might really dislike, and disagree with, the essay, but your comments are so good and sharp. If you are ever in the mood (there is also a small chance it will resonate) and want to help me think more clearly: https://www.lesswrong.com/posts/zANA2aJzTQDJutguA/join-my-new-movement-for-the-post-ai-world