Work supported by a Manifund grant titled “Alignment is hard”.
While many people have claimed that the alignment problem is hard in an engineering sense, this paper argues that the alignment problem is impossible in at least one case in a theoretical computer science sense. The argument being formalized is that if we can’t prove a program will loop forever, we can’t prove an agent will care about us forever. More formally, when the agent’s environment can be modeled with discrete time, the agent’s architecture is agentically-Turing complete, and the agent’s code is immutable, testing the agent’s alignment is CoRE-hard if the alignment schema is demon-having, angel-having, universally-betrayal-sensitive, perfect, and thought-apathetic. Further research could change most of the assumptions in that argument other than the immutable code.
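To make the informal argument above concrete, here is a minimal Python sketch of the intuition. It is not the paper's formal construction. The names `make_agent`, `looks_aligned`, `halts_after`, and `loops_forever`, and the two actions "cooperate" and "betray", are hypothetical and chosen only for illustration. The sketch wraps an arbitrary program in an agent that cooperates on every time step while the program is still running and betrays once it halts, so the agent stays aligned forever exactly when the program never halts.

```python
from typing import Callable, Iterator


def halts_after(n: int) -> Iterator[None]:
    """Hypothetical program that runs for n steps and then halts."""
    for _ in range(n):
        yield


def loops_forever() -> Iterator[None]:
    """Hypothetical program that never halts."""
    while True:
        yield


def make_agent(program: Callable[[], Iterator[None]]) -> Iterator[str]:
    """Build an agent that emits one action per discrete time step.

    The agent cooperates while the wrapped program is still running
    and betrays on every step after the program halts.
    """
    def agent() -> Iterator[str]:
        for _ in program():      # simulate one program step per time step
            yield "cooperate"
        while True:              # the program has halted
            yield "betray"
    return agent()


def looks_aligned(agent: Iterator[str], horizon: int) -> bool:
    """Finite test: observe the agent for `horizon` steps only."""
    for _, action in zip(range(horizon), agent):
        if action == "betray":
            return False
    return True
```

Under these assumptions, an agent built from `halts_after(5)` passes any test with a horizon of 5 steps or fewer and fails at horizon 6, while an agent built from `loops_forever` passes every finite test. No finite horizon separates "never betrays" from "has not betrayed yet", which is the same obstacle as trying to confirm that a program loops forever by running it.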
This is my first major paper on alignment. Since there isn’t really an alignment journal, I’m aiming to have this post act as a peer review step, but forums are weird. Getting the formatting right seems dubious, so I’m posting the abstract and linking the PDF.