Some Thoughts on Metaphilosophy

A powerful AI (or human-AI civilization) guided by wrong philosophical ideas would likely cause astronomical (or beyond astronomical) waste. Solving metaphilosophy is one way in which we can hope to avoid this kind of disaster. For my previous thoughts on this topic and further motivation see Metaphilosophical Mysteries, The Argument from Philosophical Difficulty, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety.

Some interrelated ways of looking at philosophy

Philosophy as answering confusing questions

This was my starting point for thinking about what philosophy is: it’s what we do when we try to answer confusing questions, or questions that we don’t have any other established methodology for answering. Why do we find some questions confusing, or lack methods for answering them? This leads to my next thought.

Philosophy as ability to generalize / handle distributional shifts

ML systems tend to have a lot of trouble dealing with distributional shifts. (This seems to be a root cause of many AI as well as human safety problems.) But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently using their existing training to respond to them. This is perhaps most obvious in unfamiliar ethical situations like Torture vs Dust Specks or trying to determine whether our moral circle should include things like insects and RL algorithms. Unlike ML algorithms that extrapolate in an essentially random way when given out-of-distribution inputs, humans can potentially generalize in a principled or correct way, by using philosophical reasoning.

Philosophy as slow but general purpose problem solving

Philosophy may even be a fully general purpose problem solving technique. At least we don’t seem to have reason to think that it’s not. The problem is that it’s painfully slow and resource intensive. Individual humans acting alone seem to have little chance of achieving justifiably high confidence in many philosophical problems even if they devote their entire lives to those problems. Humanity has been collectively trying to solve some philosophical problems for hundreds or even thousands of years, without arriving at final solutions. The slowness of philosophy explains why distributional shifts remain a safety problem for humans, even though we seemingly have a general way of handling them.

Philosophy as meta problem solving

Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy) we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. Instead of using philosophy to solve individual decision problems, we use it to solve decision theory. Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.

Philosophy as “high computational complexity class”

If philosophy can solve any problem within a very large class, then it must have a “computational complexity class” that’s as high as any given problem within that class. Computational complexity can be measured in various ways, such as time and space complexity (on various actual machines or models of computation), whether and how high a problem is in the polynomial hierarchy, etc. “Computational complexity” of human problems can also be measured in various ways, such as how long it would take to solve a given problem using a specific human, group of humans, or model of human organizations or civilization, and whether and how many rounds of DEBATE would be sufficient to solve that problem either theoretically (given infinite computing power) or in practice.

The point here is that no matter how we measure complexity, it seems likely that philosophy would have a “high computational complexity class” according to that measure.

Philosophy as interminable debate

The visible aspects of philosophy (as traditionally done) seem to resemble an endless (both in clock time and in the number of rounds) game of debate, where people propose new ideas, arguments, counterarguments, counter-counterarguments, and so on, and at the same time try to judge proposed solutions based on these ideas and arguments. People sometimes complain about the interminable nature of philosophical discussions, but that now seems understandable if philosophy is a “high computational complexity” method of general purpose problem solving.

In a sense, philosophy is the opposite of math. In math, any debate can be settled by producing a proof, which makes it analogous to the complexity class NP (in practice, a couple more rounds may be needed for people to find or fix flaws in the proof). In philosophy, potentially no fixed number of rounds of debate (or DEBATE) is enough to settle all philosophical problems.

Philosophy as Jürgen Schmidhuber’s General TM

Unlike a traditional Turing Machine, a General TM or GTM may edit its previous outputs, and can be considered to solve a problem even if it never terminates, as long as it stops editing its output after a finite number of edits and the final output is the correct solution. So if a GTM solves a certain problem, you know that it will eventually converge to the right solution, but you have no idea when, or whether what’s on its output tape at any given moment is the right solution. This seems a lot like philosophy, where people can keep changing their minds (or adjusting their credences) based on an endless stream of new ideas, arguments, counterarguments, and so on, and you never really know when you’ve arrived at a correct answer.
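The "converges but never announces it" behavior can be made concrete with a toy sketch. This is only an illustration of the idea, not Schmidhuber's formal definition. The names `gtm_outputs`, `snapshot`, and `slow_program` are hypothetical, and `halts_within(n)` is an assumed stand-in for "simulate some program for n steps and report whether it has halted". The process below guesses "never halts" and edits that guess at most once, so its output is eventually correct, but an observer reading the tape at any finite time cannot tell whether the current guess is final.

```python
from typing import Callable, Iterator


def gtm_outputs(halts_within: Callable[[int], bool]) -> Iterator[bool]:
    """Yield the content of the output tape after each step, forever.

    The tape starts at False ("the program never halts") and is edited to
    True at most once, when the simulated program is seen to halt.
    """
    answer = False
    steps = 0
    while True:  # the process itself never terminates
        steps += 1
        if not answer and halts_within(steps):
            answer = True  # the single edit; the output never changes again
        yield answer


def snapshot(halts_within: Callable[[int], bool], budget: int) -> bool:
    """Return what an observer reads on the output tape after `budget` steps."""
    outputs = gtm_outputs(halts_within)
    result = False
    for _ in range(budget):
        result = next(outputs)
    return result


def slow_program(n: int) -> bool:
    """Hypothetical program that halts after exactly 1000 steps."""
    return n >= 1000


def never_halts(n: int) -> bool:
    """Hypothetical program that runs forever."""
    return False
```

Reading the tape early, `snapshot(slow_program, 10)` and `snapshot(never_halts, 10)` give the same answer, even though only one of them is the final, correct output. Nothing on the tape distinguishes a settled answer from one that is about to be revised.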

What to do until we solve metaphilosophy?

Protect the trajectory?

What would you do if you had a GTM that could solve a bunch of really important problems, and that was the only method you had of solving them? You’d try to reverse-engineer it and make a bunch of copies. But if you couldn’t do that, then you’d want to put layers and layers of protection around it. Applied to philosophy, this line of thought seems to lead to the familiar ideas of using global coordination (or a decisive strategic advantage) to stop technological progress, or having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

Replicate the trajectory with ML?

Another idea is to try to build a good enough approximation of the GTM by training ML on its observable behavior (including whatever work tapes you have read access to). But there are two problems with this: 1. This is really hard or impossible to do if the GTM has internal state that you can’t observe. And 2. If you haven’t already reverse-engineered the GTM, there’s no good way to know that you’ve built a good enough approximation, i.e., to know that the ML model won’t end up converging to answers that are different from the GTM’s.

A three part model of philosophical reasoning

It may be easier to understand the difficulty of capturing philosophical reasoning with ML by considering a more concrete model. I suggest we can divide it into three parts as follows:

A. Propose new ideas/arguments/counterarguments/etc. according to some (implicit) distribution.
B. Evaluate existing ideas/arguments/counterarguments/etc.
C. Based on past ideas/arguments/counterarguments/etc., update some hidden state that changes how one does A and B.

It’s tempting to think that building an approximation of B using ML perhaps isn’t too difficult, and then we can just search for the “best” ideas/arguments/counterarguments/etc. using standard optimization algorithms (maybe with some safety precautions like trying to avoid adversarial examples for the learned model). There’s some chance this could work out well, but without having a deeper understanding of metaphilosophy, I don’t see how we can be confident that throwing out A and C won’t lead to disaster, especially in the long run. But A and C seem very hard or impossible for ML to capture (A due to paucity of training data, and C due to the unobservable state).
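The shape of the three-part model can be sketched as a loop. Everything here is a hypothetical toy: ideas are plain numbers, the scoring rule in `evaluate` and its target value 3.0 are invented for illustration, and, as a simplification, the hidden state `bias` only influences A, whereas in the model above it would also change how B is done. The sketch is meant to show where the unobservable state sits, not how philosophical reasoning works.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Reasoner:
    # Hidden state (part C). An outside observer would see only the proposals.
    bias: float = 0.0
    history: list = field(default_factory=list)

    def propose(self, rng: random.Random) -> float:
        """A: draw a new idea from an implicit, state-dependent distribution."""
        return rng.gauss(self.bias, 1.0)

    def evaluate(self, idea: float) -> float:
        """B: score an idea. Toy rule: closer to a made-up target is better."""
        return -abs(idea - 3.0)

    def update(self, idea: float, score: float) -> None:
        """C: move the hidden state toward the best idea seen so far."""
        self.history.append((idea, score))
        best_idea, _ = max(self.history, key=lambda pair: pair[1])
        self.bias += 0.5 * (best_idea - self.bias)


def run(steps: int, seed: int = 0) -> Reasoner:
    """Run the propose / evaluate / update loop for a fixed number of steps."""
    rng = random.Random(seed)
    reasoner = Reasoner()
    for _ in range(steps):
        idea = reasoner.propose(rng)
        reasoner.update(idea, reasoner.evaluate(idea))
    return reasoner
```

In this toy, a learner that is trained only on `evaluate` and then optimized against would never reproduce the drift in `bias`, which is the sketch's analogue of throwing out A and C.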

Is there a way around this difficulty? What else can we do in the absence of a full white-box solution to metaphilosophy?