Some Thoughts on Metaphilosophy

A powerful AI (or human-AI civilization) guided by wrong philosophical ideas would likely cause astronomical (or beyond astronomical) waste. Solving metaphilosophy is one way in which we can hope to avoid this kind of disaster. For my previous thoughts on this topic and further motivation, see Metaphilosophical Mysteries, The Argument from Philosophical Difficulty, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety.

Some interrelated ways of looking at philosophy

Philosophy as answering confusing questions

This was my starting point for thinking about what philosophy is: it’s what we do when we try to answer confusing questions, or questions that we don’t have any other established methodology for answering. Why do we find some questions confusing, or lack methods for answering them? This leads to my next thought.

Philosophy as ability to generalize / handle distributional shifts

ML systems tend to have a lot of trouble dealing with distributional shifts. (This seems to be a root cause of many AI safety problems as well as human safety problems.) But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently applying their existing training to them. This is perhaps most obvious in unfamiliar ethical situations like Torture vs Dust Specks, or when trying to determine whether our moral circle should include things like insects and RL algorithms. Unlike ML algorithms that extrapolate in an essentially random way when given out-of-distribution inputs, humans can potentially generalize in a principled or correct way, by using philosophical reasoning.
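To make the contrast concrete, here is a minimal sketch (my own illustration, not drawn from any existing system) of the two behaviors: a toy classifier that always extrapolates confidently, next to a variant that checks whether an input looks out-of-distribution and defers instead of answering. The class name and the threshold rule are made up for the example.

```python
import numpy as np

class NearestCentroidClassifier:
    """Toy classifier used to contrast confident extrapolation with deferral."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Record typical in-distribution distances so we can flag outliers later.
        dists = np.linalg.norm(
            X - self.centroids_[np.searchsorted(self.classes_, y)], axis=1)
        self.ood_threshold_ = dists.mean() + 3 * dists.std()
        return self

    def predict(self, x):
        """Confident extrapolation: always answers, however far x is from the training data."""
        return self.classes_[np.argmin(np.linalg.norm(self.centroids_ - x, axis=1))]

    def predict_or_defer(self, x):
        """Deferring version: answers only if x looks in-distribution, else signals confusion."""
        d = np.linalg.norm(self.centroids_ - x, axis=1)
        if d.min() > self.ood_threshold_:
            return None  # "confused": hand off to slower, more general reasoning
        return self.classes_[np.argmin(d)]
```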

Philosophy as slow but general purpose problem solving

Philosophy may even be a fully general purpose problem solving technique. At least we don’t seem to have any reason to think that it’s not. The problem is that it’s painfully slow and resource intensive. Individual humans acting alone seem to have little chance of achieving justifiably high confidence in answers to many philosophical problems, even if they devote their entire lives to those problems. Humanity has been collectively trying to solve some philosophical problems for hundreds or even thousands of years, without arriving at final solutions. The slowness of philosophy explains why distributional shifts remain a safety problem for humans, even though we seemingly have a general way of handling them.

Philosophy as meta problem solving

Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object-level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy), we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. Instead of using philosophy to solve individual decision problems, we use it to solve decision theory. Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.

Philosophy as “high computational complexity class”

If philosophy can solve any problem within a very large class, then it must have a “computational complexity class” at least as high as that of any problem within that class. Computational complexity can be measured in various ways, such as time and space complexity (on various actual machines or models of computation), or whether and how high a problem sits in the polynomial hierarchy. The “computational complexity” of human problems can also be measured in various ways, such as how long it would take to solve a given problem using a specific human, group of humans, or model of human organizations or civilization, and whether and how many rounds of DEBATE would be sufficient to solve that problem, either theoretically (given infinite computing power) or in practice.

The point here is that no matter how we measure complexity, it seems likely that philosophy would have a “high computational complexity class” according to that measure.

Philosophy as interminable debate

The visible aspects of philosophy (as traditionally done) seem to resemble an endless (both in clock time and in the number of rounds) game of debate, where people propose new ideas, arguments, counterarguments, counter-counterarguments, and so on, and at the same time try to judge proposed solutions based on these ideas and arguments. People sometimes complain about the interminable nature of philosophical discussions, but that now seems understandable if philosophy is a “high computational complexity” method of general purpose problem solving.

In a sense, philosophy is the opposite of math: whereas in math any debate can be settled by producing a proof, which makes it analogous to the complexity class NP (in practice maybe a couple more rounds of people finding or fixing flaws in the proof are needed), potentially no fixed number of rounds of debate (or DEBATE) is enough to settle all philosophical problems.
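As a rough illustration of this asymmetry (my own sketch, with hypothetical function names), compare settling a question by checking a certificate with settling it by open-ended debate: the first is a single bounded verification pass, while the second has no termination guarantee at all.

```python
def settle_by_certificate(clauses, assignment):
    """Math-like settling (NP flavor): one polynomial-time pass over a proposed certificate.

    clauses: CNF formula as a list of clauses, each a list of signed variable indices.
    assignment: dict mapping variable index -> bool, acting as the certificate.
    """
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in clauses)

def settle_by_debate(proposition, propose_argument, judge):
    """Philosophy-like settling: keep soliciting arguments until none are found.

    Nothing bounds how many rounds this takes, and the loop may never terminate;
    the judge's verdict at any point is always open to revision by the next round.
    """
    verdict, transcript = None, []
    while True:
        argument = propose_argument(proposition, transcript)
        if argument is None:  # no one can think of anything new (may never happen)
            return verdict
        transcript.append(argument)
        verdict = judge(proposition, transcript)
```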

Philosophy as Jürgen Schmidhuber’s General TM

Unlike a traditional Turing machine, a General TM or GTM may edit its previous outputs, and can be considered to solve a problem even if it never terminates, as long as it stops editing its output after a finite number of edits and the final output is the correct solution. So if a GTM solves a certain problem, you know that it will eventually converge to the right solution, but you have no idea when that will happen, or whether what’s on its output tape at any given moment is the right solution. This seems a lot like philosophy, where people can keep changing their minds (or adjusting their credences) based on an endless stream of new ideas, arguments, counterarguments, and so on, and you never really know when you’ve arrived at a correct answer.
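Here is a toy sketch of that convergence behavior (purely illustrative; the run_gtm and philosopher names are mine, and this is not Schmidhuber’s actual construction): the process may overwrite its output at any step, and an observer watching the tape never gets a signal that the current answer is final.

```python
import itertools
import random

def run_gtm(step, max_steps=None):
    """step(state) -> (new_state, output or None); None means 'no edit this step'."""
    state, tape, t = None, None, 0
    while max_steps is None or t < max_steps:
        state, output = step(state)
        if output is not None:
            tape = output  # the machine revises its previous answer
        yield tape         # observers see the current tape, never a 'done' flag
        t += 1

def philosopher(state):
    """A process whose answer keeps being revised; perturbations shrink over time,
    so the output happens to converge, but nothing in the interface reveals that."""
    credence, step = (0.5, 1) if state is None else state
    credence += random.gauss(0, 1.0) / step ** 2
    return (credence, step + 1), round(credence, 2)

for snapshot in itertools.islice(run_gtm(philosopher), 10):
    print("current answer on tape:", snapshot)
```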

What to do until we solve metaphilosophy?

Protect the trajectory?

What would you do if you had a GTM that could solve a bunch of really important problems, and that was the only method you had of solving them? You’d try to reverse-engineer it and make a bunch of copies. But if you couldn’t do that, then you’d want to put layers and layers of protection around it. Applied to philosophy, this line of thought seems to lead to the familiar ideas of using global coordination (or a decisive strategic advantage) to stop technological progress, or having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.

Replicate the trajectory with ML?

Another idea is to try to build a good enough approximation of the GTM by training ML on its observable behavior (including whatever work tapes you have read access to). But there are two problems with this: 1. This is really hard or impossible to do if the GTM has internal state that you can’t observe. And 2. If you haven’t already reverse-engineered the GTM, there’s no good way to know that you’ve built a good enough approximation, i.e., to know that the ML model won’t end up converging to answers that are different from the GTM’s.
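A rough sketch of what this replication attempt might look like (hypothetical names, and assuming the visible outputs are simple numeric edits): behavioral cloning of the GTM’s observable behavior. Both problems show up directly in the code: the true process’s hidden state is exactly the thing we never get to log, and nothing in the fitting procedure certifies that the clone converges to the same answers.

```python
import numpy as np

def collect_trace(gtm_step, horizon):
    """Roll out the true process, recording only its visible outputs.
    gtm_step(state) -> (new_state, numeric_output); 'state' is what we cannot log."""
    state, visible = None, []
    for _ in range(horizon):
        state, output = gtm_step(state)
        visible.append(output)
    return visible

def fit_next_edit_model(traces, context=3):
    """Crude stand-in for ML: predict the next visible edit from the last `context` edits."""
    table = {}
    for trace in traces:
        for i in range(context, len(trace)):
            key = tuple(trace[i - context:i])
            table.setdefault(key, []).append(trace[i])
    # Average the observed continuations. Wherever the hidden state mattered, this
    # blurs genuinely different behaviors into a single prediction (problem 1), and
    # nothing here tells us whether the clone converges like the original (problem 2).
    return {key: float(np.mean(vals)) for key, vals in table.items()}
```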

A three-part model of philosophical reasoning

It may be easier to understand the difficulty of capturing philosophical reasoning with ML by considering a more concrete model. I suggest we can divide it into three parts as follows: A. Propose new ideas/arguments/counterarguments/etc. according to some (implicit) distribution. B. Evaluate existing ideas/arguments/counterarguments/etc. C. Based on past ideas/arguments/counterarguments/etc., update some hidden state that changes how one does A and B. It’s tempting to think that building an approximation of B using ML perhaps isn’t too difficult, and that we can then just search for the “best” ideas/arguments/counterarguments/etc. using standard optimization algorithms (maybe with some safety precautions, like trying to avoid adversarial examples for the learned model). There’s some chance this could work out well, but without a deeper understanding of metaphilosophy, I don’t see how we can be confident that throwing out A and C won’t lead to disaster, especially in the long run. And A and C seem very hard or impossible for ML to capture (A due to paucity of training data, and C due to the unobservable state).
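A schematic sketch of this three-part model (the function names and interfaces are mine, purely for illustration): the full loop uses A, B, and C together, while the tempting shortcut keeps only a learned approximation of B and searches against it.

```python
import random

def philosophical_reasoning(problem, propose, evaluate, update_hidden, steps, hidden=None):
    """Full loop: A proposes, B evaluates, C updates hidden state that shapes A and B."""
    corpus, best = [], None
    for _ in range(steps):
        idea = propose(problem, corpus, hidden)          # A: implicit proposal distribution
        score = evaluate(problem, idea, corpus, hidden)  # B: judge ideas/arguments
        corpus.append((idea, score))
        hidden = update_hidden(hidden, corpus)           # C: unobservable state change
        if best is None or score > best[1]:
            best = (idea, score)
    return best

def tempting_shortcut(problem, learned_B, candidate_space, steps):
    """Keep only an ML approximation of B and optimize against it (A and C thrown out)."""
    best = None
    for _ in range(steps):
        idea = random.choice(candidate_space)  # generic search in place of A
        score = learned_B(problem, idea)       # learned evaluator, frozen: no C
        if best is None or score > best[1]:
            best = (idea, score)
    return best
```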

Is there a way around this difficulty? What else can we do in the absence of a full white-box solution to metaphilosophy?