Splitting Decision Theories

[epistemic status: maybe wrong; thinking aloud, would like people to yell at me]

There is a repeated motion that occurs when deciding what an AI should do:

(1) Create a decision theory

(2) Create a thought experiment in which an agent with *DT makes a choice which fails to fulfill its utility function (e.g. Oh no! It loses all its money to blackmail!)

(3) Create a new DT which does well against problems exhibiting the core difficulty that allowed the previous decision theory to lose all its money

If decision theories are specified precisely enough to be mathematical structures, then for every two distinct decision theories there exists, in mathematical reality, a set of “thought experiments” on which the two theories decide differently.

This seems weird and difficult right now because there is no shared logical notation spanning different “thought experiments”: characterizing the class of splitting decision problems for two decision theories is still pretheoretic. Nevertheless, for every pair of decision theories DT_1 and DT_2 the object split(DT_1, DT_2), the class of choice problems on which the two theories give different answers, actually exists; our current notation just isn’t up to characterizing it simply and completely.
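To make the object concrete, here is a minimal sketch of split over a finite, hand-built problem space. Everything in it is invented for illustration: the Problem class and the two toy “theories” (maximin and maximax) are stand-ins, not real decision theories. The point is only that once problems and theories are written down as mathematical objects, split is just a filter over the problem space.

```python
# A toy formalization of split(DT_1, DT_2) over a finite, hand-built problem
# space. The Problem class and the two "theories" below are invented for this
# sketch; they are stand-ins, not real decision theories.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Problem:
    name: str
    # action -> the payoffs that action might lead to
    outcomes: Dict[str, List[float]]


DecisionTheory = Callable[[Problem], str]


def maximin(problem: Problem) -> str:
    """A cautious toy 'DT': pick the action whose worst case is best."""
    return max(problem.outcomes, key=lambda a: min(problem.outcomes[a]))


def maximax(problem: Problem) -> str:
    """An optimistic toy 'DT': pick the action whose best case is best."""
    return max(problem.outcomes, key=lambda a: max(problem.outcomes[a]))


def split(dt1: DecisionTheory, dt2: DecisionTheory,
          problems: List[Problem]) -> List[Problem]:
    """The problems on which the two theories decide differently."""
    return [p for p in problems if dt1(p) != dt2(p)]


problems = [
    Problem("safe-vs-gamble", {"safe": [1.0], "gamble": [0.0, 10.0]}),
    Problem("dominated",      {"good": [5.0, 6.0], "bad": [0.0, 1.0]}),
]

print([p.name for p in split(maximin, maximax, problems)])
# -> ['safe-vs-gamble']: the only problem here on which the two theories come apart
```

The hard part the post is pointing at is, of course, that real decision problems don’t arrive pre-packaged in a shared format like Problem.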

But it feels like this sort of problem occupies a status similar to the one “algorithms” had before the first Universal Turing Machine was constructed.

-

Questions:

In fun games (like the prisoner’s dilemma) we have agents (like FairBot) that fight each other. The source code of these agents is entangled with their decision theory. Does examining bots engaged in modal combat make this problem more tractable?
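As a toy illustration of the kind of object modal combat studies, here is a simulation-based sketch of FairBot. Real modal agents are defined in provability logic, and FairBot cooperates with FairBot via Löb’s theorem; the bounded recursion and optimistic base case below are crude stand-ins for that, so read this as flavor rather than the actual formalism.

```python
# A toy, simulation-based stand-in for modal combat. Real modal agents are
# defined via provability logic; here bounded simulation with an optimistic
# base case crudely imitates the Löbian argument.

from typing import Callable

Action = str  # "C" (cooperate) or "D" (defect)
# A bot takes its opponent (another bot) and a remaining simulation budget.
Bot = Callable[["Bot", int], Action]


def cooperate_bot(opponent: Bot, depth: int) -> Action:
    return "C"


def defect_bot(opponent: Bot, depth: int) -> Action:
    return "D"


def fair_bot(opponent: Bot, depth: int) -> Action:
    """Cooperate iff I can establish that the opponent cooperates with me."""
    if depth == 0:
        # Optimistic base case: with no budget left, assume cooperation.
        # This is a crude stand-in for the Löbian step in real modal combat.
        return "C"
    return "C" if opponent(fair_bot, depth - 1) == "C" else "D"


def play(bot_a: Bot, bot_b: Bot, depth: int = 10) -> tuple:
    return bot_a(bot_b, depth), bot_b(bot_a, depth)


print(play(fair_bot, fair_bot))       # ('C', 'C')  -- mutual cooperation
print(play(fair_bot, defect_bot))     # ('D', 'D')  -- FairBot is not exploited
print(play(fair_bot, cooperate_bot))  # ('C', 'C')
```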

This process repeats like clockwork (it feels like a new decision theory comes out every year or so?) as researchers hope to give their baby AI a good way of making good choices and not losing all its money. What if I built an AI that formalized and internalized this process and just … gave itself good advice?

Within logical inductors, traders bet on logical sentences and traders that make bad bets lose their money. If we can formalize split(DT_1, DT_2), we can look at how well agents fulfill their utility functions in this space. Can we use this to establish a kind of poset of decision theories?
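One hedged guess at how such a poset could be defined, continuing the split sketch above (it reuses Problem, split, maximin, maximax, and the problems list from there): say one theory dominates another if, on every problem where they split, its choice scores at least as well under some fixed utility model, and strictly better somewhere. The uniform-expectation utility model here is an assumption made only for this toy.

```python
# Continuing the split() sketch above (reuses Problem, split, maximin, maximax
# and the `problems` list). The uniform-expectation utility model is an
# assumption made only for this toy.

def expected_utility(problem: Problem, action: str) -> float:
    outcomes = problem.outcomes[action]
    return sum(outcomes) / len(outcomes)  # uniform weighting over outcomes


def dominates(dt_better: DecisionTheory, dt_worse: DecisionTheory,
              problems: List[Problem]) -> bool:
    """dt_better is at least as good everywhere the two theories split,
    and strictly better on at least one such problem."""
    disagreements = split(dt_better, dt_worse, problems)
    if not disagreements:
        return False
    gaps = [expected_utility(p, dt_better(p)) - expected_utility(p, dt_worse(p))
            for p in disagreements]
    return all(g >= 0 for g in gaps) and any(g > 0 for g in gaps)


# Under this utility model, maximax dominates maximin on the toy problem set:
print(dominates(maximax, maximin, problems))  # True (the gamble averages 5.0 > 1.0)
```

Dominance in this sense is at best a partial order: many pairs of theories will be incomparable because each wins on some slice of their splitting set, which seems like the structure the poset question is gesturing at.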