Towards learning incomplete models using inner prediction markets

This post is a short informal summary of a new idea I’m starting to work on, which combines my thinking about “incomplete models” with Scott’s logical inductors.

Motivation

Before, I speculated that generalizing Bayesian inference to include incomplete models will allow solving the grain of truth problem in a way more satisfactory than what was achieved so far. More generally, this would allow getting performance guarantees for agents in environments that are as complex or more complex than the agent.

Now, such performance guarantees are already known for learning algorithms suited to the non-realizable setting. However, as Jessica noted here, those methods don’t address long-term planning due to the scarcity of training data. On the other hand, Bayesian methods do allow long-term planning: if the environment is realizable (i.e. absolutely continuous w.r.t. the prior), on-policy merging of opinions will occur at a rate that doesn’t depend on the utility function. This means that for a fixed environment and a sufficiently slowly falling time discount, the agent will be able to form effective long-term plans, at least as long as on-policy forecasting is sufficient. Of course, realistic settings require off-policy forecasting, which requires some exploration. If we want global optimality in policy space, we would have to explore for an entire horizon, which means long-term planning fails again. However, I think that satisfactory weaker optimality guarantees can be achieved by more conservative exploration, especially when “consulting a (human) expert” is an available form of “exploration”.

This advantage of Bayesian agents is only applicable in the realizable case, which is an unrealistic assumption. However, the inclusion of incomplete models would bring the advantage into the non-realizable case: the environment might be arbitrarily complex, but as long as it conforms to some simple incomplete model, this model can be learnt quickly and exploited for long-term planning.

Proposal

Previously, I suggested addressing incomplete models using non-Bayesian decision rules. Here, I propose a different approach. At each moment of time, the policy is selected using normal expected utility maximization, for some posterior probability measure. However, the posterior probability measure doesn’t come from updating a prior on observations. Instead, it comes from consulting an “inner prediction market”: Scott’s logical inductor adapted for forecasting the environment instead of forecasting logical sentences (this is somewhat similar to the “universal inductor”).

In the pure forecasting setting, the formalisation seems straightforward. Instead of a “share” for each logical sentence, our market has a “share” for each event of the form $e_{< n} \in A$ , where $A \subseteq O^{n}$ . More generally, it seems useful to consider a “share” for each continuous function $f : O^{ω} \to R$ (the ultimate value of such a share can be any number rather than only 0 or 1). Instead of “propositionally consistent worlds” we have $O^{ω}$ . The “deductive process” is simply the observation of $x = e_{< n}$ (so that $x O^{ω}$ are the remaining consistent worlds). The “market” itself is a sequence $\{μ_{n} \in P (O^{ω})\}_{n \in N}$ .
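
To make these objects concrete, here is a toy Python sketch of the pure forecasting setting. Everything in it is an assumption made for illustration: the horizon is truncated to a finite `HORIZON` in place of $O^{ω}$ , the names `consistent_worlds`, `price` and `uniform_market` are hypothetical, and the uniform market is only a placeholder, not an actual inductor. It just shows what a share, its price under $μ_{n}$ , and the deductive process look like.

```python
from itertools import product

O = (0, 1)     # observation alphabet (assumed binary for this sketch)
HORIZON = 3    # finite truncation standing in for O^ω (an assumption)


def consistent_worlds(prefix):
    """All length-HORIZON sequences that extend the observed prefix x."""
    rest = HORIZON - len(prefix)
    return [tuple(prefix) + tail for tail in product(O, repeat=rest)]


def price(mu, f):
    """Price of a share that pays f(world): the expectation of f under mu."""
    return sum(p * f(world) for world, p in mu.items())


def uniform_market(prefix):
    """Placeholder market state: uniform over the remaining consistent worlds.

    A real inner prediction market would set prices via its traders; this
    stand-in only respects the deductive process.
    """
    worlds = consistent_worlds(prefix)
    return {world: 1.0 / len(worlds) for world in worlds}


# Share for the event "e_{<2} is in A" with A = {(1, 1)}: pays 1 or 0.
def event_share(world):
    return 1.0 if world[:2] == (1, 1) else 0.0


# Share for a real-valued function: the fraction of ones in the sequence.
def real_share(world):
    return sum(world) / HORIZON


mu_1 = uniform_market((1,))   # market state after observing x = (1,)
print(price(mu_1, event_share), price(mu_1, real_share))
```

Observing a longer prefix shrinks `consistent_worlds`, which is all that the “deductive process” does in this setting.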

In the general setting, we can take the event space to be either $(A \times O)^{*} \times A \to O$ or $(A \times O)^{ω}$ . In the first case, there will be some events we will never observe (similar to undecidable sentences in “classical” logical induction). In the second case, we will have to choose the policy by conditioning on it “EDT style.” I suspect that the two approaches are equivalent.

Given an incomplete model $Φ \in P_{C} (E)$ , we should be able to construct a trader $T_{Φ}$ that gains as long as $μ_{n} \notin Φ$ . This trader will buy shares of some $f : O^{ω} \to R$ s.t. $E_{μ_{n}} [f] < \min_{ν \in Φ} E_{ν} [f]$ . Such an $f$ is guaranteed to exist by the Hahn-Banach separation theorem. Moreover, the size of the separation can probably be made $(\max f - \min f) d_{tv} (μ_{n}, Φ)$ . When the true environment satisfies $μ \in Φ$ and $T_{Φ}$ is included in the set of traders that define the inductor, this should imply that $d_{tv} (μ_{n}, Φ) \to 0$ , and in particular that the policy of the agent is asymptotically Pareto optimal for $Φ$ .
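
As a sanity check on the separation claim, here is a minimal Python sketch for one very simple incomplete model on a finite outcome set: $Φ = \{ν : ν(A) \geq p\}$ for a fixed event $A$ and threshold $p$ . This special case, the finite outcome set, and the names `expectation` and `trader_order` are all assumptions of the sketch, not part of the proposal. Here the separating share is just the indicator of $A$ : its price under the market is $μ_{n}(A)$ , while every $ν \in Φ$ values it at least $p$ . With $d_{tv}$ taken as the supremum over events of the difference in probabilities, the gap $p - μ_{n}(A)$ equals $d_{tv} (μ_{n}, Φ)$ , and $\max f - \min f = 1$ .

```python
def expectation(dist, f):
    """Expected value of the share f under the distribution dist."""
    return sum(p * f[world] for world, p in dist.items())


def trader_order(mu, event, p_min):
    """Toy trader for the model Φ = {ν : ν(event) >= p_min}.

    Returns (f, gap). f is the share to buy (the indicator of the event) and
    gap = p_min - mu(event) is the amount by which every ν in Φ values f
    above the market price. Returns (None, 0.0) when mu already lies in Φ.
    """
    f = {world: (1.0 if world in event else 0.0) for world in mu}
    market_price = expectation(mu, f)
    gap = p_min - market_price
    if gap <= 0:
        return None, 0.0   # the market conforms to the model: no trade
    return f, gap


# Hypothetical market state that underprices the event {"a", "b"}.
mu_n = {"a": 0.1, "b": 0.2, "c": 0.7}
share, gap = trader_order(mu_n, {"a", "b"}, p_min=0.5)
print(share, gap)
```

Whether a trader like this can be made to satisfy the technical requirements on traders in the general case is exactly what remains to be worked out; the sketch only illustrates why buying an underpriced $f$ is profitable when the true environment lies in $Φ$ .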