Here is a way to construct many learnable undogmatic ontologies, including some with finite state spaces.
A deterministic partial environment (DPE) over action set A and observation set O is a pair (D,ϕ) where D⊆(O×A)∗ and ϕ:D→O s.t.
If h∈(O×A)∗ is a prefix of some g∈D, then h∈D.
If h,g∈D, p∈O and hp is a prefix of g, then ϕ(h)=p.
DPEs are equipped with a natural partial order. Namely, (D,ϕ)≤(E,ψ) when D⊆E and ϕ=ψ|D.
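As an illustrative sketch (not part of the post), a finite DPE can be modeled in Python as a dict mapping histories, represented as tuples of (observation, action) pairs, to next observations; the two axioms and the partial order then become direct checks. All names here (`is_dpe`, `leq`) are hypothetical.

```python
# A finite DPE (D, phi) as a dict: keys are histories in (O x A)*,
# encoded as tuples of (obs, act) pairs; values are phi's next observations.

def is_dpe(phi):
    """Check the two DPE axioms for a finite phi given as a dict."""
    D = set(phi)
    for g in D:
        for k in range(len(g)):
            h = g[:k]
            # Axiom 1: D is prefix-closed (every prefix in (O x A)* of g is in D).
            if h not in D:
                return False
            # Axiom 2: the observation recorded right after h inside g
            # (the first component of the pair g[k]) must agree with phi(h).
            if phi[h] != g[k][0]:
                return False
    return True

def leq(phi1, phi2):
    """The natural partial order: (D1, phi1) <= (D2, phi2) iff D1 is a
    subset of D2 and phi2 restricted to D1 equals phi1."""
    return all(h in phi2 and phi2[h] == phi1[h] for h in phi1)
```

For example, `{(): "o"} <= {(): "o", (("o","a"),): "o"}` under `leq`, while a dict whose domain is not prefix-closed fails `is_dpe`.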
Let S be a strong upwards antichain in the DPE poset which doesn’t contain the bottom DPE (i.e. the DPE with D=∅). Then, it naturally induces an infra-POMDP. Specifically:
The state space is S.
The initial infradistribution is ⊤S.
The observation mapping is ω(D,ϕ):=ϕ(ϵ), where ϵ is the empty history.
The transition infrakernel is T(D,ϕ;a):=⊤N(D,ϕ;a), where
N(D,ϕ;a):={(E,ψ)∈S | ∀h∈(O×A)∗: ϕ(ϵ)ah∈D ⟹ h∈E ∧ ψ(h)=ϕ(ϕ(ϵ)ah)}
If N(D,ϕ;a) is non-empty for all (D,ϕ)∈S and a∈A, this is a learnable undogmatic ontology.
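As an illustrative sketch (not from the post), the successor set N(D,ϕ;a) can be computed directly for finite DPEs, again representing a DPE as a dict from histories (tuples of (observation, action) pairs) to next observations. The name `successors` is hypothetical.

```python
def successors(phi, a, S):
    """Compute N(D, phi; a): the states (E, psi) in the candidate set S such
    that for every history h with phi(eps)*a*h in D, we have h in E and
    psi(h) = phi(phi(eps)*a*h)."""
    o = phi[()]                      # phi(eps): the first observation
    result = []
    for psi in S:
        ok = True
        for g in phi:
            # Only histories of the form (o, a) * h in D impose constraints.
            if g and g[0] == (o, a):
                h = g[1:]            # the tail after the first (obs, act) pair
                if h not in psi or psi[h] != phi[g]:
                    ok = False
                    break
        if ok:
            result.append(psi)
    return result
```

With this, the non-emptiness hypothesis for a finite S can be verified by checking `successors(phi, a, S)` for every state and action.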
Any n∈N yields an example Sn. Namely, (D,ϕ)∈Sn iff D≠∅ and for any h∈D it holds that:
|h|≤n
If |h|<n then for any a∈A, hϕ(h)a∈D.
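As an illustrative sketch (not from the post), membership in Sₙ can be checked mechanically for finite DPEs, assuming the dict representation of a DPE (histories as tuples of (observation, action) pairs) and the convention that a history h∈D extends by one step to h·ϕ(h)·a. The name `in_S_n` is hypothetical.

```python
def in_S_n(phi, n, actions):
    """Test membership of a finite DPE (D, phi) in S_n: D is non-empty,
    every history in D has length at most n, and every history shorter
    than n extends by one step for every action."""
    if not phi:                      # D must be non-empty
        return False
    for h in phi:
        if len(h) > n:
            return False
        if len(h) < n:
            # h must extend via the pair (phi(h), a) for each action a.
            for a in actions:
                if h + ((phi[h], a),) not in phi:
                    return False
    return True
```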
I think that for some non-trivial hidden reward functions over such an ontology, the class of communicating RUMDPs is learnable. If the hidden reward function doesn’t depend on the action argument, it’s equivalent to some instrumental reward function.