I propose training d using something like this:
Suppose d outputs a binary string of length n. It starts by outputting a probability, and the first bit is sampled using this probability. The agent is told this bit and then outputs a probability for the next bit, which is sampled in turn, and so on until n bits have been output. Now it’s easy to estimate the entropy of d: by the chain rule, the sum of the binary entropies of the conditional probabilities along a sampled path is an unbiased estimate of the entropy of the whole string.
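The scheme above can be sketched as follows. This is a minimal illustration, not part of the proposal: `Agent`, `binEntropy`, and `sampleWithEntropy` are names I’m introducing, the agent is modelled as a function from the bits produced so far to the probability that the next bit is 1, and randomness is supplied as a list of uniform [0,1) draws to keep the sketch deterministic.

```haskell
-- The agent, viewed autoregressively: given the bits emitted so far,
-- output the probability that the next bit is 1.
type Agent = [Bool] -> Double

-- Binary entropy in bits (0 at p = 0 or p = 1).
binEntropy :: Double -> Double
binEntropy p
  | p <= 0 || p >= 1 = 0
  | otherwise        = negate (p * logBase 2 p + (1 - p) * logBase 2 (1 - p))

-- Sample n bits, consuming one uniform draw per bit. Returns the sampled
-- string and a one-sample entropy estimate: the sum of the conditional
-- binary entropies along the sampled path, whose expectation is
-- H(X_1 .. X_n) by the chain rule.
sampleWithEntropy :: Agent -> Int -> [Double] -> ([Bool], Double)
sampleWithEntropy agent n = go [] n 0
  where
    go bits 0 acc _        = (reverse bits, acc)
    go bits k acc (u : us) =
      let p = agent (reverse bits)   -- conditional P(next bit = 1)
          b = u < p                  -- sample the bit
      in go (b : bits) (k - 1) (acc + binEntropy p) us
    go bits _ acc []       = (reverse bits, acc)  -- ran out of randomness
```

For the fair-coin agent `const 0.5` every conditional entropy is 1 bit, so the one-sample estimate is exactly n; for less uniform agents you would average the estimate over many sampled paths.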
Perhaps outputting the distribution over strings bit by bit like this is difficult. But there are other ways of representing distributions that allow sampling from them and estimating their entropy. Consider the following algebraic data type, meant to represent distributions:
```haskell
data Distr a where
  Unit    :: Distr ()
  Flip    :: Double -> Distr Bool
  InvFmap :: (a <-> b) -> Distr a -> Distr b
  Samp    :: (a -> Distr b) -> Distr a -> Distr (a, b)
```
Unit is the delta distribution on (); Flip flips a coin with a given probability; InvFmap composes a distribution with an invertible function (which could be represented as a pair of functions f and g: to run the invertible function on an input x, take f(x) and check that g(f(x)) = x, throwing an error otherwise); Samp is similar to the monadic bind >>=, but keeps the original sample around: a sample from Samp f a is (x, y) where x ∼ a and y ∼ f(x). The intention is for everything to be fully reversible, so there is no way to throw away information.
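Here is one way a sampler and entropy estimator for this type could look. This is a hedged sketch with two assumptions that go beyond the type as stated: the invertible-function argument of InvFmap is rendered as a plain pair (f, g), and an Eq constraint is added so the g(f(x)) = x check is expressible.

```haskell
{-# LANGUAGE GADTs #-}

-- Sketch of the type above, with InvFmap's invertible function represented
-- as an explicit pair (f, g); the Eq constraint is my addition.
data Distr a where
  Unit    :: Distr ()
  Flip    :: Double -> Distr Bool
  InvFmap :: Eq a => (a -> b, b -> a) -> Distr a -> Distr b
  Samp    :: (a -> Distr b) -> Distr a -> Distr (a, b)

-- Draw one sample, consuming uniform [0,1) numbers; return the sample, its
-- surprisal in bits, and the unused randomness. The expected surprisal is
-- the entropy, so averaging over many draws estimates H.
sample :: Distr a -> [Double] -> (a, Double, [Double])
sample Unit us = ((), 0, us)
sample (Flip p) (u : us)
  | u < p     = (True,  negate (logBase 2 p),       us)
  | otherwise = (False, negate (logBase 2 (1 - p)), us)
sample (Flip _) [] = error "ran out of randomness"
sample (InvFmap (f, g) d) us =
  let (x, s, us') = sample d us
      y           = f x
  in if g y == x                 -- the reversibility check
       then (y, s, us')         -- a bijection preserves surprisal
       else error "not invertible at this point"
sample (Samp f d) us =
  let (x, sx, us')  = sample d us
      (y, sy, us'') = sample (f x) us'
  in ((x, y), sx + sy, us'')    -- surprisals add: -log p(x) - log p(y|x)
```

Because Samp keeps the first sample around, the surprisal of the pair is exactly -log p(x) - log p(y|x), which is what makes the estimate work without marginalizing anything out.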
(note that it’s also possible to have Unsamp :: (a -> Distr b) -> Distr (a, b) -> Distr a, intended as an inverse of Samp; I can give more details if you’re interested).
How do you estimate the entropy of this distribution?
If you want to force the computation to end up with no garbage, then this significantly restricts the range of functions you can compute, since the sampling process is invertible. In particular, they need to be functions such that the agent’s entire computation history can be reconstructed from the output alone, which feels like a restriction that may be dodging the hard cases of the informed oversight problem.
If you allow the agent to end up with garbage, then you can calculate the probability of that particular garbage-plus-outputs pair, but the entropy of that joint distribution can be unboundedly larger than the entropy of the distribution over outputs alone.
The approach I have in mind is (roughly) to let the agent output some number of bits of garbage, but penalize for the number of bits of garbage (so generating additional uniformly random garbage doesn’t make a difference to the score). I think this can be done using autoencoders (use layer n+1 to compress layer n into a small number of bits of garbage). It’s not clear whether this approach is practical for complex agents, though.
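The garbage penalty can be made concrete as follows. This is an illustrative sketch, not the proposal itself: `surprisalBits` and `penalizedScore` are names I’m introducing, and each emitted bit is paired with the probability the agent assigned to it coming up True.

```haskell
-- Surprisal (in bits) of a string of bits, each paired with the
-- probability the agent assigned to that bit being True.
surprisalBits :: [(Bool, Double)] -> Double
surprisalBits = sum . map bitSurprisal
  where bitSurprisal (b, p) = negate (logBase 2 (if b then p else 1 - p))

-- Score the joint (output, garbage) string, then refund one bit per
-- garbage bit. Uniformly random garbage contributes exactly one bit of
-- surprisal per bit, so padding with it leaves the score unchanged.
penalizedScore :: [(Bool, Double)] -> [(Bool, Double)] -> Double
penalizedScore output garbage =
  surprisalBits (output ++ garbage) - fromIntegral (length garbage)
```

With uniform garbage, `penalizedScore output garbage` equals `surprisalBits output`, which is the cancellation property the penalty is designed for; non-uniform garbage costs the agent whatever its surprisal falls short of one bit per bit.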
In the OWF example, though, the garbage is necessarily low-entropy (at least k bits short on entropy, where k is the size of the advice needed to invert the one-way function). Right?
Yes, that seems right. So this won’t work for that example.
Additional note: I think you can implement the approach I sketched (which I’m calling “reversible probabilistic programming”) using autoencoders. Represent each layer of the autoencoder by a single-layer neural net f : a -> Distr b and an approximate inverse g : b -> Distr a. Given a distribution for the previous layer x : Distr a, get the distribution for the next layer by taking Unsamp g (InvFmap (\(x, y) -> (y, x)) (Samp f x)) :: Distr b. Compose many of these to get a multi-layer generative model. This seems simple enough that there’s probably a simpler, more direct way to estimate the entropy of a generative model represented by an autoencoder.
(actually, I think this might not work, because the inverses won’t be very accurate, but maybe something like this works?)
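The layer construction can be written out as a self-contained sketch. Assumptions beyond the thread: the constructors are repeated here with Unsamp included, and InvFmap’s invertible function is rendered as a plain pair of functions; `layer` and `size` are illustrative names.

```haskell
{-# LANGUAGE GADTs #-}

-- The type from the thread, with Unsamp added and InvFmap's invertible
-- function represented as an explicit pair (f, g).
data Distr a where
  Unit    :: Distr ()
  Flip    :: Double -> Distr Bool
  InvFmap :: (a -> b, b -> a) -> Distr a -> Distr b
  Samp    :: (a -> Distr b) -> Distr a -> Distr (a, b)
  Unsamp  :: (a -> Distr b) -> Distr (a, b) -> Distr a

-- One reversible layer: sample the next layer with encoder f, swap the
-- pair, then use the approximate inverse g to uncompute the previous layer.
layer :: (a -> Distr b) -> (b -> Distr a) -> Distr a -> Distr b
layer f g x =
  Unsamp g (InvFmap (\(u, v) -> (v, u), \(v, u) -> (u, v)) (Samp f x))

-- Count the constructors in a model, just to exercise the structure.
size :: Distr a -> Int
size Unit          = 1
size (Flip _)      = 1
size (InvFmap _ d) = 1 + size d
size (Samp _ d)    = 1 + size d
size (Unsamp _ d)  = 1 + size d
```

A multi-layer generative model is then just a composition, e.g. `layer f2 g2 (layer f1 g1 base)`; the accuracy worry above is about how well each g actually inverts its f, which this sketch does not address.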