Defining Optimization in a Deeper Way Part 1

My aim is to define optimization without making reference to the following things:

A “null” action or “nonexistence” of the optimizer. This is generally poorly defined, and choices of different null actions give different answers.
Repeated action. An optimizer should still count even if it only does a single action.
Uncertainty. We should be able to define an optimizer in a fully deterministic universe.
Absolute time. This will be the hardest, but it would be nice to define optimization without reference to the “state” of the universe at “time t”.

Attempt one

First let’s just eliminate the concept of a null action. Imagine the state of the universe at a time $t$.

Let’s divide the universe into two sections and call these $A$ and $B$. They have states $S^{A}$ and $S^{B}$. If we want to use continuous states we’ll need to have some metric $D (s_{1}, s_{2})$ which applies to these states, so we can calculate things like the variance and entropy of probability distributions over them.

Treat $S^{A}$ and $S^{B}$ as part of a Read-Eval-Print-Loop. Each $S_{t}^{A}$ produces some output $O_{t}^{A}$ which acts like a function mapping $S_{t}^{B} \to S_{t + 1}^{B}$, and vice versa. $O_{t}^{A}$ can be thought of as things which cross the Markov blanket.

Sadly we still have to introduce probability distributions. Let’s consider a joint probability distribution $P_{t}^{A B} (s^{A}, s^{B})$, and also the two individual probability distributions $P_{t}^{A} (s^{A})$ and $P_{t}^{B} (s^{B})$.

By defining distributions over the $O$ outputs based on the distribution $P_{t}^{A B}$, we can define $P_{t + 1}^{A B} (s^{A}, s^{B})$ in the “normal” way. This looks like integrating over the space of $s^{A}$ and $s^{B}$ like so:

$P_{t + 1}^{A B} (s_{t + 1}^{A}, s_{t + 1}^{B}) = \int P_{t}^{A B} (s_{t}^{A}, s_{t}^{B}) \, \delta (O_{t}^{A} (s_{t}^{B}) - s_{t + 1}^{B}) \, \delta (O_{t}^{B} (s_{t}^{A}) - s_{t + 1}^{A}) \, d s_{t}^{A} \, d s_{t}^{B}$

What this is basically saying is that to find the probability of a given pair of states $s_{t + 1}^{A}$ and $s_{t + 1}^{B}$, we integrate over all states $s_{t}^{A}$ and $s_{t}^{B}$ and add up the probability of those where the $O_{t}^{A}$ corresponding to $s_{t}^{A}$ maps $s_{t}^{B}$ to the given $s_{t + 1}^{B}$, and the $O_{t}^{B}$ corresponding to $s_{t}^{B}$ maps $s_{t}^{A}$ to the given $s_{t + 1}^{A}$.
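For finite state spaces the integral becomes a sum and the delta functions become a lookup. A minimal Python sketch of that discrete version, assuming each output is stored as a nested dictionary (the names `push_forward`, `out_a`, `out_b` and the two-state toy system are illustrative, not part of the construction above):

```python
from collections import defaultdict


def push_forward(joint, out_a, out_b):
    """Advance a discrete joint distribution over (s_a, s_b) by one time step.

    joint: dict mapping (s_a, s_b) -> probability
    out_a: out_a[s_a] maps each s_b to the next s_b (plays the role of O^A)
    out_b: out_b[s_b] maps each s_a to the next s_a (plays the role of O^B)
    """
    nxt = defaultdict(float)
    for (s_a, s_b), p in joint.items():
        next_b = out_a[s_a][s_b]  # A's output acts on B's state
        next_a = out_b[s_b][s_a]  # B's output acts on A's state
        nxt[(next_a, next_b)] += p  # the deltas select exactly one target state
    return dict(nxt)


# Hypothetical toy system with states 0 and 1:
# A flips B when A is 1, and B overwrites A with B's own state.
out_a = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
out_b = {0: {0: 0, 1: 0}, 1: {0: 1, 1: 1}}

joint = {(0, 0): 0.5, (1, 1): 0.5}
next_joint = push_forward(joint, out_a, out_b)
print(next_joint)  # {(0, 0): 0.5, (1, 0): 0.5}
```

Because the dynamics are deterministic, each starting state sends all of its mass to a single successor, so total probability is conserved.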

Now let’s define an “uncorrelated” version of $P_{t + 1}^{A B}$, which we will refer to as $P_{t + 1}^{' A B}$.

$P_{t + 1}^{' A B} (s_{t + 1}^{A}, s_{t + 1}^{B}) = \int P_{t}^{A} (s_{t}^{A}) \, P_{t}^{B} (s_{t}^{B}) \, \delta (O_{t}^{A} (s_{t}^{B}) - s_{t + 1}^{B}) \, \delta (O_{t}^{B} (s_{t}^{A}) - s_{t + 1}^{A}) \, d s_{t}^{A} \, d s_{t}^{B}$

This loosely represents what happens if we decorrelate $s^{A}$ and $s^{B}$. In the language of humans, this is like an agent taking a random move from a selection.

We can refer to a probability distribution $P_{t}^{A B}$ as an “optimizing” probability distribution if $P_{t + 1}^{' A B}$ has higher entropy than $P_{t + 1}^{A B}$.
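The two ingredients of this definition, decorrelating a joint distribution and measuring entropy, can be sketched for discrete states as follows. This is an illustration only: the helper names `entropy_bits` and `decorrelate` and the toy distribution are assumptions, and the sketch shows the decorrelation at time $t$ before any dynamics are applied.

```python
import math
from collections import defaultdict


def entropy_bits(dist):
    """Shannon entropy in bits of a dict mapping state -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def decorrelate(joint):
    """Replace a joint distribution over (s_a, s_b) by the product of its marginals."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (s_a, s_b), p in joint.items():
        p_a[s_a] += p
        p_b[s_b] += p
    return {(a, b): pa * pb for a, pa in p_a.items() for b, pb in p_b.items()}


# Hypothetical, perfectly correlated toy distribution
correlated = {("x", "x"): 0.5, ("y", "y"): 0.5}
independent = decorrelate(correlated)

print(entropy_bits(correlated))   # 1.0
print(entropy_bits(independent))  # 2.0
```

The definition compares entropies one step later, after both the joint distribution and its decorrelated version have been pushed through the same dynamics.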

For an example, imagine the universe is divided into two parts: a room $R$ and a thermostat $T$. The room can be in one of the states $s^{R} \in \{\text{hot}, \text{warm}, \text{cold}\}$, and the thermostat can be in one of the states $s^{T} \in \{\text{high}, \text{low}, \text{off}\}$. Imagine that $O^{R}$ and $O^{T}$ are defined as follows:

$O^{T} [\text{high}] : \begin{cases} \text{hot} \to \text{hot} \\ \text{warm} \to \text{hot} \\ \text{cold} \to \text{warm} \end{cases}$
$O^{T} [\text{low}] : \begin{cases} \text{hot} \to \text{hot} \\ \text{warm} \to \text{warm} \\ \text{cold} \to \text{cold} \end{cases}$
$O^{T} [\text{off}] : \begin{cases} \text{hot} \to \text{warm} \\ \text{warm} \to \text{cold} \\ \text{cold} \to \text{cold} \end{cases}$

$O^{R} [\text{hot}] : \begin{cases} \text{high} \to \text{off} \\ \text{low} \to \text{off} \\ \text{off} \to \text{off} \end{cases}$
$O^{R} [\text{warm}] : \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{low} \\ \text{off} \to \text{low} \end{cases}$
$O^{R} [\text{cold}] : \begin{cases} \text{high} \to \text{high} \\ \text{low} \to \text{high} \\ \text{off} \to \text{high} \end{cases}$

Basically, the thermostat’s state decides whether the room gets warmer, stays the same, or gets colder, and the room’s temperature sets the thermostat’s next state.

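We can also consider the probability mass flowing from each of the nine joint states to another one. Rows give the current thermostat state, columns give the current room state, and each cell gives the resulting (room, thermostat) pair:

| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $(\text{hot}, \text{off})$ | $(\text{hot}, \text{low})$ | $(\text{warm}, \text{high})$ |
| $\text{low}$ | $(\text{hot}, \text{off})$ | $(\text{warm}, \text{low})$ | $(\text{cold}, \text{high})$ |
| $\text{off}$ | $(\text{warm}, \text{off})$ | $(\text{cold}, \text{low})$ | $(\text{cold}, \text{high})$ |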

Imagine the following $P_{0}^{T R} (s^{T}, s^{R})$ :

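| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $0$ | $1 / 3$ |
| $\text{low}$ | $0$ | $1 / 3$ | $0$ |
| $\text{off}$ | $1 / 3$ | $0$ | $0$ |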

This will give us the following $P_{1}^{T R} (s^{T}, s^{R})$ :

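| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $1 / 3$ | $0$ |
| $\text{low}$ | $0$ | $1 / 3$ | $0$ |
| $\text{off}$ | $0$ | $1 / 3$ | $0$ |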

Which has 1.6 bits of entropy.

And the following $P_{1}^{' T R} (s^{T}, s^{R})$ :

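| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $1 / 9$ | $2 / 9$ |
| $\text{low}$ | $1 / 9$ | $1 / 9$ | $1 / 9$ |
| $\text{off}$ | $2 / 9$ | $1 / 9$ | $0$ |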

Which has 2.7 bits of entropy.

This means that the joint-ness of the probability distribution $P_{0}^{T R}$ has removed 1.1 bits of entropy from the system. We say that our choice of $P_{0}^{T R}$ is optimizing, with an optimizing strength of 1.1 bits.
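The figures above can be reproduced with a short script. This is a sketch under the assumption that the joint state is stored as a (thermostat, room) pair and that $O^{T}$ and $O^{R}$ are encoded as nested dictionaries; the names `O_T`, `O_R`, `step`, `entropy_bits` and `decorrelate` are illustrative.

```python
import math
from collections import defaultdict

# O^T[thermostat state]: room state -> next room state
O_T = {
    "high": {"hot": "hot", "warm": "hot", "cold": "warm"},
    "low": {"hot": "hot", "warm": "warm", "cold": "cold"},
    "off": {"hot": "warm", "warm": "cold", "cold": "cold"},
}
# O^R[room state]: thermostat state -> next thermostat state
O_R = {
    "hot": {t: "off" for t in O_T},
    "warm": {t: "low" for t in O_T},
    "cold": {t: "high" for t in O_T},
}


def step(dist):
    """Push a distribution over (thermostat, room) pairs forward one time step."""
    nxt = defaultdict(float)
    for (t, r), p in dist.items():
        nxt[(O_R[r][t], O_T[t][r])] += p
    return dict(nxt)


def entropy_bits(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def decorrelate(dist):
    """Product of the thermostat and room marginals."""
    p_t, p_r = defaultdict(float), defaultdict(float)
    for (t, r), p in dist.items():
        p_t[t] += p
        p_r[r] += p
    return {(t, r): pt * pr for t, pt in p_t.items() for r, pr in p_r.items()}


p0 = {("high", "cold"): 1 / 3, ("low", "warm"): 1 / 3, ("off", "hot"): 1 / 3}
h_joint = entropy_bits(step(p0))                 # entropy of P_1
h_product = entropy_bits(step(decorrelate(p0)))  # entropy of P'_1
strength = h_product - h_joint
print(f"{h_joint:.2f} {h_product:.2f} {strength:.2f}")  # 1.58 2.73 1.14
```

To two decimal places the entropies are 1.58 and 2.73 bits, which round to the 1.6 and 2.7 quoted above, and the strength is 1.14 bits.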

But what if we consider a “smarter” thermostat, which turns off just before the temperature changes?

$O^{R} [\text{hot}] : \begin{cases} \text{high} \to \text{off} \\ \text{low} \to \text{off} \\ \text{off} \to \text{low} \end{cases}$
$O^{R} [\text{warm}] : \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{low} \\ \text{off} \to \text{low} \end{cases}$
$O^{R} [\text{cold}] : \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{high} \\ \text{off} \to \text{high} \end{cases}$

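The transitions between the nine joint states, with $O^{T}$ unchanged, are now:

| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $(\text{hot}, \text{off})$ | $(\text{hot}, \text{low})$ | $(\text{warm}, \text{low})$ |
| $\text{low}$ | $(\text{hot}, \text{off})$ | $(\text{warm}, \text{low})$ | $(\text{cold}, \text{high})$ |
| $\text{off}$ | $(\text{warm}, \text{low})$ | $(\text{cold}, \text{low})$ | $(\text{cold}, \text{high})$ |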

With the same choice of $P_{0}^{T R} (s^{T}, s^{R})$ :

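| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $0$ | $1 / 3$ |
| $\text{low}$ | $0$ | $1 / 3$ | $0$ |
| $\text{off}$ | $1 / 3$ | $0$ | $0$ |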

This will give us the following $P_{1}^{T R} (s^{T}, s^{R})$ :

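| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $0$ | $0$ |
| $\text{low}$ | $0$ | $1$ | $0$ |
| $\text{off}$ | $0$ | $0$ | $0$ |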

With an entropy of zero.

And the following $P_{1}^{' T R} (s^{T}, s^{R})$ :

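| $s^{T}$ (row), $s^{R}$ (column) | $\text{hot}$ | $\text{warm}$ | $\text{cold}$ |
| --- | --- | --- | --- |
| $\text{high}$ | $0$ | $0$ | $2 / 9$ |
| $\text{low}$ | $1 / 9$ | $1 / 3$ | $1 / 9$ |
| $\text{off}$ | $2 / 9$ | $0$ | $0$ |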

Which has 2.2 bits of entropy.

In the new system, $P_{0}^{T R}$ has an optimizing strength of 2.2 bits, approximately twice as much. This indicates that the latter system is “better” at optimizing the distribution $P_{0}^{T R}$ in some way.
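As a check on the 2.2-bit figure, the entropy of $P_{1}^{' T R}$ for the smarter thermostat can be worked out by hand from its nonzero entries (two of $2 / 9$, two of $1 / 9$, and one of $1 / 3$):

$H (P_{1}^{' T R}) = 2 \cdot \frac{2}{9} \log_{2} \frac{9}{2} + 2 \cdot \frac{1}{9} \log_{2} 9 + \frac{1}{3} \log_{2} 3 \approx 0.964 + 0.704 + 0.528 \approx 2.20 \text{ bits}$

Since $P_{1}^{T R}$ puts all of its mass on one state, its entropy is zero and the optimizing strength is the full 2.20 bits. The same calculation for the first thermostat gives about $2.73 - 1.58 = 1.14$ bits before rounding, so the ratio between the two strengths is closer to 1.9 than to exactly 2.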

So we have eliminated the idea of needing the optimizer to have clearly-defined existence/nonexistence cases, or needing some “null” action to compare its outputs to. This is good. We have also eliminated the concept of repeated action.

Next I will attempt to eliminate the need to start with a probability distribution. In both of the examples above, our choice of $P_{0}^{T R}$ was important. I want to find a more “natural” way of defining probability distributions.