Here are some observations about the kinds of fitness functions I have been running experiments on for AI interpretability. The phenomena stated in this post were determined experimentally, without rigorous mathematical proof, and they only occur some of the time.
Suppose that $F:X\rightarrow[-\infty,\infty)$ is a continuous fitness function. In an ideal universe, we would like the function $F$ to have just one local maximum. If $F$ has just one local maximum, we say that $F$ is maximized pseudodeterministically (or simply that $F$ is pseudodeterministic). At the very least, we would like there to be just one real number of the form $F(x)$ where $x$ is a local maximum; in this case, all local maxima will typically be related by some sort of symmetry. Pseudodeterministic fitness functions seem quite interpretable to me. If there are many local maximum values and the local maximum value that we attain after training depends on things such as the initialization, then the local maximum will contain random/pseudorandom information independent of the training data, and the local maximum will be difficult to interpret. A fitness function with a single local maximum value behaves more mathematically than a fitness function with many local maximum values, and such mathematical behavior should help with interpretability; indeed, the only reason I have been able to interpret pseudodeterministic fitness functions before is that they behave mathematically and have a unique local maximum value.
Set $O=F^{-1}[(-\infty,\infty)]=X\setminus F^{-1}[\{-\infty\}]$. If the set $O$ is disconnected (in the topological sense) and $F$ behaves differently on each of the components of $O$, then a unique local maximum is impossible, but in this post we shall explore a case where each component of $O$ still has a unique local maximum value.
Let $m_0,\dots,m_n$ be positive integers with $m_0=m_n=1$ and $m_1\geq 1,\dots,m_{n-1}\geq 1$. Let $r_0,\dots,r_{n-1}$ be other natural numbers. The set $X$ is the collection of all tuples $A=(A_{i,j})_{i,j}$ where each $A_{i,j}$ is a real $m_{i+1}\times m_i$ matrix, the indices range over $i\in\{0,\dots,n-1\}$ and $j\in\{1,\dots,r_i\}$, and where, for each $i$, the tuple $(A_{i,j})_j$ is not identically zero.
The training data is a set $\Sigma$ consisting of input/label pairs $(u,v)$ where $v\in\{-1,1\}$ and $u=(u_0,\dots,u_{n-1})$ with each $u_i$ a subset of $\{1,\dots,r_i\}$ (i.e., $\Sigma$ is training data for a binary classifier where $u$ is the encoded network input and $v$ is the label).
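To make the setup concrete, here is a minimal sketch in Python/NumPy of a point of the parameter space $X$ and a toy training set $\Sigma$. The particular widths, matrix counts, and data below are invented for illustration only; they do not come from any actual experiment, and Python indexes $j$ from $0$ where the post indexes from $1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths m_0, ..., m_n with m_0 = m_n = 1, and matrix counts r_0, ..., r_{n-1}.
m = [1, 4, 4, 1]   # m_0, m_1, m_2, m_3 (illustrative sizes)
r = [3, 5, 3]      # r_0, r_1, r_2 (illustrative counts)
n = len(r)

# A point of X: A[i][j] is a real m_{i+1} x m_i matrix, j = 0, ..., r_i - 1.
A = [[rng.standard_normal((m[i + 1], m[i])) for j in range(r[i])] for i in range(n)]

# Toy training pairs (u, v): u = (u_0, ..., u_{n-1}) with u_i a (non-empty)
# subset of {0, ..., r_i - 1}, and v in {-1, 1}.  Made-up data.
Sigma = [(({0, 2}, {1, 3, 4}, {0, 1}), 1),
         (({1}, {0, 2}, {2}), -1)]
```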
Define $W(u,A)=\left(\sum_{j\in u_{n-1}}A_{n-1,j}\right)\cdots\left(\sum_{j\in u_0}A_{0,j}\right)$. Now, we define our fitness function by setting
$$F(A)=\frac{1}{|\Sigma|}\sum_{(u,v)\in\Sigma}\log(|W(u,A)|)-\frac{1}{2}\sum_i\log\left(\left\|\sum_j A_{i,j}A_{i,j}^*\right\|_p\right)=\mathbb{E}(\log(|W(u,A)|))-\frac{1}{2}\sum_i\log\left(\left\|\sum_j A_{i,j}A_{i,j}^*\right\|_p\right)$$
where the expected value is taken with respect to selecting an element $(u,v)\in\Sigma$ uniformly at random. Here, $\|\cdot\|_p$ is the Schatten $p$-norm, which is just the $\ell^p$-norm of the singular values of the matrix. Observe that the fitness function $F$ only depends on the list $(u:(u,v)\in\Sigma)$, so $F$ does not depend on the training data labels.
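Continuing the sketch, here is how $W(u,A)$ and $F(A)$ could be computed from these definitions. Since the matrices are real, $A^*$ is just the transpose, and the choice $p=2$ below is a placeholder (the post does not fix a value of $p$).

```python
def W(u, A):
    """W(u, A) = (sum_{j in u_{n-1}} A_{n-1,j}) ... (sum_{j in u_0} A_{0,j})."""
    out = np.eye(A[0][0].shape[1])   # m_0 x m_0 identity (m_0 = 1)
    for i, ui in enumerate(u):       # assumes each u_i is non-empty
        out = sum(A[i][j] for j in ui) @ out
    return out.item()                # a 1 x 1 matrix, so a scalar

def schatten_norm(M, p):
    """Schatten p-norm: the l^p norm of the singular values of M."""
    return np.linalg.norm(np.linalg.svd(M, compute_uv=False), ord=p)

def F(A, Sigma, p=2):
    """E log|W(u, A)|  -  (1/2) sum_i log || sum_j A_{i,j} A_{i,j}^* ||_p."""
    data_term = np.mean([np.log(abs(W(u, A))) for (u, v) in Sigma])
    reg_term = sum(np.log(schatten_norm(sum(Aij @ Aij.T for Aij in Ai), p))
                   for Ai in A) / 2
    return data_term - reg_term
```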
Observe that $O=X\setminus\bigcup_{(u,v)\in\Sigma}\{A\in X:W(u,A)=0\}$, which is a disconnected open set. Define a function $f:O\rightarrow\{-1,1\}^\Sigma$ by setting $f(A)=(W(u,A)/|W(u,A)|)_{(u,v)\in\Sigma}$. Observe that if $x,y$ belong to the same component of $O$, then $f(x)=f(y)$.
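In the sketch, the sign vector $f(A)$ is straightforward to compute, and it identifies which component of $O$ the current parameters lie in:

```python
def sign_vector(A, Sigma):
    """f(A) = (W(u, A) / |W(u, A)|)_{(u, v) in Sigma}; constant on each component of O."""
    return tuple(int(np.sign(W(u, A))) for (u, v) in Sigma)
```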
While the fitness function $F$ has many local maximum values, $F$ seems to typically have at most one local maximum value per component. More specifically, for each $\alpha\in\{-1,1\}^\Sigma$, the set $f^{-1}[\{\alpha\}]$ seems to typically be a connected open set where $F$ has just one local maximum value (maybe the other local maxima are hard to find, but if they are hard to find, they are irrelevant).
Let $\Omega=f^{-1}[\{(v)_{(u,v)\in\Sigma}\}]$. Then $\Omega$ is a (possibly empty) open subset of $O$, and there tends to be a unique (up-to-symmetry) $A_0\in\Omega$ where $F(A_0)$ is locally maximized. This unique $A_0$ is the machine learning model that we obtain when training on the data set $\Sigma$. To obtain $A_0$, we first perform an optimization that works well enough to get inside the open set $\Omega$. For example, to get inside $\Omega$, we could try to maximize the fitness function $\sum_{(u,v)\in\Sigma}\arctan(v\cdot W(u,A))$. We then maximize $F$ inside the open set $\Omega$ to obtain our local maximum.
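Here is a minimal sketch of this two-phase procedure, using naive finite-difference gradient ascent as a stand-in for whatever optimizer one would actually use (the post does not specify one); the step counts and learning rates are arbitrary, and this is only workable at toy sizes.

```python
def ascend(objective, A, steps=200, lr=1e-2, eps=1e-5):
    """Naive finite-difference gradient ascent; a placeholder for a real optimizer."""
    flat = np.concatenate([Aij.ravel() for Ai in A for Aij in Ai])

    def unflatten(x):
        out, k = [], 0
        for Ai in A:
            row = []
            for Aij in Ai:
                row.append(x[k:k + Aij.size].reshape(Aij.shape))
                k += Aij.size
            out.append(row)
        return out

    base = objective(unflatten(flat))
    for _ in range(steps):
        grad = np.zeros_like(flat)
        for k in range(flat.size):   # O(dim) evaluations per step: toy sizes only
            bumped = flat.copy()
            bumped[k] += eps
            grad[k] = (objective(unflatten(bumped)) - base) / eps
        flat += lr * grad
        base = objective(unflatten(flat))
    return unflatten(flat)

# Phase 1: get inside Omega by pushing each v * W(u, A) positive.
phase1 = lambda B: sum(np.arctan(v * W(u, B)) for (u, v) in Sigma)
A1 = ascend(phase1, A)

# Phase 2: maximize F itself inside Omega (F diverges to -inf where some
# W(u, A) = 0, so the ascent tends to stay in the component it starts in).
A0 = ascend(lambda B: F(B, Sigma), A1)
```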
After training, we obtain a function $g$ defined by $g(u)=W(u,A_0)$ (written $g$ to avoid clashing with the sign vector $f$ above). Observe that $g$ is a multilinear function. The function $g$ is highly regularized, so if we want better performance, we should tone down the amount of regularization, but this can be done without compromising pseudodeterminism. The function $g$ has been trained so that $g(u)/|g(u)|=v$ for each $(u,v)\in\Sigma$, but also so that $|g(u)|$ is large compared to what we might expect whenever $(u,v)\in\Sigma$. In other words, $g$ is helpful in determining whether $(u,v)$ belongs to $\Sigma$ or not, since one can examine the magnitude and sign of $g(u)$.
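Continuing the sketch, the trained model would then be evaluated like this, reading the predicted label from the sign of $g(u)$ and a confidence from its magnitude:

```python
def g(u):
    """The trained multilinear model: g(u) = W(u, A_0)."""
    return W(u, A0)

# Sign recovers the label; large magnitude suggests u resembles the training inputs.
for (u, v) in Sigma:
    print(v, int(np.sign(g(u))), abs(g(u)))
```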
In order to maximize AI safety, I want to produce inherently interpretable AI algorithms that perform well on difficult tasks. Right now, the function $g$ (and other functions that I have designed) can do some machine learning tasks, but it is not ready to replace neural networks; still, I have a few ideas about how to improve the performance of my AI algorithms without compromising pseudodeterminism. I do not believe that pseudodeterministic machine learning will increase AI risks too much, because when designing these pseudodeterministic algorithms we are trading some (but hopefully not too much) performance for increased interpretability, and this tradeoff is good for safety: it increases interpretability without increasing performance.