In this post, we shall go over a way to produce mostly linear machine learning classification models that output probabilities for each possible label. These mostly linear models are pseudodeterministically trained (or pseudodeterministic for short) in the sense that if we train them multiple times with different initializations, we will typically get the same trained model (up to symmetry and minuscule floating point differences).
The algorithms described in this post generalize to more complicated multi-layered models that remain pseudodeterministic, but for simplicity, we shall stick to linear operators here.
Let K denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let U be a finite dimensional inner product space over K. The training data is a set D of pairs (u,v) with u∈U and v∈{1,…,n}, where u is the machine learning model input and v is its label. The machine learning model is trained to predict the label v when given the input u. The trained model is a function f that maps U to the set of all probability vectors of length n, so the trained model actually gives the probabilities for each possible label.
Suppose that Vi is a finite dimensional inner product space over K for each i∈{1,…,n}. Then the domain of the fitness function consists of tuples (A1,…,An) where each Ai is a linear operator from U to Vi. Let p∈(0,1),r∈(0,∞),q∈(1,∞), and let λ≥0. The parameter p is the exponent while λ is the regularization parameter. Define (almost total) functions G,R,F:L(U,V1)×⋯×L(U,Vn)→R by setting
$$G(A_1,\dots,A_n)=\frac{1}{|D|}\sum_{(u,v)\in D}\left(\frac{\|A_v u\|^{r}}{\|A_1 u\|^{r}+\dots+\|A_n u\|^{r}}\right)^{p},$$
$$R(A_1,\dots,A_n)=\frac{\lambda}{|D|}\sum_{(u,v)\in D}\log(\|A_v u\|)-\frac{\lambda}{n}\bigl(\log(\|A_1\|_q)+\dots+\log(\|A_n\|_q)\bigr).$$
Here, ∥∗∥q denotes the Schatten q-norm, which can be defined by setting
$$\|A\|_q=\operatorname{Tr}\bigl((AA^*)^{q/2}\bigr)^{1/q}.$$
Set F=G+R. Here, F denotes our fitness function. The function G is what we really want to maximize, but unfortunately, G is typically non-pseudodeterministic, so we need to add the regularization term R to obtain pseudodeterminism. The regularization term R also has the added effect of making ∥Avu∥ relatively large compared to the norm ∥A∥q for training data points (u,v). This may be useful in deciding whether a pair should belong to the training data or the test data in the first place.
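To make the definitions concrete, here is a minimal sketch (my own illustration, not code from this post) of how F=G+R could be computed with JAX in the real case K=ℝ, with labels re-indexed as 0,…,n−1. The names fitness, schatten_norm, X (one input per row), y (integer labels), and the default hyperparameter values are all assumptions made for the example.

```python
import jax.numpy as jnp

def schatten_norm(A, q):
    """Schatten q-norm of A: the l^q norm of its singular values."""
    s = jnp.linalg.svd(A, compute_uv=False)
    return jnp.sum(s ** q) ** (1.0 / q)

def fitness(As, X, y, p=0.5, r=2.0, q=2.0, lam=0.1):
    """F = G + R for a tuple As = [A_1, ..., A_n] of real matrices (A_i maps U to V_i),
    inputs X with one row per data point, and 0-indexed integer labels y."""
    n, m = len(As), X.shape[0]
    # norms[i, k] = ||A_i x_k|| for data point x_k
    norms = jnp.stack([jnp.linalg.norm(X @ A.T, axis=1) for A in As])
    label_norms = norms[y, jnp.arange(m)]          # ||A_v u|| for each (u, v) in D
    G = jnp.mean((label_norms ** r / jnp.sum(norms ** r, axis=0)) ** p)
    R = lam * jnp.mean(jnp.log(label_norms)) \
        - (lam / n) * sum(jnp.log(schatten_norm(A, q)) for A in As)
    return G + R
```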
We observe that F is 0-homogeneous in the sense that F(A1,…,An)=F(cA1,…,cAn) for each non-zero scalar c (in the quaternionic case, the scalars are just the real numbers).
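Here is one way to check this claim (my own verification, using the absolute homogeneity of the Schatten q-norm, ∥cA∥q = |c|·∥A∥q). Each ratio appearing in G is unchanged when every Ai is scaled by c, and the two logarithmic parts of R shift by amounts that cancel:

```latex
% G is invariant because each ratio \|A_v u\|^r / (\|A_1 u\|^r + \dots + \|A_n u\|^r)
% is unchanged under A_i \mapsto cA_i.  For R:
\begin{align*}
R(cA_1,\dots,cA_n)
  &= \frac{\lambda}{|D|}\sum_{(u,v)\in D}\bigl(\log\|A_v u\| + \log|c|\bigr)
     - \frac{\lambda}{n}\sum_{i=1}^{n}\bigl(\log\|A_i\|_q + \log|c|\bigr)\\
  &= R(A_1,\dots,A_n) + \lambda\log|c| - \lambda\log|c|
   \;=\; R(A_1,\dots,A_n).
\end{align*}
```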
Suppose now that we have obtained a tuple (A1,…,An) that maximizes the fitness F(A1,…,An). Let PV(n) denote the set of all probability vectors of length n. Then define an almost total function f:U→PV(n) by setting
$$f(u)=\frac{\bigl(\|A_1u\|^{r(1-p)},\dots,\|A_nu\|^{r(1-p)}\bigr)}{\|A_1u\|^{r(1-p)}+\dots+\|A_nu\|^{r(1-p)}}.$$
If (u,v) belongs to the training data set, then the i-th entry of f(u) is the machine learning model’s estimate of the probability that i=v. I will let the reader justify this calculation of the probabilities.
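One way to carry out this justification (a sketch of my own, under the simplifying assumption that for a fixed input u the label is drawn with probabilities q1,…,qn and the optimizer is free to choose the ratios ti=∥Aiu∥r/(∥A1u∥r+⋯+∥Anu∥r) for that input) is a Lagrange multiplier argument on the per-input contribution to G:

```latex
% Fix an input u.  Let q_i be the probability of label i and set
% t_i = \|A_i u\|^r / (\|A_1 u\|^r + \dots + \|A_n u\|^r), so t_1 + \dots + t_n = 1.
% The expected contribution of u to G is \sum_i q_i t_i^{p}.  Maximizing over the
% simplex with multiplier \mu gives  p\,q_i t_i^{p-1} = \mu  for each i, hence
% t_i \propto q_i^{1/(1-p)},  equivalently  q_i \propto t_i^{1-p} \propto \|A_i u\|^{r(1-p)}.
% Normalizing recovers the formula for f(u):
\[
  q_i \;=\; \frac{\|A_i u\|^{\,r(1-p)}}{\|A_1 u\|^{\,r(1-p)}+\dots+\|A_n u\|^{\,r(1-p)}}
        \;=\; f(u)_i .
\]
```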
We can generalize the function f to pseudodeterministically trained machine learning models with multiple layers by replacing the linear operators A1,…,An with non-linear or multi-linear operators. Actually, there are quite a few ways of generalizing the fitness function F, and I have taken some liberty in the exact formulation of F.
In addition to being pseudodeterministic, the fitness function F has other notable desirable properties. For example, when maximizing F using gradient ascent, one tends to converge to the local maximum at an exponential rate without needing to decay the learning rate.
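As an illustration of that last point (again my own sketch, not the author's code), plain gradient ascent on F with a constant learning rate could look as follows, reusing the hypothetical fitness function from the earlier sketch; the names train, dims_V, dim_U, and the default step count and learning rate are assumptions.

```python
import jax
import jax.numpy as jnp

def train(X, y, dims_V, dim_U, steps=2000, lr=1e-2, seed=0, **hyper):
    """Plain gradient ascent on F with a constant learning rate.  If training is
    pseudodeterministic, different seeds should reach the same maximizer up to
    symmetry (and tiny floating point differences)."""
    keys = jax.random.split(jax.random.PRNGKey(seed), len(dims_V))
    As = [jax.random.normal(k, (d, dim_U)) for k, d in zip(keys, dims_V)]
    grad_fn = jax.grad(lambda ops: fitness(ops, X, y, **hyper))
    for _ in range(steps):
        As = [A + lr * g for A, g in zip(As, grad_fn(As))]  # ascent step
    return As
```

Since F is 0-homogeneous, the tuple (A1,…,An) could also be rescaled after each step (say, to unit norm) without changing its fitness.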