I was thinking of constructing a deep neural network by interlacing ordinary linear layers x↦Ax+b with tropical layers x↦(C⊙x)⊕d, where ⊙ and ⊕ denote tropical (max-plus) matrix multiplication and addition. One can think of this as replacing the ReLU activation with the more general expression (x1,…,xn)↦max(a1+x1,…,an+xn,d)=a1⊗x1⊕⋯⊕an⊗xn⊕d. Someone has probably thought about this already and may even have run experiments with this approach.
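For concreteness, here is a minimal PyTorch sketch of such a tropical layer. The class name TropicalLayer and the initialization are my own choices rather than anything from an existing library; the layer computes y_i = max(max_j(C[i,j] + x[j]), d[i]), i.e. (C⊙x)⊕d.

```python
import torch
import torch.nn as nn

class TropicalLayer(nn.Module):
    """Max-plus layer: y_i = max( max_j (C[i,j] + x[j]), d[i] ), i.e. (C ⊙ x) ⊕ d."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.C = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.d = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # x: (batch, in_features); broadcast to form C[i,j] + x[j] for every pair (i, j)
        s = self.C.unsqueeze(0) + x.unsqueeze(1)             # (batch, out, in)
        return torch.maximum(s.max(dim=-1).values, self.d)   # tropical matmul, then ⊕ d
```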
One problem with this approach is that for an input x1,…,xn, there will typically be a unique j∈{0,…,n} with max(a1+x1,…,an+xn,d)=aj+xj (or, when j=0, max(a1+x1,…,an+xn,d)=d). This means that only one of the biases a1,…,an,d contributes to the output max(a1+x1,…,an+xn,d). After training, this is not a bad thing: it simply means that the tropical operation is sparse, and sparse matrices are good for saving space. During training, however, we may want to pump up the values of a1,…,an,d so that P(max(a1+x1,…,an+xn,d)=aj+xj)>α/(n+1) for every j, and at the end of training we can let these probabilities drop again so that we can reduce the number of weights in the tropical part of the network.
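The post does not say how the biases would be pumped up; one hypothetical way to encode the constraint P(max is attained at index j) > α/(n+1) during training is a differentiable penalty on a softened winning frequency. The softmax relaxation, the temperature tau, and the name argmax_usage_penalty below are my own assumptions, built on the TropicalLayer sketch above.

```python
def argmax_usage_penalty(layer, x, alpha=0.5, tau=0.1):
    """Penalize indices of a TropicalLayer whose softened winning frequency over a
    batch falls below alpha / (n + 1) -- a differentiable stand-in for requiring
    P(max is attained at index j) > alpha / (n + 1) for every j."""
    batch = x.shape[0]
    s = layer.C.unsqueeze(0) + x.unsqueeze(1)                             # (batch, m, n)
    s = torch.cat([s, layer.d.expand(batch, -1).unsqueeze(-1)], dim=-1)   # append d as the extra index
    n_plus_1 = s.shape[-1]
    soft = torch.softmax(s / tau, dim=-1).mean(dim=0)                     # (m, n+1) soft winning frequencies
    return torch.relu(alpha / n_plus_1 - soft).sum()
```

This penalty would be added to the training loss with some coefficient and simply dropped near the end of training, when sparsity becomes desirable again.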
Another possible issue with tropical matrices is that they may reduce the dimension of the data. Consider the tropical layer TC,d(x)=(C⊙x)⊕d where C is an m×n matrix. Then for almost all x, there will be a function fx:{1,…,m}→{1,…,n} where J(TC,d)(x)=(δfx(i),j)i,j (J stands for the Jacobian; rows where the bias d attains the maximum are zero instead). If the image Im(fx) is too small, then the tropical layer TC,d may throw away too much information from the vector x.
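One can monitor this directly on a batch. The helper below is a diagnostic sketch of my own, assuming the TropicalLayer class above; it reports the average fraction of input coordinates that some output coordinate actually selects, i.e. an estimate of |Im(fx)|/n.

```python
def image_fraction(layer, x):
    """Estimate |Im(f_x)| / n over a batch: the fraction of input coordinates that
    win the max for at least one output coordinate (coordinates where the bias d
    wins contribute nothing, since their Jacobian rows are zero)."""
    s = layer.C.unsqueeze(0) + x.unsqueeze(1)      # (batch, m, n)
    row_max, winners = s.max(dim=-1)               # winners[b, i] = f_x(i)
    d_wins = row_max < layer.d                     # output coordinates where d attains the max
    fracs = []
    for b in range(x.shape[0]):
        active = winners[b][~d_wins[b]]
        fracs.append(active.unique().numel() / x.shape[1])
    return sum(fracs) / len(fracs)
```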
One may also need a couple of other tricks to make sure that the tropical layers behave well. For example, one can start training with plain ReLU activations and, once training is going well, replace the ReLU activations with the more general tropical layers (a concrete warm start is sketched below). By starting off with ReLU, we can ensure that Im(fx) is large enough that the tropical layers do not destroy too much information. Since ReLU is a special case of a tropical layer, switching to tropical layers only enlarges the function class, so further training with them should at least not make the training loss worse.
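Concretely, the warm start can be done by initializing a square tropical layer so that it computes ReLU exactly: zeros on the diagonal of C, a very negative value (a finite stand-in for −∞) off the diagonal, and d = 0. The name init_as_relu and the constant below are illustrative, assuming the imports and TropicalLayer class from the sketch above.

```python
def init_as_relu(layer, neg=1e4):
    """Initialize a square TropicalLayer so that initially
    y_i = max(x_i, 0) = ReLU(x_i); `neg` is a finite stand-in for -infinity."""
    assert layer.C.shape[0] == layer.C.shape[1], "ReLU warm start needs a square layer"
    with torch.no_grad():
        layer.C.fill_(-neg)
        layer.C.fill_diagonal_(0.0)
        layer.d.zero_()
```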
Any ReLU MLP can be turned into a composition of ordinary linear layers with tropical linear layers. Is there any reason why this would not work? I think I should run a couple of experiments with tropical matrix operations in place of ReLU to see whether it works and what should be done to optimize neural networks formed by interlacing tropical matrix operations with ordinary matrix operations. In an MLP, most of the parameters are entries of the matrices that linearly transform one vector into another, but for some reason we do not load the activation layer with parameters.
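As a sketch of such an experiment, here is how the interlaced architecture could be assembled in PyTorch, with a tropical layer taking the place ReLU would normally occupy; the widths and the name TropicalMLP are illustrative, and the pieces are the sketches from above.

```python
class TropicalMLP(nn.Module):
    """Affine layers x ↦ Ax + b interlaced with tropical layers x ↦ (C ⊙ x) ⊕ d."""
    def __init__(self, dims=(784, 256, 256, 10)):
        super().__init__()
        blocks = []
        for n_in, n_out in zip(dims[:-1], dims[1:]):
            blocks.append(nn.Linear(n_in, n_out))
            blocks.append(TropicalLayer(n_out, n_out))
        self.net = nn.Sequential(*blocks[:-1])   # drop the tropical layer after the output

    def forward(self, x):
        return self.net(x)
```

For example, model = TropicalMLP() followed by model(torch.randn(32, 784)) gives outputs of shape (32, 10), and the tropical layers can be warm-started with init_as_relu before training.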
I do not think that I have much more knowledge of tropical geometry than the average mathematician, so I do not think I am too biased in favor of tropical geometry.