Work I’m doing at Redwood involves somewhat similar things.
Some observations, which you’re plausibly already aware of:
You could use geotorch for the parametrization; geotorch has now been ‘upstreamed’ into PyTorch as well (a minimal sketch follows this list).
It’s also possible to use the Q from the QR decomposition to accomplish this. This has some advantages for me (specifically, you can orthogonalize arbitrary unfolded tensors which are parameterized in factored form); however, I believe the gradients under SGD will be less well-behaved when using QR (see the QR sketch after this list).
Naively, there probably isn’t a better way to learn this than via gradient descent (possibly with better initialization, etc.). This is ‘just some random non-convex optimization problem’, so what could you hope for? If you minimize sparsity on a single input as opposed to on average, then it seems plausible to me that you could pick a sparsity criterion such that the problem can be optimized in a nicer way (but I’d also expect that minimizing sparsity on a single input isn’t really what you want). The averaged criterion is the one sketched below.
You could hope for more, even for a random non-convex optimization problem, if you can set up a tight relaxation. E.g., this paper gives you optimality bounds via a semidefinite relaxation, though I’m not sure it would scale to the size of problems relevant here.
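For concreteness, here is a minimal sketch of the upstreamed version (torch.nn.utils.parametrizations.orthogonal); geotorch.orthogonal(layer, "weight") is the pre-upstream equivalent. The dimension and the bias-free Linear layer here are illustrative, not the exact setup:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d = 64  # illustrative dimension

# Constrain the layer's weight to remain orthogonal throughout training;
# gradient steps on the underlying parameters stay on the manifold.
rotation = orthogonal(nn.Linear(d, d, bias=False))

Q = rotation.weight
print(torch.dist(Q.T @ Q, torch.eye(d)))  # ~0, i.e. Q is orthogonal
```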
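And a sketch of the QR route, using torch.linalg.qr (which is differentiable for full-rank inputs, so it can sit inside the forward pass). The sign correction is one common way to make Q unique, since QR is only defined up to per-column signs:

```python
import torch

def orthogonalize(W: torch.Tensor) -> torch.Tensor:
    # Map an arbitrary full-rank matrix to the Q factor of its QR decomposition.
    Q, R = torch.linalg.qr(W)
    # Fix the per-column signs so the map is well-defined.
    return Q * torch.sign(torch.diagonal(R))

W = torch.randn(64, 64, requires_grad=True)  # e.g. an unfolded tensor
loss = orthogonalize(W).abs().mean()
loss.backward()  # gradients flow back through the QR, though less nicely
```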
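Finally, an illustrative version of the ‘sparsity on average’ objective under discussion, combining the orthogonal parametrization above with an averaged L1 penalty. The activations, batch size, and exact criterion are stand-ins:

```python
import torch
from torch.nn.utils.parametrizations import orthogonal

d = 64
rot = orthogonal(torch.nn.Linear(d, d, bias=False))
opt = torch.optim.Adam(rot.parameters(), lr=1e-3)

acts = torch.randn(1024, d)  # stand-in for real activations
for _ in range(100):
    # L1 sparsity of the rotated activations, averaged over the batch
    # ('on average' rather than on any single input).
    loss = rot(acts).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```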
Interesting QR decomposition idea. I’m going to try using the Q as the initialization point of the rotation matrix, and see if this has any effect.
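Roughly what I have in mind, as a sketch rather than the exact setup: take the Q of some matrix’s QR decomposition as the starting point. Here I use a random Gaussian matrix for illustration, which gives a Haar-distributed orthogonal initialization after sign correction:

```python
import torch

d = 64  # illustrative dimension
A = torch.randn(d, d)
Q, R = torch.linalg.qr(A)
Q = Q * torch.sign(torch.diagonal(R))  # sign-corrected: Haar-distributed

# Learnable rotation matrix, initialized at an orthogonal point.
rotation = torch.nn.Parameter(Q)
```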