Insights from Linear Algebra Done Right

This book has previously been discussed by Nate Soares and TurnTrout. In this… review? report? detailed journal entry? … I will focus on the things which stood out to me. (Definitely not meant to be understandable for anyone unfamiliar with the field.) The book has ten chapters; I did most of the exercises in chapters 6 to 10. I got through that half of the book in about 3 weeks, which was nice since the previous (harder and longer) textbook on topology I worked through took me about a year.

I’ve previously taken two introductory courses on linear algebra and came out thinking of the field as large and very unstructured, with lots of concepts and theorems floating around disconnectedly. What is a normal matrix? I might or might not have remembered the definition, but I certainly had no idea what it was good for. Three weeks of intense study with this book has improved my understanding dramatically. Now, the field seems structured and pretty intuitive (and I certainly won’t forget what a normal operator is). It is truly hard to overstate how good of a job this book does teaching the subject, compared to what I’ve seen before. It’s probably the best textbook on anything I’ve read so far.

Chapter 1 introduces complex numbers and vector spaces.

Chapter 2 introduces finite-dimensional vector spaces. What stands out is just how nicely behaved they are. Every finite-dimensional real or complex vector space is isomorphic to either $\mathbb{R}^n$ or $\mathbb{C}^n$. Take any $m$ linearly independent vectors, and they span a subspace of dimension $m$. If they’re not linearly independent, then one vector is in the span of the previous vectors. Any list of $\dim V$ linearly independent vectors makes up a basis for the entire space. Basically every result one would want to hold when looking back onto the material does actually hold.

Like most fields in math, Linear Algebra ultimately cares about studying functions, and once again it only cares about a tiny subset of all possible functions. In this case it is linear maps, which are the focus of Chapter 3. These are functions $T : V \to W$ such that $T(u + v) = Tu + Tv$ and $T(\lambda v) = \lambda \, Tv$ for all $u, v \in V$ and $\lambda \in \mathbb{F}$, where $V$ and $W$ are vector spaces over some field $\mathbb{F}$ (in this book, always $\mathbb{R}$ or $\mathbb{C}$).
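As a tiny illustration (my own toy example, not from the book), here is a spot-check of the two defining properties for a concrete map from $\mathbb{R}^2$ to $\mathbb{R}^3$:

```python
import numpy as np

def T(v):
    """A linear map from R^2 to R^3: T(x, y) = (x + y, 2x, -y)."""
    x, y = v
    return np.array([x + y, 2 * x, -y])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
lam = 2.7

print(np.allclose(T(u + v), T(u) + T(v)))   # additivity:  T(u + v) = Tu + Tv
print(np.allclose(T(lam * v), lam * T(v)))  # homogeneity: T(lam*v) = lam*T(v)
```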

Brief comparison to topology: The word “continuity” is only mentioned in the final chapter of this book, but every linear map between finite-dimensional spaces can be shown to be continuous if the topologies on both vector spaces are given by an inner product. Actually, every inner product induces a norm, every norm induces a metric, and any metric induces a topology, but none of the reverse steps hold in general. Put differently, metric topologies are particularly nice topologies, metrics from norms are particularly nice metrics, norms from inner products are particularly nice norms, and inner products are the concept studied in chapters 6 and 7. So Linear Algebra only even bothers with the super nicely behaved stuff.
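Spelled out, the chain of induced structures looks like this (these are the standard constructions, nothing specific to the book):

$$\langle \cdot, \cdot \rangle \;\longrightarrow\; \|v\| := \sqrt{\langle v, v \rangle} \;\longrightarrow\; d(u, v) := \|u - v\| \;\longrightarrow\; \text{the topology generated by the balls } \{ v : d(u, v) < r \}.$$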

I’ll now detail my biggest confusion with LA before reading this book (this is something I’ve even brought up explicitly, but without getting a satisfying answer): if $T : V \to W$ is a linear map, then it is actually possible to determine its behavior entirely by fixing bases $(v_1, v_2)$ of $V$ and $(w_1, w_2, w_3)$ of $W$ and storing six numbers

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix},$$

which tell us that $Tv_1 = A_{1,1} w_1 + A_{2,1} w_2 + A_{3,1} w_3$ and $Tv_2 = A_{1,2} w_1 + A_{2,2} w_2 + A_{3,2} w_3$. (This determines the behavior of $T$ on all vectors because of linearity of $T$ and the fact that $(v_1, v_2)$ and $(w_1, w_2, w_3)$ are bases.) The matrix is dependent on the two bases chosen here (given these bases, the set of all linear maps from $V$ to $W$ is bijective to that of 3-by-2 matrices).
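A small numerical sketch of where those six numbers come from (the map $T$ and both bases are my own example, not the book’s): column $k$ of the matrix holds the coefficients needed to expand $Tv_k$ in the basis $w_1, w_2, w_3$.

```python
import numpy as np

def T(v):
    """A linear map from R^2 to R^3: T(x, y) = (x + y, 2x, -y)."""
    x, y = v
    return np.array([x + y, 2 * x, -y])

# A basis v_1, v_2 of R^2 and a basis w_1, w_2, w_3 of R^3 (not the standard ones).
V_basis = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
W = np.column_stack([[1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0]])   # columns are w_1, w_2, w_3

# Column k of the matrix of T solves  T(v_k) = A[0,k]*w_1 + A[1,k]*w_2 + A[2,k]*w_3.
A = np.column_stack([np.linalg.solve(W, T(v_k)) for v_k in V_basis])
print(A)   # the six numbers, arranged as a 3-by-2 matrix
```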

Linear maps aren’t relative to bases. Vectors were introduced as elements of $\mathbb{R}^n$, so not relative to bases either, but then they were introduced again in “coordinate form”, where we write $(x_1, x_2)_B$ ($B = (b_1, b_2)$ is our basis) to mean “$x_1$ times the first basis vector plus $x_2$ times the second,” and suddenly a dependence on our basis has been created. My confusion was then: is the $B$ in the index just a reminder that $(x_1, x_2)$ is understood to be relative to that basis, or is it a function that maps the vector $(x_1, x_2)$ to the vector $x_1 b_1 + x_2 b_2$? In the former case, how should one differentiate between coordinate vectors and non-coordinate vectors? (We’ve used the same notation.) And how do we specify the vectors in a basis? If I write $(1, 0)$ and that means “1 times the first basis vector, 0 times the second” while trying to specify my basis, then clearly that’s no good. And the latter case is even worse: now matrix-with-vector-multiplication is ill-defined. Consider the standard basis $E = (e_1, e_2)$ on $\mathbb{R}^2$ and the basis $B = (e_2, e_1)$ with both vectors reversed. Then we have

$$A \cdot (1, 0)_E = A \cdot e_1 = \text{the first column of } A.$$

But, clearly,

$$A \cdot (1, 0)_B = A \cdot e_2 = \text{the second column of } A,$$

even though $(1, 0)_E$ and $(1, 0)_B$ are the same list of numbers, so both expressions would be written $A \cdot (1, 0)$. So we have a flat-out contradiction. And it gets worse yet if $A$ is a matrix and one defines a linear map $T$ by $Tv := Av$, which was actually done in one of my past lectures.

So that was the problem. In contrast, here’s what the book does. Vectors are elements in $\mathbb{F}^n$, not relative to bases. Maps are functions $T : V \to W$, likewise not relative to bases. Matrices are a bunch of numbers put into an array, and one defines the matrix of a linear map $T$, denoted $\mathcal{M}(T)$, in the way I’ve described above (so relative to two bases). Most importantly, there is no matrix-vector multiplication. There is matrix-matrix multiplication, which is of course fine because both matrices are relative to bases if they come from linear maps. To “apply” a matrix to a vector, one defines, given a vector $v$ and a basis $(b_1, \dots, b_n)$ of $V$, the matrix of a vector as the unique $n$-by-1 matrix

$$\mathcal{M}(v) = \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix}$$

whose entries are the scalars needed to write $v$ as a sum of basis vectors, i.e. $v = c_1 b_1 + \cdots + c_n b_n$. Then one proves that $\mathcal{M}(T)\,\mathcal{M}(v) = \mathcal{M}(Tv)$. Note the square brackets: the book uses round parentheses for lists (lists are the same as tuples, i.e. elements of $\mathbb{F}^n$, and vectors are lists), and square brackets for matrices only, which is a solid convention. Does this approach resolve this problem to full satisfaction and without introducing any new difficulties? I think so.
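Continuing the sketch from above (same made-up map and bases): $\mathcal{M}(v)$ is the coefficient column of $v$ relative to a basis, and the identity $\mathcal{M}(T)\,\mathcal{M}(v) = \mathcal{M}(Tv)$ can be checked numerically.

```python
import numpy as np

def T(v):
    x, y = v
    return np.array([x + y, 2 * x, -y])

V_basis = np.column_stack([[1.0, 1.0], [1.0, -1.0]])      # columns: basis of R^2
W_basis = np.column_stack([[1.0, 0.0, 0.0],
                           [1.0, 1.0, 0.0],
                           [1.0, 1.0, 1.0]])              # columns: basis of R^3

def M_vec(v, basis):
    """M(v): the coefficient column of v relative to the given basis."""
    return np.linalg.solve(basis, v)

# M(T): column k is M(T v_k), taken relative to the basis of W.
M_T = np.column_stack([M_vec(T(V_basis[:, k]), W_basis) for k in range(2)])

v = np.array([4.0, -2.0])
print(np.allclose(M_T @ M_vec(v, V_basis), M_vec(T(v), W_basis)))   # True
```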

Chapter 4 introduces polynomials. It doesn’t have much LA in it, but understanding polynomials well is important for LA, because polynomials can be applied to operators (those are linear maps of the form $T : V \to V$, i.e. from a vector space into itself). This is so because, given $p(z) = a_0 + a_1 z + \cdots + a_m z^m$, we can define $p(T) := a_0 T^0 + a_1 T^1 + \cdots + a_m T^m$ if $T^k := T \circ \cdots \circ T$ ($k$ times) and $T^1 := T$ and $T^0 := I$ (the identity map).
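For instance (my own small example), applying $p(z) = z^2 - 3z + 2$ to an operator given by its matrix:

```python
import numpy as np

def p_of_T(T):
    """Apply p(z) = z^2 - 3z + 2 to an operator T, given as a square matrix."""
    I = np.eye(T.shape[0])
    return T @ T - 3 * T + 2 * I

T = np.array([[1.0, 1.0],
              [0.0, 2.0]])
print(p_of_T(T))
# Since p(z) = (z - 1)(z - 2) and the eigenvalues of this particular T are
# 1 and 2, p(T) happens to be the zero matrix here.
```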

The other special thing about the book – and the justification the author gives for calling it “Linear Algebra Done Right” – is that determinants aren’t introduced until the final chapter. This is very unusual: most courses, such as the lectures I took, frequently use determinants in proofs. But all of these proofs can be done without them. And they tend to be fairly simple, too; only a couple are longer than a page.

Doing this, one finds that there are striking parallels between properties of polynomials and properties of operators on vector spaces, which seems a lot better for developing a useful intuition than what one gets by doing determinant-based proofs. The central theorem about operators introduced in Chapter 5, for example, states that every operator on a finite-dimensional complex vector space has an eigenvalue (that is, a scalar $\lambda$ such that there exists a nonzero vector $v$ with $Tv = \lambda v$). This can be proven by essentially applying the fundamental theorem of algebra. One chooses any nonzero vector $v$ and constructs the list $(v, Tv, T^2 v, \dots, T^n v)$, where $n = \dim V$. These vectors cannot be linearly independent because there are $n + 1$ of them, so there exist scalars $a_0, \dots, a_n$, not all zero, such that $a_0 v + a_1 Tv + \cdots + a_n T^n v = 0$. Now (or rather, after doing some preliminary work) the fundamental theorem of algebra kicks in, and we can reformulate this as $c (T - \lambda_1 I) \cdots (T - \lambda_m I) v = 0$, where the $\lambda_j$ are the roots of the polynomial $a_0 + a_1 z + \cdots + a_n z^n$. Since this product sends the nonzero vector $v$ to $0$, one of the factors $(T - \lambda_j I)$ isn’t injective, and therefore $T$ has the respective $\lambda_j$ as an eigenvalue. This is done properly in the book in relatively little space. It also shows why the same theorem fails to hold on real vector spaces: the factorization of the polynomial into $c (z - \lambda_1) \cdots (z - \lambda_m)$ might no longer be possible.
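To see the real/complex difference concretely (my own example, not the book’s): a rotation of the plane by 90° sends no nonzero real vector to a real multiple of itself, so it has no real eigenvalue, but over $\mathbb{C}$ the guaranteed eigenvalues do show up.

```python
import numpy as np

# Rotation of R^2 by 90 degrees.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

print(np.linalg.eigvals(R))   # approximately [0.+1.j, 0.-1.j]: eigenvalues exist over C
# The relevant polynomial here is z^2 + 1, which has no real root --
# exactly the factorization step that fails over R.
```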

Chapter 6 introduces inner products. These are the generalized formalization underlying the concept of the angle between vectors. There is a famous inequality, called the Cauchy-Schwarz inequality, which states that $|\langle u, v \rangle| \le \|u\| \, \|v\|$. The book gives a very nice proof for this that I find easier to remember than the proofs I’ve seen previously. It is based on the orthogonal decomposition of a vector: if $u$ and $v$ are any nonzero vectors, then $u$ equals $cv + w$ for some vector $w$ and some scalar $c$ such that $w$ and $v$ are orthogonal to each other. I hadn’t heard of this before and it’s quite useful. If one remembers the formula for the orthogonal decomposition, plugs it into the inner product, and does the obvious things, the Cauchy-Schwarz inequality pops out at the other end.
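For reference, here is roughly how that computation goes (the standard argument; $c = \langle u, v \rangle / \|v\|^2$ is the unique scalar making $w := u - cv$ orthogonal to $v$):

$$u = \frac{\langle u, v \rangle}{\|v\|^2}\, v + w \quad \text{with } \langle w, v \rangle = 0, \qquad \text{so} \qquad \|u\|^2 = \frac{|\langle u, v \rangle|^2}{\|v\|^2} + \|w\|^2 \;\ge\; \frac{|\langle u, v \rangle|^2}{\|v\|^2},$$

and multiplying through by $\|v\|^2$ and taking square roots gives $|\langle u, v \rangle| \le \|u\| \, \|v\|$.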

The absolute nicest things ever studied anywhere are orthonormal bases. These are lists of vectors such that a) they are all linearly independent, b) they span the entire space (follows from a) if there are $\dim V$ many of them), c) they are all orthogonal to each other, and d) they all have length 1. Because this is LA, orthonormal bases always exist for finite-dimensional inner-product spaces, and there is even an explicit algorithm (Gram-Schmidt) to construct one from any existing non-orthonormal basis. Something really cool that the book does with this (which also requires a neat theorem about the minimal distance of a vector to a subspace) is to construct the polynomial of degree at most 5 that is most similar to the function $\sin$ on the interval $[-\pi, \pi]$ (based on some reasonable definition of similarity involving integrals). I say this is really cool because I found it to be a surprisingly strong result, and a surprising kind of result coming out of linear algebra, given that there is really nothing linear about the $\sin$ function. On the interval $[-\pi, \pi]$, this approximation looks indistinguishable from $\sin$ to the naked eye.
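Out of curiosity, here is a minimal numerical stand-in for that construction (the book computes the projection exactly with the integral inner product; this sketch just does a discrete least-squares fit on a fine grid, which approximates the same degree-5 projection — the grid size and the numpy routines are my choices):

```python
import numpy as np

# Dense grid on [-pi, pi]; a least-squares polynomial fit on this grid
# approximates the projection of sin onto the degree-<=5 polynomials
# with respect to the inner product <f, g> = integral of f*g over [-pi, pi].
x = np.linspace(-np.pi, np.pi, 10_001)
coeffs = np.polynomial.polynomial.polyfit(x, np.sin(x), deg=5)

p = np.polynomial.polynomial.polyval(x, coeffs)
print("max |sin(x) - p(x)| on [-pi, pi]:", np.max(np.abs(np.sin(x) - p)))
# The error is tiny across the whole interval, which is why a plot of p
# is visually indistinguishable from sin.
```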

Back to orthonormal bases: what their existence means is that, whenever one tries to prove anything involving inner product spaces, one can simply say something like, “let $e_1, \dots, e_n$ be an orthonormal basis for $V$ and let $f_1, \dots, f_m$ be an orthonormal basis for $W$”, and then proceed to reduce whatever argument one needs to the basis elements. This feels very powerful, even if arbitrarily difficult problems exist regardless.

With Chapter 7, we’re getting to the big classification theorems. LA wants to study linear maps, and in this chapter it classifies all of the operators such that there exists an orthonormal basis relative to which they have diagonal form. This is the well-known spectral theorem. These are the normal operators on complex vector spaces, and the self-adjoint operators (which is a strictly stronger property) on real vector spaces. This highlights another recurring pattern: there are often two versions of each result, one super nice one about complex vector spaces, and a less nice, more cumbersome one about real vector spaces. This makes a lot of sense given that $\mathbb{C}$ is far more nicely behaved than $\mathbb{R}$ (this is kind of why we defined complex numbers in the first place). More specifically, it usually comes down to the fact that every polynomial of degree $\ge 1$ has a root in $\mathbb{C}$, but not necessarily in $\mathbb{R}$, just like it did in the aforementioned result on the existence of eigenvalues. The book also draws an interesting parallel between self-adjoint operators and real numbers.
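To make “normal” concrete with a small numpy check (the 90° rotation from the earlier sketch — my example, not the book’s — is normal but not self-adjoint, and over $\mathbb{C}$ it is diagonalized by an orthonormal basis of eigenvectors):

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])          # rotation by 90 degrees
A_star = A.conj().T

print("normal:      ", np.allclose(A @ A_star, A_star @ A))   # True
print("self-adjoint:", np.allclose(A, A_star))                # False

eigenvalues, V = np.linalg.eig(A)    # eigenvalues are +i and -i
print("columns of V orthonormal:", np.allclose(V.conj().T @ V, np.eye(2)))
print("diagonal form:", np.allclose(V.conj().T @ A @ V, np.diag(eigenvalues)))
```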

In Chapters 8 and 9, we return to the study of eigenvalues. Basically, the problem is as follows: given an operator $T : V \to V$, one would like to find a decomposition of $V$ into smaller subspaces $U_1, \dots, U_m$ such that a) each vector $v \in V$ can be uniquely written as a sum of vectors from the $U_i$, i.e. $v = u_1 + \cdots + u_m$ with $u_i \in U_i$, and b) the operator maps each subspace into itself, i.e. $Tu \in U_i$ for all $u \in U_i$. If this is achieved, then the behavior of $T$ is determined entirely by the behavior of the $T|_{U_i}$ (i.e. $T$ restricted to $U_i$); and this is great because the $T|_{U_i}$ are much easier to handle than $T$. In fact, if the $U_i$ are one-dimensional vector spaces (i.e. just straight lines, if interpreted geometrically), then $T|_{U_i}$ just multiplies each vector with a scalar. In other words, $Tv = \lambda v$ for all $v \in U_i$. This is of course just the equation that says that $\lambda$ is an eigenvalue of $T$ with nonzero eigenvector $v$, and so that’s the reason why eigenvalues are of interest. Conversely, given an eigenvalue $\lambda$ with eigenvector $v$, the corresponding one-dimensional subspace is simply $\{cv : c \in \mathbb{F}\}$.
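A concrete instance of such a decomposition (my own example): the operator below leaves two lines invariant, and its behavior is fully described by what it does on each line.

```python
import numpy as np

# Eigenvalue 2 with eigenvector (1, 0), eigenvalue 3 with eigenvector (1, 1).
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

u1 = np.array([1.0, 0.0])      # spans U_1
u2 = np.array([1.0, 1.0])      # spans U_2
print(A @ u1)                  # (2, 0) = 2 * u1, so A maps U_1 into itself
print(A @ u2)                  # (3, 3) = 3 * u2, so A maps U_2 into itself
# Every v in R^2 splits uniquely as v = a*u1 + b*u2, so A is completely
# described by the two scalars 2 and 3.
```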

As stated before, every operator on a complex vector space has some eigenvalue. But does it have $n = \dim V$ different eigenvalues, so that $V = U_1 \oplus \cdots \oplus U_n$? (The symbol $\oplus$ means “direct sum”.) In “most” cases, the answer is yes – by which I mean that, if one generates an operator randomly (by sampling the coefficients of its matrix relative to the standard basis randomly out of some large finite set), then as one increases the size of that set (for example by switching from 32-bit to 64-bit to 128-bit floating-point arithmetic and so on), the probability of generating an operator for which this doesn’t work converges to 0. And I believe it is also possible to define a non-discrete probability space such as $[0, 1]^{n \times n}$, give it the uniform distribution, and then prove that the set of operators for which this doesn’t work has probability mass 0.
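A quick numerical sanity check of the “most cases” claim (my own sketch; the sampling scheme, the matrix size, and the tolerance for calling two eigenvalues “equal” are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, tol = 4, 10_000, 1e-9
repeated = 0

for _ in range(trials):
    A = rng.uniform(-1.0, 1.0, size=(n, n))   # random matrix wrt the standard basis
    eigenvalues = np.linalg.eigvals(A)
    # Pairwise distances between eigenvalues, with the zero diagonal masked out.
    gaps = np.abs(eigenvalues[:, None] - eigenvalues[None, :]) + np.eye(n)
    if np.min(gaps) < tol:                    # two eigenvalues (numerically) coincide
        repeated += 1

print("fraction with repeated eigenvalues:", repeated / trials)   # expected: 0.0
```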

So for “typical” operators, this is always true, but there do exist specific operators for which it isn’t. One such example is the operator $T$ on $\mathbb{F}^2$ whose matrix wrt the standard basis is

$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$

This operator has the eigenvalue $1$, but the only eigenvectors are of the form $(x, 0)$. Thus there exists one one-dimensional subspace that is invariant under $T$, namely $\{(x, 0) : x \in \mathbb{F}\}$, but not a second one, so there isn’t any decomposition of $\mathbb{F}^2$ into two invariant one-dimensional subspaces.
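In numpy terms: the eigenvalue $1$ has multiplicity two here, but its eigenspace is only one-dimensional, so no basis of eigenvectors exists (the check below is my own illustration).

```python
import numpy as np

T = np.array([[1.0, 1.0],
              [0.0, 1.0]])

print(np.linalg.eigvals(T))                  # [1., 1.]
# The eigenspace for 1 is the null space of (T - I); since T - I has rank 1,
# that eigenspace is only one-dimensional: span{(1, 0)}.
print(np.linalg.matrix_rank(T - np.eye(2)))  # 1
```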

To remedy this, Chapter 8 introduces generalized eigenvectors. Observe that the property $Tv = \lambda v$ can also be written as $(T - \lambda I)v = 0$ (where $I$ is the identity map), or as $(T - \lambda I)|_U = 0$, where $U = \{cv : c \in \mathbb{F}\}$, i.e. $U$ is the one-dimensional vector space containing $v$. The equation above then says that, if we take $T$, restrict it to $U$ and subtract $\lambda$ times the identity operator, we get the zero operator (so the 0 on the right denotes the null map from $U$ to $V$). This reformulation leads to the non-obvious definition of a generalized eigenvector: given an eigenvalue $\lambda$ of $T$ with corresponding one-dimensional subspace $U$, a generalized eigenvector of $T$ is a vector $v$ such that $(T - \lambda I)^k v = 0$ for some $k \in \mathbb{N}$. Which is to say, subtracting $\lambda$ times the identity operator from $T$ doesn’t have to yield the null operator (on the span of $v$) immediately, but it does have to yield the null operator there if the resulting operator is applied several times.

In our example above, only vectors of the form $(x, 0)$ are eigenvectors, but all vectors are generalized eigenvectors, because $T - 1 \cdot I$ is the operator whose matrix wrt the standard basis is

$$\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix},$$

and this operator equals the null operator if it is applied twice, i.e. $(T - I)^2 = 0$.
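Verifying this numerically (same matrix as above):

```python
import numpy as np

T = np.array([[1.0, 1.0],
              [0.0, 1.0]])
N = T - np.eye(2)      # T minus 1 times the identity

print(N @ N)                          # the zero matrix: (T - I)^2 = 0
# Hence (T - I)^2 v = 0 for every v, so every vector of F^2 is a generalized
# eigenvector for the eigenvalue 1, even though only vectors of the form
# (x, 0) are ordinary eigenvectors.
v = np.array([3.0, 5.0])
print(np.allclose(N @ (N @ v), 0))    # True
```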

If all of the above is done properly, then one gets something that comes very close to the perfect decomposition for any operator on a complex vector space. This is the theory behind the famous Jordan form for matrices: it is not quite a diagonal matrix (which corresponds to the perfect decomposition), but it almost is. It just has a bunch of additional 1s just above the diagonal to deal with the generalized eigenvectors.
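For completeness, the Jordan form of a small example can be computed symbolically (this uses sympy’s jordan_form; the matrix below is my own construction with a known block structure, not an example from the book):

```python
import sympy as sp

# Build a matrix with known Jordan structure: a 2x2 block for eigenvalue 2
# and a 1x1 block for eigenvalue 3, conjugated by an invertible P.
J_true = sp.Matrix([[2, 1, 0],
                    [0, 2, 0],
                    [0, 0, 3]])
P_true = sp.Matrix([[1, 2, 0],
                    [0, 1, 1],
                    [1, 0, 1]])
A = P_true * J_true * P_true.inv()

P, J = A.jordan_form()                    # A == P * J * P**-1
print(J)                                  # recovers the 2x2 block for 2 and the 1x1 block for 3
print(sp.simplify(A - P * J * P.inv()))   # the zero matrix
```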

Chapter 9 then does a similar thing for real vector spaces. As always, it’s more cumbersome and the result is weaker, but it’s still quite strong.

At last, Chapter 10 introduces determinants, and it’s actually kind of boring! All of the interesting LA has already been done, so this chapter feels like a mere afterthought. Again, one doesn’t seem to need determinants to do basic LA.


The next big item on my list is Understanding Machine Learning, which is also from MIRI’s guide. I’ve so far neglected trying to get familiar with actual AI research, and it’s time to remedy that, or at least get a solid grasp of the basics.