Epistemic status: A brisk walkthrough of (what I take to be) the highlights of this book’s contents.

The big one for mathematically understanding ML!

The idea responsible for getting me excited about linear algebra is:

Linear maps are the homomorphisms between vector spaces.

Linear algebra is about the tripartite relationship between (1) homomorphisms^[1] between vector spaces, (2) sets of equations, and (3) grids of numbers.

However, grids of numbers (‘matrices’), the usual star of the show in a presentation of linear algebra, aren’t foregrounded in this book. Instead, this is a book chiefly treating the homomorphisms (‘linear maps’) themselves, directly.

Contents and Notes

1. Vector Spaces

Vector spaces are fairly substantial mathematical structures, if you’re pivoting out of thinking about set theory! Intuitively, a vector space is a space $R^{n}$ for which (1) ray addition and (2) scaling rays (emanating from the origin out to points)^[2] are both nicely defined.

Precisely, a vector space is a set $V$ defined over a field $F$ ^[3] in which

V is closed under vector addition, and vector addition is commutative, associative, there is an additive identity $\to 0$ , and there is an additive inverse for every vector $\to v \in V$ ;
V is closed under scalar multiplication, scalar multiplication is associative, and there is a multiplicative identity $1$ ;
and vector addition and scalar multiplication are connected by distribution such that, for all $a, b \in F$ and $\to v, \to x \in V$ ,^[4]

$a (\to v + \to x) = a \to v + a \to x$

$(a + b) \to v = a \to v + b \to v$
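
The two distributive laws can be spot-checked numerically on $R^{2}$ . This is only a minimal Python sketch, not a proof: the helper names `add` and `scale` are my own, and I use exact `Fraction` arithmetic so that the equality tests aren't disturbed by floating-point rounding.

```python
from fractions import Fraction

def add(v, x):
    """Vector addition, entry by entry."""
    return [vi + xi for vi, xi in zip(v, x)]

def scale(a, v):
    """Scalar multiplication of a vector by a field element."""
    return [a * vi for vi in v]

# Arbitrary sample scalars and vectors (chosen only for illustration).
a, b = Fraction(3, 2), Fraction(-4)
v = [Fraction(1), Fraction(2)]
x = [Fraction(5), Fraction(-7, 3)]

# a(v + x) = av + ax
left_1 = scale(a, add(v, x))
right_1 = add(scale(a, v), scale(a, x))

# (a + b)v = av + bv
left_2 = scale(a + b, v)
right_2 = add(scale(a, v), scale(b, v))

print(left_1 == right_1, left_2 == right_2)  # True True
```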

A subspace $S$ of a vector space $V$ is any subset $S \subset V$ that is still itself a vector space, under the same two operations of $V$ . Vector spaces can be decomposed into their subspaces, where you think of adding vectors drawn from the different subspaces via their common addition operation.

2. Finite-Dimensional Vector Spaces

You live at the origin of $R^{3}$ , and your tools are the vectors that emanate out from your home. Because we have both vector addition and scalar multiplication, we have two ways of extending (or shortening) any single vector out from the origin arbitrarily far. If we’re interested in reaching points in $R^{3}$ , one immediate way to get to points we didn’t have a vector directly to… is by extending a too-short vector pointed in the right direction! Furthermore, because we can always multiply a vector by $- 1$ to reverse its direction, both the exactly right and exactly wrong directions will suffice to reach out and touch a point in $R^{3}$ .

We can also use vector addition to add two vectors pointing off in differing directions (directions which aren’t exact opposites). If we have vectors $\to v = [0.5, 0, 0]^{T}$ , $\to x = [0, 45, 0]^{T}$ , and $\to q = [0, 0, 0.11]^{T}$ ,^[5] we have all the tools we need to produce any vector in $R^{3}$ ! The awkward lengths of all the vectors are irrelevant, because we can scale all of them arbitrarily. We use some amount of vertical, horizontal, and $z$ -dimensional^[6] displacement to get to anywhere via addition and multiplication! More formally, we say that the set $\{\to v, \to x, \to q\}$ spans $R^{3}$ .
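
To make the spanning claim concrete, here is a small Python sketch that writes an arbitrary target in $R^{3}$ as a combination of $\to v$ , $\to x$ , and $\to q$ . Because each of these three vectors points along a single axis, the coefficients can be read off by division; that shortcut is specific to this example and is not a general method. The function names `coefficients_for` and `combine` are hypothetical.

```python
from fractions import Fraction

v = [Fraction(1, 2), 0, 0]       # 0.5 along the first axis
x = [0, Fraction(45), 0]         # 45 along the second axis
q = [0, 0, Fraction(11, 100)]    # 0.11 along the third axis

def coefficients_for(target):
    """Return (a, b, c) with a*v + b*x + c*q == target.

    Works only because v, x, q each lie along one coordinate axis.
    """
    return (target[0] / v[0], target[1] / x[1], target[2] / q[2])

def combine(a, b, c):
    """Form the linear combination a*v + b*x + c*q."""
    return [a * vi + b * xi + c * qi for vi, xi, qi in zip(v, x, q)]

target = [Fraction(-3), Fraction(9), Fraction(22)]
a, b, c = coefficients_for(target)
print(a, b, c)                       # -6 1/5 200
print(combine(a, b, c) == target)    # True
```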

Intuitively, a minimal spanning set is called a basis for a vector space. $\{\to v, \to x, \to q\}$ is a basis for the vector space $R^{3}$ , because none of the vectors are “redundant”: you could not produce every vector in $R^{3}$ without all three elements in $\{\to v, \to x, \to q\}$ . If you added any further vector to that spanning set, though, the set would now have a redundant vector, as $R^{3}$ is already spanned. The set would no longer be a minimal spanning set in this sense, and so would cease to be a basis for $R^{3}$ .

Every finite-dimensional, nonzero^[7] vector space containing infinitely many vectors has infinitely many bases (pp. 29-32). Each basis for an $n$ -dimensional vector space is a set containing $n$ vectors, where each vector is an ordered set containing $n$ numbers drawn from $F$ (p. 32).

3. Linear Maps

Intuitively, a linear map is a function that translates addition and multiplication between two vector spaces.

Formally, a linear map $f : V \to W$ is a function from a vector space $V$ to a vector space $W$ (taking vectors and returning vectors) such that

$f (\to v + \to x) = f (\to v) + f (\to x)$

$f (a \to v) = a (f (\to v))$

for all $\to v, \to x \in V$ ; all $f (\to v), f (\to x) \in W$ ; and all $a \in F$ . Note that both are homomorphism properties: one for addition across vector spaces and one for multiplication across vector spaces! We’ll call the former relationship additivity, the latter, homogeneity.
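
Additivity and homogeneity are easy to spot-check in code. The sketch below is a hedged illustration in Python: `looks_linear`, `mix`, and `shift` are names I made up, and passing the check on a handful of samples is evidence of linearity, not a proof of it.

```python
def add(v, x):
    """Vector addition, entry by entry."""
    return [vi + xi for vi, xi in zip(v, x)]

def scale(a, v):
    """Scalar multiplication."""
    return [a * vi for vi in v]

def looks_linear(f, samples, scalars):
    """Spot-check additivity and homogeneity of f on the given samples."""
    additive = all(f(add(v, x)) == add(f(v), f(x))
                   for v in samples for x in samples)
    homogeneous = all(f(scale(a, v)) == scale(a, f(v))
                      for a in scalars for v in samples)
    return additive and homogeneous

def mix(v):
    """A map R^2 -> R^2 sending (v1, v2) to (2*v1 - v2, v1 + 3*v2)."""
    return [2 * v[0] - v[1], v[0] + 3 * v[1]]

def shift(v):
    """Moves every vector one step sideways, so the origin is not fixed."""
    return [v[0] + 1, v[1]]

samples = [[1, 0], [0, 1], [2, -5], [-3, 4]]
scalars = [0, 1, -2, 7]
print(looks_linear(mix, samples, scalars))    # True
print(looks_linear(shift, samples, scalars))  # False
```

The `shift` function fails because it does not send the zero vector to the zero vector, which any linear map must do.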

The symbol $L (V, W)$ stands for the set of all the linear maps from $V$ to $W$ .^[8]

Some example linear maps (pp. 38-9) include:

$f_{1} (\to v) = 0 \to v$

$f_{2} (\to v) = \to v$

When the vector spaces are specifically the set of all real-valued polynomials $p (x)$ :^[9]

$f_{3} (\to p) = \frac{d p (x)}{d x}$

$f_{4} (\to p) = \int p (x) d x$

translating between $\to p$ and $p (x)$ .
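
Here is a rough Python sketch of differentiation acting on coefficient vectors. It assumes a polynomial is stored as a list `[c0, c1, c2, ...]` with `c_k` the coefficient of $x^{k}$ ; that storage convention and the function names are my own choices for illustration.

```python
def derivative(p):
    """Differentiate a polynomial stored as coefficients [c0, c1, c2, ...].

    d/dx sends c_k * x**k to k * c_k * x**(k - 1), so the constant term drops out.
    """
    return [k * c for k, c in enumerate(p)][1:] or [0]

def add(p, r):
    """Add two coefficient lists of equal length."""
    return [pi + ri for pi, ri in zip(p, r)]

def scale(a, p):
    """Multiply every coefficient by the scalar a."""
    return [a * pi for pi in p]

p = [1, 0, 3, 2]    # 1 + 3x^2 + 2x^3
r = [4, -1, 0, 5]   # 4 - x + 5x^3

print(derivative(p))   # [0, 6, 6], i.e. 6x + 6x^2
# Additivity: differentiating a sum equals summing the derivatives.
print(derivative(add(p, r)) == add(derivative(p), derivative(r)))   # True
# Homogeneity: scalars pull out of the derivative.
print(derivative(scale(7, p)) == scale(7, derivative(p)))           # True
```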

As linear maps are functions, they can be composed when they have matching domains and co-domains, giving us our notion of products between linear maps.

The kernel of a linear map $f \in L (V, W)$ is the subset $ker (f) \subset V$ containing all and only the vectors $\to v \in V$ that $f$ maps to $\to 0 \in W$ . Note that linear maps can only “get rid” of vectors by shrinking them down all the way, i.e., by sending them to $\to 0$ . If a function between vector spaces simply sent everything to a nonzero vector, it would violate the linear map axioms! All kernels are subspaces of $V$ (p. 42). A linear map is injective whenever $ker (f) = \{\to 0\}$ (p. 43).

The image $im (f)$ of $f$ is the subset of $W$ covered by some $f (\to v)$ . All images are subspaces of $W$ (p. 44). A linear map is obviously surjective whenever $im (f) = W$ .

The Matrix of a Linear Map

A matrix $M$ is an array of numbers, with $m$ rows and $n$ columns:

$M = \begin{bmatrix} a_{1, 1} & \dots & a_{1, n} \\ \vdots & \ddots & \vdots \\ a_{m, 1} & \dots & a_{m, n} \end{bmatrix}$

(Matrices are a generalization of vectors into the horizontal dimension, and vectors can be thought of as skinny $m$ -by- $1$ matrices.)

Let $f \in L (V, W)$ . Suppose that $[{\to v}_{1}, \dots, {\to v}_{n}]^{T}$ is a basis of $V$ and $[{\to w}_{1}, \dots, {\to w}_{m}]^{T}$ is a basis of $W$ . For each $k = 1, \dots, n$ , we can write $f ({\to v}_{k})$ uniquely as a linear combination of the $w$ ’s:
$f ({\to v}_{k}) = a_{1, k} {\to w}_{1} + \dots + a_{m, k} {\to w}_{m}$
where $a_{j, k} \in F$ for $j = 1, \dots, m$ . The scalars $a_{j, k}$ completely determine the linear map $f$ because a linear map is determined by its values on a basis. The $m$ -by- $n$ matrix $M$ formed by the $a$ ’s is called the matrix of $f$ with respect to the bases $[{\to v}_{1}, \dots, {\to v}_{n}]^{T}$ and $[{\to w}_{1}, \dots, {\to w}_{m}]^{T}$ ; we denote it by
$M (f, [{\to v}_{1}, \dots, {\to v}_{n}]^{T}, [{\to w}_{1}, \dots, {\to w}_{m}]^{T})$
…
If you think of elements of $F^{m}$ as columns of $m$ numbers, then you can think of the $k$ th column of $M (f)$ as $f$ applied to the $k$ th basis vector (pp. 48-9; notation converted to our own.)

In coordinates, $f (\to v) = M (f) \to v$ , where the right side of the equation is matrix multiplication (pp. 53-4).
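
The “column $k$ is $f$ applied to the $k$ th basis vector” recipe can be sketched in a few lines of Python. This assumes the standard bases on both sides, and the example map `f` and the helper names `matrix_of` and `apply` are hypothetical, chosen only for illustration.

```python
def f(v):
    """An example linear map R^3 -> R^2 (chosen only for illustration)."""
    return [v[0] + 2 * v[1], 3 * v[2] - v[1]]

def matrix_of(f, n):
    """Matrix of f with respect to the standard bases.

    Column k is f applied to the k-th standard basis vector of F^n.
    """
    columns = []
    for k in range(n):
        e_k = [1 if i == k else 0 for i in range(n)]
        columns.append(f(e_k))
    m = len(columns[0])
    # Transpose the list of columns into a list of rows.
    return [[columns[k][j] for k in range(n)] for j in range(m)]

def apply(M, v):
    """Matrix-vector product M v."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in M]

M = matrix_of(f, 3)
v = [4, -1, 2]
print(M)                     # [[1, 2, 0], [0, -1, 3]]
print(apply(M, v) == f(v))   # True
```

Note the shape: a map from a $3$ -dimensional space to a $2$ -dimensional space gets a $2$ -by- $3$ matrix.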

4. Polynomials

5. Eigenvalues and Eigenvectors

We now begin our study of operator theory!

Any vector space $V$ we discuss from here on out will be neither the zero vector space $\{\to 0\}$ nor an infinite-dimensional vector space.

Operators are linear maps from $V$ to itself. Notationally, $L (V) := L (V, V)$ .

We call a subspace $S \subset V$ invariant under $f \in L (V)$ if, for all $\to s \in S$ ,

$f (\to s) \in S$

Now fix any nonzero $\to v \in V$ and let $S$ be the set of all scalar multiples of $\to v$ :

$S = \{a \to v : a \in F\}$

Then $S$ is a one-dimensional subspace of $V$ , and every one-dimensional subspace of $V$ is of this form. If $\to v \in V$ and the subspace $S$ defined by $\{a \to v : a \in F\}$ is invariant under $f \in L (V)$ , then $f (\to v)$ must be in $S$ , and hence there must be a scalar $λ \in F$ such that $f (\to v) = λ \to v$ . Conversely, if $\to v$ is a nonzero vector in $V$ such that $f (\to v) = λ \to v$ for some $λ \in F$ , then the subspace $S$ defined by $\{a \to v : a \in F\}$ is a one-dimensional subspace of $V$ invariant under $f$ (p. 77; notation converted).

In the above equation $f (\to v) = λ \to v$ , the scalar $λ$ is called an eigenvalue of $f$ , and the corresponding vector $\to v$ is called an eigenvector of $f$ .

Because $f (\to v) = λ \to v$ is equivalent to $(f - λ I) \to v = 0$ ,^[10] we see that the set of eigenvectors of $f$ corresponding to $λ$ equals $ker (f - λ I)$ . In particular, the set of eigenvectors of $f$ corresponding to $λ$ is a subspace of $V$ .
[For example,] if $a \in F$ , then $a I$ has only one eigenvalue, namely, $a$ , and every vector is an eigenvector for this eigenvalue (pp. 77-8; notation converted).
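
A quick numerical illustration of the eigenvalue equation $f (\to v) = λ \to v$ , sketched in Python with operators given as small integer matrices. The matrix `A` and the helper `is_eigenpair` are my own examples; the helper also insists that the vector be nonzero, which is a convention of this sketch.

```python
def apply(M, v):
    """Matrix-vector product M v."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in M]

def is_eigenpair(M, lam, v):
    """True when v is nonzero and M v == lam * v."""
    return any(v) and apply(M, v) == [lam * vi for vi in v]

A = [[2, 1],
     [0, 3]]

print(is_eigenpair(A, 3, [1, 1]))   # True: A sends [1, 1] to [3, 3]
print(is_eigenpair(A, 3, [5, 5]))   # True: scalar multiples stay in the eigenspace
print(is_eigenpair(A, 3, [1, 0]))   # False: [1, 0] belongs to the eigenvalue 2

# a*I has the single eigenvalue a, and every nonzero vector works.
aI = [[4, 0],
      [0, 4]]
print(all(is_eigenpair(aI, 4, v) for v in ([1, 0], [0, 1], [3, -7])))  # True
```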

Polynomials Applied to Operators

The main reason that a richer theory exists for operators… than for linear maps is that operators can be raised to powers (p. 80).

An operator raised to a power $m$ is just that operator composed with itself $m$ times.

Because we have a notion of functional products, functional sums, and now operators raised to powers, we can now construct arbitrary polynomials with operators as the variables!
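
As a sketch of what “a polynomial with an operator as the variable” means, the Python below evaluates $p (A) = c_{0} I + c_{1} A + c_{2} A^{2} + \dots$ for an operator written as a square matrix. The function names are hypothetical, and repeated multiplication is used for clarity rather than efficiency.

```python
def matmul(A, B):
    """Matrix product, i.e. composition of the corresponding operators."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(n):
    """The n-by-n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def power(A, m):
    """A raised to the m-th power, with A^0 taken to be the identity."""
    result = identity(len(A))
    for _ in range(m):
        result = matmul(A, result)
    return result

def poly_of_operator(coeffs, A):
    """Evaluate c0*I + c1*A + c2*A^2 + ... for coeffs = [c0, c1, c2, ...]."""
    n = len(A)
    total = [[0] * n for _ in range(n)]
    for k, c in enumerate(coeffs):
        A_k = power(A, k)
        total = [[total[i][j] + c * A_k[i][j] for j in range(n)]
                 for i in range(n)]
    return total

A = [[1, 2],
     [0, 3]]
# p(x) = 2 - x + x^2
print(poly_of_operator([2, -1, 1], A))   # [[2, 6], [0, 8]]
```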

Upper-Triangular Matrices

A square matrix is an $m \times m$ matrix.

An upper-triangular matrix is a square matrix for which all entries under the principal diagonal equal $0$ .

Every operator on a finite-dimensional, nonzero, complex vector space has an eigenvalue (p. 81).
…
Suppose $V$ is a complex vector space and $f \in L (V)$ . Then $f$ has an upper-triangular matrix with respect to some basis of $V$ (p. 84; notation converted).
…
Suppose $f \in L (V)$ has an upper-triangular matrix with respect to some basis of $V$ . Then the eigenvalues of $f$ consist precisely of the entries on the diagonal of that upper-triangular matrix (p. 86; notation converted).
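
The claim about diagonal entries can be spot-checked in Python. The sketch relies on the standard fact that $λ$ is an eigenvalue exactly when $f - λ I$ fails to be invertible, and it tests invertibility with a determinant. That is a convenience of this sketch, not the route the book takes, and the matrix `T` is an arbitrary example of mine.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (fine for small M)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def shifted(M, lam):
    """Return M - lam * I."""
    n = len(M)
    return [[M[i][j] - (lam if i == j else 0) for j in range(n)]
            for i in range(n)]

T = [[2, 5, -1],
     [0, 7, 4],
     [0, 0, -3]]

diagonal = [T[i][i] for i in range(3)]
# Each diagonal entry makes T - lam*I singular, so it is an eigenvalue.
print([det(shifted(T, lam)) for lam in diagonal])   # [0, 0, 0]
# A value that is not on the diagonal leaves T - lam*I invertible.
print(det(shifted(T, 1)))                           # -24
```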

Diagonal Matrices

A diagonal matrix is a square matrix for which all entries off the principal diagonal equal $0$ .

If $f \in L (V)$ has $dim V$ ^[11] distinct eigenvalues, then $f$ has a diagonal matrix with respect to some basis of $V$ (p. 88; notation converted).

6. Inner-Product Spaces

For $\to x, \to y \in R^{n}$ , the dot product of $\to x$ and $\to y$ , denoted by $\to x \cdot \to y$ , is defined by
$\to x \cdot \to y = x_{1} y_{1} + \dots + x_{n} y_{n}$

where $x_{n}$ is the $n$ th entry in $\to x$ , and similarly for $y_{n}$ and $\to y$ (p. 98; notation converted).

Inner products are just a generalization of dot products to arbitrary vector spaces $V$ . (With some finagling, both dot products and inner products generally can be interpreted as linear maps.) An inner-product space is an ordered set containing a vector space $V$ and an inner product on it.

Intuitively, the norm of a vector is the length of that vector, interpreted as a ray, from the origin to its tip. More formally, the norm of a vector $\to v$ in an inner-product space is defined to be the square root of the inner product of that vector $\to v$ with itself:

$\| \to v \| := \sqrt{\to v \cdot \to v}$

Note that this looks just like $c = \sqrt{a^{2} + b^{2}}$ , the Pythagorean theorem for the sides $a, b, c$ of a right triangle in Euclidean space. That’s because other inner products on other vector spaces are meant to allow for a generalization of the Pythagorean theorem in those vector spaces!

Intuitively, two vectors are orthogonal when they’re perpendicular. Formally, two vectors are called orthogonal when their inner product is $0$ . With the opposite and adjacent sides $\to a, \to b$ of the right unit triangle in the vector space $R^{2}$ ,

$\to a \cdot \to b = a_{1} b_{1} + a_{2} b_{2} = (0) (1) + (1) (0) = 0$

“It’s all just right triangles, dude.”
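
The dot product, norm, and orthogonality definitions above translate almost word for word into Python. This is a minimal sketch for $R^{n}$ with the ordinary dot product as the inner product; the function names are my own.

```python
import math

def dot(x, y):
    """Dot product x1*y1 + ... + xn*yn on R^n."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(v):
    """Length of v: the square root of v dotted with itself."""
    return math.sqrt(dot(v, v))

def orthogonal(x, y):
    """Two vectors are orthogonal when their dot product is 0."""
    return dot(x, y) == 0

a = [0, 1]
b = [1, 0]
print(orthogonal(a, b))   # True
print(norm([3, 4]))       # 5.0

# Pythagorean theorem for orthogonal vectors: |a + b|^2 = |a|^2 + |b|^2
s = [ai + bi for ai, bi in zip(a, b)]
print(dot(s, s) == dot(a, a) + dot(b, b))   # True
```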

7. Operators on Inner-Product Spaces

Complex Spectral Theorem: Suppose that $V$ is a complex inner-product space and $f \in L (V)$ . Then $V$ has an orthonormal^[12] basis consisting of eigenvectors of $f$ if and only if $f$ is normal^[13] (p. 133; notation converted).
…
Real Spectral Theorem: Suppose that $V$ is a real inner-product space and $f \in L (V)$ . Then $V$ has an orthonormal basis consisting of eigenvectors of $f$ if and only if $f$ is self-adjoint^[14] (p. 136; notation converted).
…
In other words, to multiply together two block diagonal matrices^[15] (with the same size blocks), just multiply together the corresponding entries on the diagonal, as with diagonal matrices.
A diagonal matrix is a special case of a block diagonal matrix where each block has size $1$ -by- $1$ . At the other extreme, every square matrix is a block diagonal matrix because we can take the first (and only) block to be the entire matrix… The smaller the blocks, the nicer the operator (in the vague sense that the matrix then contains more $0$ ’s). The nicest situation is to have an orthonormal basis that gives a diagonal matrix (p. 143).

The singular values of $f$ are the eigenvalues of $\sqrt{f^{*} f}$ , where each eigenvalue $λ$ is repeated $dim ker (\sqrt{f^{*} f} - λ I)$ times (p. 155).

Every operator on $V$ has a diagonal matrix with respect to some orthonormal bases of $V$ , provided that we are permitted to use two different bases rather than a single basis as customary when working with operators (p. 157).

8. Operators on Complex Vector Spaces

9. Operators on Real Vector Spaces

We have defined eigenvalues of operators; now we need to extend that notion to square matrices. Suppose $A$ is an $n$ -by- $n$ matrix with entries in $F$ . A number $λ \in F$ is called an eigenvalue of $A$ if there exists a nonzero $n$ -by- $1$ matrix $x$ such that
$A x = λ x$
For example, $3$ is an eigenvalue of $\begin{bmatrix} 7 & 8 \\ 1 & 5 \end{bmatrix}$ because
$\begin{bmatrix} 7 & 8 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 6 \\ -3 \end{bmatrix} = 3 \begin{bmatrix} 2 \\ -1 \end{bmatrix}$
(p. 194).
…
Suppose $f \in L (V)$ and $A$ is the matrix of $f$ with respect to some basis of $V$ . Then the eigenvalues of $f$ are the same as the eigenvalues of $A$ (p. 194; notation converted).
…
Cayley-Hamilton Theorem: Suppose $V$ is a real vector space and $f \in L (V)$ . Let $q$ denote the characteristic polynomial^[16] of $f$ . Then $q (f) = 0$ (p. 207; notation converted).

The Cayley-Hamilton theorem also holds on complex vector spaces generally (p. 173).
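
Both the eigenvalue example and the Cayley-Hamilton theorem can be checked by hand on the $2$ -by- $2$ matrix from the quoted example. The Python sketch below assumes the familiar $2$ -by- $2$ formula $q (x) = x^{2} - (a + d) x + (a d - b c)$ for the characteristic polynomial; the variable names are mine.

```python
def matmul(A, B):
    """Matrix product of two square matrices of the same size."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[7, 8],
     [1, 5]]

# Eigenvalue check from the text: A [2, -1]^T = 3 [2, -1]^T
x = [2, -1]
Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
print(Ax)   # [6, -3]

# Coefficients of the assumed 2-by-2 characteristic polynomial
# q(x) = x^2 - trace * x + det
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# Cayley-Hamilton: q(A) = A^2 - trace*A + det*I should be the zero matrix.
A2 = matmul(A, A)
q_of_A = [[A2[i][j] - trace * A[i][j] + (det if i == j else 0)
           for j in range(2)] for i in range(2)]
print(q_of_A)   # [[0, 0], [0, 0]]
```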

10. Trace and Determinant

The matrix of an operator $f \in L (V)$ depends on a choice of basis of $V$ . Two different bases of $V$ may give different matrices of $f$ (p. 214; notation converted).

Intuitively, the determinant of an operator $f$ is the change in volume $f$ effects. The determinant is negative when the operator ~~flips all the vectors~~ “inverts the volume” it works on.

If $V$ is a complex vector space, then $det f$ equals the product of the eigenvalues of $f$ , counting multiplicity… Recall that if $V$ is a complex vector space, then there is a basis of $V$ with respect to which $f$ has an upper-triangular matrix… thus $det f$ equals the product of the diagonal entries of the matrix (p. 222; notation converted).
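
For a $2$ -by- $2$ matrix, the “determinant equals the product of the eigenvalues” statement is easy to verify directly. The Python sketch below finds the eigenvalues with the quadratic formula and, to stay exact, only handles matrices whose discriminant is a perfect square. That restriction, like the function names, is an assumption of the sketch.

```python
import math

def det2(M):
    """Determinant of a 2-by-2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def eigenvalues2(M):
    """Eigenvalues of an integer 2-by-2 matrix via the quadratic formula.

    Solves x^2 - trace*x + det = 0. Assumes the discriminant is a
    nonnegative perfect square, which holds for the example below.
    """
    trace = M[0][0] + M[1][1]
    disc = trace * trace - 4 * det2(M)
    if disc < 0 or math.isqrt(disc) ** 2 != disc:
        raise ValueError("this sketch only handles perfect-square discriminants")
    root = math.isqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)

A = [[7, 8],
     [1, 5]]
low, high = eigenvalues2(A)
print(low, high)                 # 3.0 9.0
print(low * high == det2(A))     # True

# A reflection swaps the two axes and "inverts the volume".
print(det2([[0, 1], [1, 0]]))    # -1
```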

^[1]
Intuitively, a homomorphism is a function showing how the operation of vector addition can be translated from one vector space into another and back.
More precisely, a homomorphism is a function (here, from a vector space $V$ to a vector space $W$ ) such that
$f (\to v + \to x) = f (\to v) + f (\to x)$
with $\to v, \to x \in V$ and $f (\to v), f (\to x) \in W$ .
The vector addition symbol $+$ on the left side of the equality, inside the function, is defined in $V$ , and the addition symbol $+$ on the right side of the equality, between the function values, is defined in $W$ .
^[2]
Vectors can be interpreted geometrically as rays from the origin out to points in a space. Vectors can also be understood algebraically as ordered sets of numbers (with each number representing a coordinate over in the ray interpretation).
As far as notation goes, we’ll use variables with arrows $\to v$ for vectors, lowercase variables $x$ for numbers, and capital variables $V$ for other larger mathematical structures, such as vector spaces.
^[3]
In this book, that field $F$ will be either the reals $R$ or the complexes $C$ .
^[4]
Take note of how homomorphism-ish the below distributive relationships are!
^[5]
Vectors are conventionally written vertically. But each vector $\to v = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ can also be written as the transpose of a row, $[1, 0]^{T} = \to v = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ , where the entries are written out horizontally instead.
So we’ll use vector transposes to stay in line with conventional notation while not writing out those giant vertical vectors everywhere.
^[6]
One deep idea out of mathematics is that the dimensionality of a system is just the number of variables in that system that can vary independently of every other variable. You live in $3$ -dimensional space because you can vary your horizontal, vertical, and $z$ -dimensional position without necessarily changing your position in the other two spatial dimensions by doing so.
^[7]
Note that the set $\{\to 0\}$ , where $\to 0$ is a vector containing only $0$ any number $n \in N$ of times, satisfies the vector space axioms!
$\to 0 + \to 0 = \to 0 = (\to 0 + \to 0) + \to 0 = \to 0 + (\to 0 + \to 0)$
establishes closure under addition, existence of an additive identity, existence of an additive inverse for all vectors, additive commutativity, and additive associativity. Letting the field be the reals with $n, m \in R$
$n \to 0 = \to 0 = m (n \to 0) = (m n) \to 0 = 1 (\to 0)$
establishes closure under multiplication, multiplicative associativity, and the existence of a multiplicative identity. Finally,
$n (\to 0 + \to 0) = n \to 0 + n \to 0 = \to 0 = (n + m) \to 0 = n \to 0 + m \to 0$
establishes distributivity.
Any such vector space $\{\to 0\}$ has just one basis, $\emptyset$ . Intuitively, since you live at the origin, the origin is already spanned by no vectors at all, i.e., the empty set of vectors. Any additional vector would be redundant, so no other sets constitute bases for $\{\to 0\}$ .
^[8]
In math, the bigger and/or fancier the symbol, the bigger the set or class that symbol usually stands for.
^[9]
A vector $\to p$ can stand for a polynomial by containing all the coefficients in the polynomial, coefficients ordered by the degree of each coefficient’s monomial.
^[10]
This is addition of functions, $(f + g) (\to x) = f (\to x) + g (\to x)$ , on the left side of the equation. $I$ is the identity function.
^[11]
$dim V$ is the dimension of $V$ , formalized as the number of vectors in any basis of $V$ .
^[12]
Intuitively, orthonormal sets are nice sets of vectors like $\{[1, 0, 0]^{T}, [0, 1, 0]^{T}, [0, 0, 1]^{T}\}$ , where each vector has length one and is pointing out in a separate dimension.
More precisely, a set of vectors is called orthonormal when its elements are pairwise orthogonal and each vector has a norm of $1$ . We will especially care about orthonormal bases, like the set above with respect to $R^{3}$ .
^[13]
The adjoint of a linear map $f : V \to W$ is a linear map $f^{*} : W \to V$ such that the inner product of $f (\to v)$ and $\to w$ equals the inner product of $\to v$ and $f^{*} (\to w)$ for all $\to v \in V$ and $\to w \in W$ .
Remember that inner products aren’t generally commutative, so the order of arguments matters. Adjoints feel very anticommutative.
An operator $f \in L (V)$ on an inner-product space $V$ is called normal when
$f f^{*} = f^{*} f$
^[14]
An operator $f$ is self-adjoint when $f = f^{*}$ .
^[15]
A block diagonal matrix is a square matrix of the form
$\begin{bmatrix} A_{1} & & 0 \\ & \ddots & \\ 0 & & A_{m} \end{bmatrix}$
where $A_{1}, \dots, A_{m}$ are square matrices lying along the diagonal and all the other entries of the matrix equal $0$ (p. 142).
^[16]
Suppose $V$ is a complex vector space and $f \in L (V)$ . Let $λ_{1}, \dots, λ_{m}$ denote the distinct eigenvalues of $f$ . Let $d_{j}$ denote the multiplicity of $λ_{j}$ as an eigenvalue of $f$ . The polynomial
$(x - λ_{1})^{d_{1}} \dots (x - λ_{m})^{d_{m}}$
is called the characteristic polynomial of $f$ . Note that the degree of the characteristic polynomial of $f$ equals $dim V$ … the roots of the characteristic polynomial of $f$ equal the eigenvalues of $f$ (p. 172; notation converted).
Characteristic polynomials can also be defined for real vector spaces, though the reals are a little less well behaved as vector spaces than the complexes.
Suppose $V$ is a real vector space and $f \in L (V)$ . With respect to some basis of $V$ , $f$ has a block upper-triangular matrix [any entries acceptable above $A_{1}, \dots, A_{m}$ ] of the form
$\begin{bmatrix} A_{1} & & * \\ & \ddots & \\ 0 & & A_{m} \end{bmatrix}$
where each $A_{j}$ is a $1$ -by- $1$ or a $2$ -by- $2$ matrix with no eigenvalues. We define the characteristic polynomial of $f$ to be the product of the characteristic polynomials of $A_{1}, \dots, A_{m}$ . Explicitly, for each $j$ , define $q_{j} \in P (R)$ by
$q_{j} (x) = \begin{cases} x - λ & \text{if } A_{j} = [λ]; \\ (x - a) (x - d) - b c & \text{if } A_{j} = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \end{cases}$
Then the characteristic polynomial of $f$ is
$q_{1} (x) \dots q_{m} (x)$
Clearly the characteristic polynomial of $f$ has degree $dim V$ … The characteristic polynomial of $f$ depends only on $f$ and not on the choice of a particular basis (p. 206; notation converted).

Linear Algebra Done Right, Axler