To me there is a difference between hardware built for a 1xN by Nx1 product and hardware built for an MxN by NxM product (with N > M > 1). Although any matrix multiplication decomposes into many 1xN by Nx1 dot products, computing them independently would be inefficient.
“If you do a matrix multiplication the obvious way, this results in dot products of rows and columns (one for each element of the resulting matrix). So it seems to me that improving matrix to matrix multiplication performance comes from improving the performance of dot products.”
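True, but the gain comes from the collection of very many dot products taken together, not from the individual ones. In an MxN by NxM product, each row of the first matrix and each column of the second takes part in M different dot products. So you do not do it the obvious way, because you would have to load the same data over and over again.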
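To make the reuse argument concrete, here is a minimal Python sketch that counts operand loads. It is only a software model, not a description of any particular hardware: the function names `naive_matmul` and `tiled_matmul`, the tile size `T`, and the `loads` counter are all made up for illustration, and "load" here just means fetching one matrix element.

```python
def naive_matmul(A, B):
    """The 'obvious way': one independent dot product per output element."""
    M, N, P = len(A), len(A[0]), len(B[0])
    C = [[0.0] * P for _ in range(M)]
    loads = 0  # hypothetical counter: one per matrix element fetched
    for i in range(M):
        for j in range(P):
            acc = 0.0
            for k in range(N):
                acc += A[i][k] * B[k][j]
                loads += 2  # both operands are fetched again for every product
            C[i][j] = acc
    return C, loads


def tiled_matmul(A, B, T=2):
    """Compute a T x T block of outputs together so each fetched value is reused."""
    M, N, P = len(A), len(A[0]), len(B[0])
    C = [[0.0] * P for _ in range(M)]
    loads = 0
    for i0 in range(0, M, T):
        for j0 in range(0, P, T):
            rows = range(i0, min(i0 + T, M))
            cols = range(j0, min(j0 + T, P))
            for k in range(N):
                a = [A[i][k] for i in rows]  # fetched once, used for every column in the tile
                b = [B[k][j] for j in cols]  # fetched once, used for every row in the tile
                loads += len(a) + len(b)
                for ii, i in enumerate(rows):
                    for jj, j in enumerate(cols):
                        C[i][j] += a[ii] * b[jj]
    return C, loads


if __name__ == "__main__":
    # 4x8 by 8x4, i.e. M = 4 and N = 8 in the notation above.
    A = [[float(i + k) for k in range(8)] for i in range(4)]
    B = [[float(k - j) for j in range(4)] for k in range(8)]
    _, naive_loads = naive_matmul(A, B)
    _, tiled_loads = tiled_matmul(A, B, T=2)
    print(naive_loads, tiled_loads)
```

Both functions perform the same multiplications and produce the same result; only the number of fetches differs. In this model, the 4x8 by 8x4 case costs 256 loads the obvious way and 128 with 2x2 tiles, and a larger tile reduces it further.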