The SIMD vector approach works well for many independent dot products, but it is not the same thing as a systolic array for matrix-by-matrix (rank-2 by rank-2) multiplication.
If the job were to multiply two 1024x1024 matrices, a systolic array of 256x256 MACs would be a good choice: it would run sixteen passes (the 1024x1024 output splits into a 4x4 grid of 256x256 tiles), each pass multiplying a 256x1024 strip by a 1024x256 strip in roughly 1024+256 steps.
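A rough sketch of that tiling, in plain Python rather than hardware: each call to `matmul_tile` stands in for one pass of the systolic array, producing one output tile from a row strip of A and a column strip of B. The sizes are scaled down (N=8 for 1024, T=2 for 256) so it runs quickly; the pass count (N/T)^2 = 16 matches the full-size case.

```python
N, T = 8, 2  # scaled-down stand-ins for 1024 (matrix size) and 256 (MAC array size)

def matmul_tile(A, B, ri, ci, n, t):
    """One 'systolic array pass': a t-by-n row strip of A times an
    n-by-t column strip of B gives one t-by-t output tile."""
    return [[sum(A[ri + i][k] * B[k][ci + j] for k in range(n))
             for j in range(t)] for i in range(t)]

def blocked_matmul(A, B, n, t):
    C = [[0] * n for _ in range(n)]
    passes = 0
    for ri in range(0, n, t):        # tile rows of the output
        for ci in range(0, n, t):    # tile columns of the output
            tile = matmul_tile(A, B, ri, ci, n, t)
            for i in range(t):
                for j in range(t):
                    C[ri + i][ci + j] = tile[i][j]
            passes += 1              # one systolic-array pass per tile
    return C, passes

A = [[(i + j) % 5 for j in range(N)] for i in range(N)]
B = [[(i * j) % 7 for j in range(N)] for i in range(N)]
C, passes = blocked_matmul(A, B, N, T)
print(passes)  # 16 passes, just as 1024x1024 needs 16 passes on a 256x256 array
```

The point of the decomposition is that the array size fixes the tile shape, while the step count per pass (k + t, here 1024 + 256) comes from streaming the shared dimension through the array plus draining the pipeline.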
Because the SIMD approach handles 2D-by-2D matrix multiplication poorly, NVIDIA introduced Tensor Cores in the Volta architecture.
Article about it:
https://www.anandtech.com/show/12673/titan-v-deep-learning-deep-dive/3