The GPU needs numbers to be stored in registers inside the GPU before it can do operations on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you do an elementary arithmetic operation such as addition or multiplication on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor so are called ALU ops.
Because a matrix multiplication of two N×N matrices only involves 2N² distinct floating point numbers as input, and writing the result back into memory is going to cost you another N² memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is 3N². In contrast, if you’re using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N multiplications and N additions, so you end up with 2N⋅N² = 2N³ ALU ops needed.
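To make the counting concrete, here's a small sketch (the function name is just for illustration) that tallies the MEM and ALU ops for a naive N×N matmul exactly as described above:

```python
def matmul_op_counts(n):
    """Operation counts for a naive n x n matrix multiplication."""
    mem_ops = 3 * n * n    # load 2*n^2 input values + store n^2 output values
    alu_ops = 2 * n ** 3   # each of the n^2 outputs needs n multiplies + n adds
    return mem_ops, alu_ops

mem, alu = matmul_op_counts(1024)
print(mem, alu, alu / mem)
```

Note that the ALU:MEM ratio of the computation itself is 2N³ / 3N² = 2N/3, so it grows linearly with N: bigger matrices do more arithmetic per value moved.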
The ALU:MEM ratio is important because if your computation is imbalanced relative to what your hardware supports, you’ll end up bottlenecked by one resource and unable to exploit the surplus you have on the other side. For instance, if you’re working with a bizarre GPU that has a 1:1 ALU:MEM ratio and you only use it for matrix multiplications, you’ll have enormous amounts of MEM ops capacity sitting idle, because you don’t have enough ALU capacity to keep the memory system busy.
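A quick way to see the bottleneck effect is to model runtime as whichever side finishes last. This sketch assumes the hypothetical 1:1 GPU above, with made-up throughput numbers (1e12 ops/s on each side) purely for illustration:

```python
def matmul_utilization(n, alu_rate=1e12, mem_rate=1e12):
    """Fraction of ALU and MEM capacity actually used during an n x n matmul
    on a hypothetical device with the given ops/second on each side."""
    mem_ops = 3 * n * n
    alu_ops = 2 * n ** 3
    # The slower side determines the total runtime.
    t = max(alu_ops / alu_rate, mem_ops / mem_rate)
    return alu_ops / (alu_rate * t), mem_ops / (mem_rate * t)

alu_util, mem_util = matmul_utilization(4096)
print(alu_util, mem_util)
```

With a 1:1 hardware ratio and large N, the ALU side runs at full utilization while the memory side is almost entirely idle, which is exactly the imbalance described above.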
This is helpful, thanks a ton Ege!