Good questions. First, but answering your last question, my thesis is really just about proposing another way to quantify computational WORK. Currently, we use application-generic measures like Flops or application-specific measures like images/tokens processed. These definitions of work can then be normalized by unit time, energy, power, GHG emissions, water, etc, but my focus is on getting the numerator right (work) regardless of the denominator. I’m also only interested in application-generic measures of work.
You’ve got it exactly right that the amount of work done is dependent on the runtime data seen. This is a critical aspect. Mutual information is this runtime quantity that measure the actual work performed whereas channel capacity is an upper-bound on the mutual information possible and is only dependent on the hardware itself. The channel capacity is what would show up on a spec sheet, and MI is what would be plotted for a kernel on a roofline plot or similar runtime evaluation.
You’ve also got it right that if a compute operation always outputs the same value, its mutual information (and computational work) is zero. Could this operation not just be optimized away with a fixed constant? If so, what was the use of performing the operation? Either the inputs are fixed, or the inputs vary but the output is always the same. In both cases, replacing the output with a constant changes nothing. Shouldn’t this mean there was no value to performing the operation in the first place? This is where discussions about structural sparsity come into play. If we know a certain input is 0 and we’re multiplying it with another number, we already know what the outcome is, so there is no reduction of uncertainty by multiplying by a known 0.
The general intuition behind this work is that, just like communication, computation is not just about the outputs you see, its about what outputs you DIDN’T see. We use math to formalize relationships and to talk about the exactness of real numbers, but no hardware in finite space/time can operate on the infinite set of real numbers. Instead, when we get an output value, it’s simply one of 2^64 (or 2^32, 2^16, 2^8, 2^4, …) possible output states. An FP64 operation manages a larger state space than an NVFP4 operation does. Bit width approximations of Flops (where a Flops value is scaled by the bit-width of the operands) is sometimes used to capture this difference in complexity—like with the TPP export control definitions. However, this is exactly where the runtime distribution of data matters. If I know my inputs are restricted to NVFP4 range, and computing in FP64 doesn’t use any more number of states than computing in NVFP4, should such an inefficient program’s work really scale by a factor of 16? I think not, and that’s why the runtime data is critical to measuring computational work IMHO.
Essentially, how much uncertainty about the output does an operation resolve? That’s what I think we should measure. By doing so, we unify computation and communication performance measurement and are able to generalize computing measures across digital, noisy, analog, neuromorphic, and maybe even quantum regimes. Communication has had a theoretically-grounded and universal measure of throughput for ~80 years. Why doesn’t computing?
There’s more, but I’ll leave it here. Please feel free to email me anytime: mhawkins60@gatech.edu
I look forward to seeing it. I’m not sure what that characterization would look like, but I personally would like to see a diagram/roadmap of computing from raw resource extraction through manufacturing as well as an organized list of computing architectures. It seems there are enough novel devices out there to warrant some categorization and listing.