I’m a crypto researcher at $dayjob, and I work with zero-knowledge proofs daily. Practical zk-proofs are implemented as arithmetic circuits, which allow efficient proofs about adding, subtracting, multiplying, and comparing integers, typically around 256 bits in length. Obviously any integer math is trivial to prove, and so are fixed-precision or rational numbers. But general floating-point types can’t be efficiently encoded as operations on integer values of this precision. So you’d have to either (1) restrict yourself to fixed-precision numbers (which also avoids all the famous problems with floating-point math exploited in the story), or (2) use the equivalent of software-defined floating point on top of arithmetic circuits, which blows up proof sizes and computation time by roughly the same factor that software floating point is slower than hardware floating point (which is a lot). No exaggeration: if your zk-proof took about a second to compute and is tens of kilobytes in size—typical of real systems in use—then a floating-point version might take minutes or hours to compute and be megabytes in size. Totally impractical, so no, no one does this.
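For concreteness, here is a minimal Python sketch of the fixed-precision alternative in (1). It isn’t tied to any particular proof system; the modulus and scale are illustrative. The point is that a fixed-precision value is just an integer, so it maps directly onto the field arithmetic a circuit natively constrains, whereas a single IEEE-754 operation would have to be decomposed into a large pile of bit-level constraints.

```python
# Minimal sketch, assuming an illustrative ~256-bit prime field and an
# 8-decimal-place fixed-point scale (neither is from any specific system).

P = 2**255 - 19      # illustrative prime modulus
SCALE = 10**8        # fixed-point scale: 8 decimal places

def encode(x: float) -> int:
    """Represent a value as a field element with fixed precision."""
    return int(round(x * SCALE)) % P

def decode(a: int) -> float:
    return a / SCALE

def fixed_mul(a: int, b: int) -> int:
    """One integer multiply plus a rescale; cheap to constrain compared
    with emulating IEEE-754 floating point inside the circuit."""
    return (a * b // SCALE) % P

balance = encode(12.5)
rate = encode(1.03)
print(decode(fixed_mul(balance, rate)))  # 12.875 exactly -- no FP rounding
```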
(If you wanted a crypto exploit that allows for arbitrary inflation, I would have used a cofactor vulnerability like the one Monero was hit with back in 2017, or a weakness in the inner-product argument of the Bulletproofs used in Mimblewimble, or a weakness in the pairing curve used for Zcash proofs, etc. Not floating point.)
I’ll take your word on Engrish. I’ve never used that word online, so I don’t know what the custom is here. Just speaking as someone who has spent significant time in Taiwan and Japan, I’ve only seen the word used among expats in Japan. The construction of the word is specific to Japanese, which does not distinguish between the l and r phonemes. Mandarin, however, does make that distinction. Chinese speakers have many issues with English, to be sure, but this isn’t one of them. I can see how the word could have taken on a broader meaning outside the context in which it was coined, however.
The 5 orders of magnitude number comes from a rule of thumb for the general speedup you can get by reducing complex but highly parallel computation to an ASIC implementation on state-of-the-art process nodes. It is, for example, the rough speedup you get from moving from GPU to ASIC for Bitcoin mining, and I believe it is about the same for hardware raytracing. Neural nets are outside my area of expertise, but from afar I understand them to be a similarly “embarrassingly parallel” application where such speedups can occur. I’m open to being shown wrong here. However, that multiplier also shows up independently in latency numbers: HPC switching (e.g. InfiniBand) can be sub-100 ns, but inter-cloud latency is in the tens of milliseconds. That’s a factor of 100,000x. I felt I was being generous in assuming that only one of these effects would be the bottleneck, but it is also possible there’d be a larger combined slowdown.
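To spell out the latency arithmetic (both figures are just the rough ones quoted above, not measurements):

```python
# Back-of-the-envelope check of the latency ratio cited above.
hpc_hop = 100e-9       # ~100 ns InfiniBand-class switching
inter_cloud = 10e-3    # ~10 ms inter-cloud round trip
print(inter_cloud / hpc_hop)  # 100000.0 -- the 5-orders-of-magnitude factor
```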
None of those points are central to the question of whether a hard take-off is possible, however. But they are essential to a heuristic I use to evaluate whether someone’s claims are credible: if you wander outside of your area of expertise and into mine, I assume you at least consulted an expert to review and fact-check the basic stuff. If you didn’t, why should I trust anything you say about other domains, like neural net architectures? Your story hinges on there being a sort of phase transition which causes a step function in the performance and general intelligence of Clippy. You’ve got links to papers whose abstracts seem to back that claim up. But you also similarly hand-waved with citations about floating point and zero-knowledge proofs. How do I know your assertions about AI are more credible?
I guess I’m a bit crusty on this because I feel Eliezer’s “That Alien Message” really did damage by priming people with the wrong intuitions about the relative speed advantages of near-term AI, even presuming a hardware overhang. This story feels like the same sort of thing, and I fear people will accept it as a persuasive argument, regardless of whether they should.
Your floating-point counterargument is irrelevant. Yes, it would be a bad idea. You already said that. You did not address any of my points about bad ideas being really, really common in crypto (is mixing in some floating point really worse than, say, using ternary for everything binary? That is a real-world cryptocurrency which already exists. And while I’m at it, the FP inefficiency might be a reason to use FP—remember how Bytecoin and other scams worked by obfuscating their code and blockchain), nor did you offer any particular reason to think that this specific bad idea would be almost impossible. People switch between floating point and integer all the time; compilers do all sorts of optimizations or fallbacks which break basic security properties; there are countless ways to screw up crypto; secure systems can be composed in insecure ways; and so on.
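As one concrete illustration of the “switching between floating point and integer” hazard, here is a minimal Python example; the scenario is hypothetical (Python’s float is an IEEE-754 double, the same width a careless implementation might route balances through):

```python
# Hypothetical scenario: an integer token amount round-trips through a
# 64-bit double somewhere in the pipeline.
amount = 2**53 + 1            # a perfectly legal integer balance
after = int(float(amount))    # doubles only carry 53 bits of mantissa
print(amount == after)        # False -- the +1 was silently dropped
print(amount - after)         # 1 unit of discrepancy, for free
```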
You’ll “take my word on Engrish”? You don’t need to; I provided WP and multiple dictionaries. There is nothing hard about “and other Asian languages” or movie examples about going to a Chinese restaurant and making fun of it. If you don’t know anything about the use of ‘Engrish’ and don’t bother to check a single source even when they are served to you on a silver platter, why on earth are you going around talking about how it discredits me? This is bullshit, man. “Spot-checking” doesn’t work if you’re not checking, and using your expertise to check for Gell-Mann amnesia doesn’t work if you don’t have expertise. That you don’t even care that you were so trivially wrong bothers me more than you being wrong.
No response to the unprofessional criticism, I see. How unprofessional.
“Neural nets are outside my area of expertise”
Pity this story is about neural nets, then. In any case, I still don’t see where you are getting 10,000x from or how ASICs are relevant, or how any of this addresses the existing and possible techniques for running NNs across many nodes. Yes, we have specialized ASICs for NN stuff which work better than CPUs. They are great. We call them “TPUs” and “GPUs” (you may have heard of them), and there’s plenty of discussion about how the usual CPU->ASIC speedup has already been exhausted (as Nvidia likes to point out, the control-flow part you remove to get those speedups in examples like video codecs is already a small part of the NN workload, and you pay a big price in flexibility if you try to get rid of what’s left—as specialized AI chip companies keep finding out the hard way when no one can use their chips). I mean, just think critically for a moment: if the speedup from specialized hardware vs. more broadly accessible hardware really was >>10,000x, if my normal Nvidia GPU were 1/10,000th the power of a comparable commercial chip, how or why is anyone training anything on regular Nvidia GPUs? With ratios like that, you could run your home GPUs for years and not get as much done as on a cloud instance in an hour or two. Obviously, that’s not the case. And even granting this, it still has little to do with how much slower a big NN is going to run with Internet interconnects between GPUs instead of on GPU/TPU clusters.
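To spell out that reductio with numbers (the ratio is hypothetical, which is the point):

```python
# If the claimed specialized-vs-consumer speedup were real, "an hour or two"
# of specialized compute would equal years of home-GPU time.
ratio = 10_000          # hypothetical speedup factor under dispute
cloud_hours = 2         # "an hour or two" on the specialized hardware
home_gpu_years = ratio * cloud_hours / (24 * 365)
print(home_gpu_years)   # ~2.3 years of home-GPU time for the same work
```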
Gwern, you seem to be incapable of taking constructive criticism, and worse you’ve demonstrated an alarming disregard for the safety of others in your willingness to doxx someone merely to score a rhetorical point. Thankfully in this case no harm was done, but you couldn’t have known that and it wasn’t your call to make.
I will not be engaging with you again. I wish you the best.