I think the point is that when you increase the size of the dataset, you expose more work for the parallelism to handle.

If I have a 1 KB dataset and a partially parallel algorithm to run on it, I will very quickly ‘run out of parallelism’ and find that 1,000 processors are no better than 2 or 3. Whereas if I have a 1 PB dataset of the same kind of data and run the same algorithm, I will be able to keep adding processors for a long time before I finally run out of parallelism.
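This is essentially Gustafson's observation about scaled speedup. Here's a minimal sketch (my own illustrative model, not anything from the comment above): assume a fixed amount of serial work per run, with the parallel portion growing in proportion to the data size.

    # Illustrative assumption: 1e3 units of fixed serial work,
    # parallel work proportional to the data size.
    def speedup(serial, parallel, procs):
        t1 = serial + parallel            # wall time on one processor
        tp = serial + parallel / procs    # wall time on `procs` processors
        return t1 / tp

    for parallel in (1e3, 1e15):          # ~1 KB vs ~1 PB of parallelizable work
        for procs in (2, 3, 1000):
            print(f"data={parallel:.0e} procs={procs:4d} "
                  f"speedup={speedup(1e3, parallel, procs):.1f}")

With the small dataset the fixed serial cost dominates almost immediately (1,000 processors buy roughly a 2x speedup, barely better than the 1.3x–1.5x you get from 2 or 3), while with the large dataset the speedup tracks the processor count.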