Data and “tokens” a 30-year-old human “trains” on

I did some calculations with a bunch of assumptions and simplifications, but here’s a high-estimate, back-of-the-envelope calculation for the data and “tokens” a 30-year-old human would have “trained” on:

  • Visual data: 130 million photoreceptor cells firing at 10 Hz (1 bit per firing) = 1.3 Gbits/​s = 162.5 MB/​s over 30 years (approx. 946,080,000 seconds) = 153.74 Petabytes

  • Auditory data: Humans can hear frequencies up to 20,000 Hz; high-quality audio is sampled at 44.1 kHz, satisfying the Nyquist–Shannon sampling theorem. Assuming 16 bits per sample (CD quality) × 2 channels (stereo) = 1.41 Mbits/​s = 0.18 MB/​s over 30 years = 0.167 Petabytes

  • Tactile data: 4 million touch receptors providing 8 bits/​s each (assuming they account for temperature, pressure, pain, hair movement, and vibration) = 32 Mbits/​s = 4 MB/​s over 30 years = 3.78 Petabytes

  • Olfactory data: We can detect up to 1 trillion smells; assuming we process 1 smell every second and each smell is represented as its own piece of data, i.e. log2(1 trillion) ≈ 40 bits/​s = 0.000005 MB/​s over 30 years = 0.0000047 Petabytes

  • Taste data: 10,000 receptors, assuming a unique identifier for each basic taste (sweet, sour, salty, bitter, and umami): log2(5) ≈ 2.3 bits, rounded up to 3 bits per receptor per second = 30 kbits/​s = 0.00375 MB/​s over 30 years = 0.0035 Petabytes

This amounts to 153.74 + 0.167 + 3.78 + 0.0000047 + 0.0035 ≈ 157.69 Petabytes, which, assuming 5 bytes per token (i.e. 5 characters), amounts to roughly 31,500 T tokens (see the Python sketch below).

This is of course a high estimate, and most of this data is clearly highly compressible, but I wanted a rough upper bound. Here’s the Google sheet if anyone wants to copy it or contribute.
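For anyone who wants to check or tweak the numbers without the sheet, here is a minimal Python sketch of the same arithmetic; the per-sense rates and the 5-bytes-per-token conversion are the assumptions listed above, and the variable names and structure are just mine.

```python
# Back-of-the-envelope reproduction of the sensory-bandwidth estimate above.
# All rates are this post's assumptions, not measured values.

SECONDS_30_YEARS = 30 * 365 * 24 * 3600   # approx. 946,080,000 s (ignoring leap years)
BYTES_PER_PB = 1e15
BYTES_PER_TOKEN = 5                        # assumption: one token ~ 5 characters ~ 5 bytes

# Per-sense data rates in bytes per second.
rates_Bps = {
    "visual":    130e6 * 10 / 8,           # 130M photoreceptors firing at 10 Hz, 1 bit per firing
    "auditory":  44_100 * 16 * 2 / 8,      # 44.1 kHz, 16-bit samples, stereo
    "tactile":   4e6 * 8 / 8,              # 4M touch receptors at 8 bits/s each
    "olfactory": 40 / 8,                   # log2(1 trillion) ~ 40 bits, one smell per second
    "taste":     10_000 * 3 / 8,           # 10k receptors, log2(5) ~ 2.3 bits rounded up to 3
}

totals_pb = {sense: r * SECONDS_30_YEARS / BYTES_PER_PB for sense, r in rates_Bps.items()}
total_pb = sum(totals_pb.values())
total_tokens_T = total_pb * BYTES_PER_PB / BYTES_PER_TOKEN / 1e12   # trillions of tokens

for sense, pb in totals_pb.items():
    print(f"{sense:>9}: {pb:12.6f} PB over 30 years")
print(f"    total: {total_pb:10.2f} PB  ≈ {total_tokens_T:,.0f} T tokens at 5 bytes/token")
```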

Discussion
The motivation for this was the “chinchilla’s wild implications” post by nostalgebraist, and more generally the idea that humans “need” much less data to train on than AI.

According to these calculations, humans “train” on a lot of data: around 5 Petabytes per year, which amounts to ~1,000 T tokens per year. Given that we currently train models on the order of 1 T tokens, the argument that the human brain learns more efficiently from less data than current LLMs is not a fair comparison. Geoff Hinton recently changing his mind about backpropagation, from considering it a less efficient learning algorithm than whatever the brain is doing to a much more efficient one, reinforces this point.
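As a quick sanity check on that per-year comparison, a few more lines of Python, reusing the totals from the sketch above (the ~1 T token figure for current models is the order of magnitude referenced in this post, not a measurement):

```python
# Per-year comparison, reusing the ~157.7 PB / ~31,500 T-token totals computed above.
total_pb = 157.69
total_tokens_T = 31_500
years = 30

pb_per_year = total_pb / years              # ≈ 5.3 PB of raw sensory data per year
tokens_T_per_year = total_tokens_T / years  # ≈ 1,050 T "tokens" per year
llm_training_tokens_T = 1                   # order of magnitude for current LLMs, per this post

print(f"≈{pb_per_year:.1f} PB/year and ≈{tokens_T_per_year:,.0f} T tokens/year, "
      f"vs. ≈{llm_training_tokens_T} T training tokens for current models")
```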