I would be interested in a more precise definition of what you mean by information here.
In particular, it seems like you’re using an unintuitive (to me) definition of information, though one that lines up colloquially with how we talk about computers.
For example, let’s say I have a thumb drive (“Drive A”) with two things on it:
A very short program that computes the digits of pi
A petabyte of computed digits of pi
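To make the first item concrete, here is one minimal sketch of what that “very short program” could look like: a Python generator for the decimal digits of pi using Gibbons’ unbounded spigot algorithm (any short pi-digit program would do for the thought experiment):

```python
def pi_digits():
    """Yield decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

gen = pi_digits()
print("".join(str(next(gen)) for _ in range(20)))  # 31415926535897932384
```

A dozen lines like these can, given enough time, reproduce the entire petabyte of digits sitting next to them on the drive.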
And I have a second drive (“Drive B”) with one thing on it:
The millions of lines of source code for the Linux kernel
I might ask someone: which of these has more information on it?
The colloquial, computer-storage-based answer might be: the first one! It takes up a petabyte, whereas the second one takes up less than a gigabyte.
But it feels like something important about the meaning of information (in an AI-understanding-the-world-sense) is being lost here.
(ETA: Also, if determinism factors in here, feel free to replace the petabyte of pi digits with something like a petabyte of recordings from a true random number generator (TRNG).)
The definition of information I have in mind is based on some sort of classification or generation loss. So for instance, if you trained GPT-3 on the second drive, I would expect it to get lower loss on the datasets it was evaluated on (some sort of mixture of Wikipedia and other internet scrapes, if I recall correctly) than it would if it were trained on the first drive. So by the measures I have in mind, the second drive would very possibly contain more information.
(Though of course this depends on your exact loss function.)
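To gesture at the kind of comparison I mean with a toy that fits in a comment: below, a unigram character model stands in for GPT-3, and short strings stand in for the drives’ contents and the evaluation scrape. All of the strings are hypothetical placeholders, and the model is far too crude to prove anything; it just shows the shape of the measurement.

```python
import math
from collections import Counter

def char_cross_entropy(train_text, eval_text):
    """Bits per character of eval_text under a unigram character model
    fit on train_text, with add-one smoothing over the combined alphabet."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(eval_text)
    total = len(train_text) + len(vocab)  # add-one smoothing denominator
    return -sum(math.log2((counts[c] + 1) / total) for c in eval_text) / len(eval_text)

# Hypothetical stand-ins for the contents of the two drives and the eval data.
drive_a = "31415926535897932384626433832795028841971693993751"         # computed digits of pi
drive_b = "static int __init kernel_init(void *unused) { return 0; }"  # kernel-style C snippet
eval_set = "The kernel schedules processes and manages virtual memory."

print("eval loss, trained on drive A:", char_cross_entropy(drive_a, eval_set))
print("eval loss, trained on drive B:", char_cross_entropy(drive_b, eval_set))
```

On this toy measure the drive full of pi digits scores badly, because nothing it “learned” transfers to the evaluation text, which is the sense in which I want to say it carries less information.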
My post is basically based on the observation that we often train machine learning systems with some sort of information-based loss, whether that be self-supervised/generative, fully supervised, or something more complicated than that. Even if you achieve a better loss for your model, you won’t necessarily achieve a better reward for your agent.