I pretty much agree; in my experiments I haven’t managed to get a metric that scales how I expect it to. For example, when using adapter fine-tuning to “learn” a text and looking at the percent improvement in perplexity, the document openai_board_ann appeared more novel than wikipedia on LK-99, but I would expect it to be the other way round, since the LK-99 observations are much more novel and dense than a corporate announcement that is designed to be vague.
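Concretely, the metric I tried is roughly the following (a minimal sketch assuming a Hugging Face causal LM; the adapter fine-tuning step that produces `adapted_model` is elided):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tok, text: str) -> float:
    """Perplexity of `text` under `model` (teacher-forced, whole sequence)."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def novelty(base_model, adapted_model, tok, doc: str) -> float:
    """Percent improvement in perplexity from adapter fine-tuning on `doc`.
    `adapted_model` is assumed to be `base_model` plus an adapter trained
    on `doc`; the training loop itself is omitted here."""
    before = perplexity(base_model, tok, doc)
    after = perplexity(adapted_model, tok, doc)
    return (before - after) / before
```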
However, I would point out that gzip is not a good example of a compression scheme for novelty, because 1) it is a compression scheme that roughly captures word-level duplication. A language model represents a much more sophisticated compression scheme, one that is closer to our understanding of the text. If we want to measure novelty to us, then we probably want a compression scheme that is similar to how our brain compresses information into memory; that way, something surprising to us is also hard to compress. And I’d also point out that 2) gzip cannot learn (except in the very basic sense of an increased context), so it cannot beat the noisy TV problem.
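To make the objection concrete, the usual gzip-style novelty measure is some variant of the following sketch (my illustration, not anyone’s exact code):

```python
import gzip

def gzip_novelty(corpus: str, doc: str) -> int:
    """Extra bytes needed to compress `doc` when `corpus` precedes it.
    Note what this rewards: texts with few repeated substrings relative
    to the corpus, not texts that are semantically surprising."""
    baseline = len(gzip.compress(corpus.encode()))
    combined = len(gzip.compress((corpus + doc).encode()))
    return combined - baseline
```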
When I use the Playground’s option to highlight words by their log likelihood, the high-perplexity tokens or passages bear little resemblance to what I would consider ‘interesting’ or ‘surprising’.
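For anyone who wants to reproduce that kind of highlighting locally, here is a rough sketch (assuming a Hugging Face causal LM; `gpt2` is just a stand-in model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The sample levitated above the magnet at room temperature."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Log-probability the model assigned to each token, given the tokens before it.
logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
tok_lp = logprobs.gather(1, ids[0, 1:, None]).squeeze(1)
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), tok_lp):
    print(f"{lp.item():8.2f}  {t}")  # most negative = most "surprising"
```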
I agree, but it doesn’t learn, so it doesn’t get past the noisy TV problem either, and that problem is central to Schmidhuber’s idea. If you are not familiar, the noisy TV problem is this:
“agents are rewarded for visiting regions of the state space that they have not previously occupied. If, however, a particular state transition is impossible to predict, it will trap a curious agent (Burda et al., 2019b; Schmidhuber, 1991a). This is referred to as the noisy TV problem (e.g. (Burda et al., 2019b; Schmidhuber, 1991a)), the etymology being that a naively curious agent could dwell on the unpredictability of a noisy TV screen” from “How to Stay Curious while avoiding Noisy TVs using Aleatoric Uncertainty Estimation”
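A toy illustration of the trap (my own sketch, numpy only): a prediction-error-curious agent’s model error vanishes on a predictable state but plateaus at the noise variance on a “noisy TV” state, so the curiosity bonus there never dries up.

```python
import numpy as np

rng = np.random.default_rng(0)

# State 0 is fully predictable (observation is always 1.0); state 1 is a
# "noisy TV" (observation is uniform noise). The agent's world model is a
# running-mean predictor per state; its curiosity reward is the squared
# prediction error.
pred = np.zeros(2)
lr = 0.1

for _ in range(10_000):
    s = rng.integers(2)
    obs = 1.0 if s == 0 else rng.uniform()
    pred[s] += lr * (obs - pred[s])  # the predictor learns

# Expected curiosity reward each state still pays out after learning:
for s in (0, 1):
    obs = np.ones(1_000) if s == 0 else rng.uniform(size=1_000)
    print(s, np.mean((obs - pred[s]) ** 2))
# state 0 -> ~0.0; state 1 -> ~1/12 (the variance of uniform noise), so a
# prediction-error-curious agent keeps getting paid to watch static.
```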
So I am unsure his compression metrics would work without a lot of revision, while my proposed metrics seem a lot less risky and map more directly onto what creative thinkers want out of generative models.
I agree, this is true of most of Schmidhuber’s ideas. Often he doesn’t even produce a toy model for years, which means the ideas are generally not very useful in practice. I do like this one, though, and it has led to some implementations in RL.
I do agree, perplexity doesn’t seem like a great place to start, and your ideas seem like a better way to measure novelty.