Word importance in a text ⇐ the conditional information of the token given its context. Is this assumption valid?
Words that are harder to predict from context typically carry more information (or surprisal). Does more information/surprisal mean more importance, all else being equal (correctness, plausibility, etc.)?
A simple example: “This morning I opened the door and saw a ‘UFO’.” vs “This morning I opened the door and saw a ‘cat’.” — clearly “UFO” carries more information.
‘UFO’ seems more important here. But is that because it carries more information? This question touches on the information-theoretic nature of language.
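A concrete way to test the intuition on this example is to score both candidate words with a causal language model and compare their surprisal. The sketch below is my own, assuming the Hugging Face `transformers` library and GPT-2; any model that exposes next-token log-probabilities would serve the same purpose.

```python
# Minimal sketch (my assumptions: Hugging Face `transformers`, GPT-2):
# surprisal of a candidate word given the sentence prefix, in bits.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_surprisal_bits(prefix: str, word: str) -> float:
    """Return -log2 p(word | prefix), summed over the word's subword tokens."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids  # leading space = new word
    full_ids = torch.cat([prefix_ids, word_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    nats = 0.0
    for i in range(word_ids.shape[1]):
        pos = prefix_ids.shape[1] + i                    # position of the i-th word token
        nats += -log_probs[0, pos - 1, full_ids[0, pos]].item()  # predicted at the previous position
    return nats / math.log(2)                            # nats -> bits

prefix = "This morning I opened the door and saw a"
for word in ["UFO", "cat"]:
    print(f"{word}: {word_surprisal_bits(prefix, word):.2f} bits")
```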
If this is true, it becomes simple and useful to analyze the information density of a text with a large language model and visualize where the important parts are.
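A rough sketch of that density analysis: score every token of a passage in a single forward pass and print a per-token surprisal profile, essentially the importance map described above. The choice of GPT-2 and the crude console heat map are again my own assumptions, not something the text prescribes.

```python
# Rough sketch (same assumptions as above: `transformers` + GPT-2):
# per-token surprisal across a passage, as a crude information-density profile.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal_profile(text: str):
    """List of (token, surprisal in bits); the first token gets no conditional score."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    profile = []
    for i in range(1, ids.shape[1]):                     # token i is predicted at position i - 1
        nats = -log_probs[0, i - 1, ids[0, i]].item()
        profile.append((tokenizer.decode(ids[0, i].item()), nats / math.log(2)))
    return profile

text = "This morning I opened the door and saw a UFO hovering over the street."
for token, bits in surprisal_profile(text):
    print(f"{token!r:>12} {bits:5.2f} bits {'#' * round(bits)}")
```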
Ours is a world of information, layered above the physical one. When we read text, we take in information from a token stream, and the information density varies across that stream, just as the things we receive differ in "worth".
------
Theoretical Timeline
In the 1940s: Shannon's foundational information theory.
Around 2000, several key ideas pointed toward a regularity in the information-theoretic nature of language (all three are sketched in formulas after this list):
Entropy Rate Constancy (ERC) hypothesis: a word's entropy estimated without the wider context increases with its position in the text, so that its entropy conditioned on the full preceding context stays roughly constant.
Uniform Information Density (UID) hypothesis: humans tend to distribute information as evenly as possible across a text, a kind of "information smoothing pressure" that releases information gradually.
Surprisal Theory: Surprisal correlates almost linearly with reading times / processing difficulty.
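A rough formal sketch of the three ideas, in my own notation rather than the original papers': write w_i for the i-th word and w_<i for everything before it.

```latex
% Surprisal of a word in context:
s(w_i) = -\log_2 p(w_i \mid w_{<i})

% Entropy Rate Constancy: the conditional entropy per word stays roughly flat in i:
H(W_i \mid W_{<i}) \approx c

% Uniform Information Density: speakers prefer wordings that keep s(w_i) near that
% constant, i.e. the fluctuation \sum_i (s(w_i) - c)^2 stays small.

% Surprisal Theory: reading time for w_i grows roughly linearly in s(w_i).
```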
Now LLMs have arrived. LLMs × information theory: what kind of cognitive breakthrough might this bring to linguistics?
At least for now, one thing I can speculate is that Shannon information seems to set an upper bound on "importance."
Are we on the eve of re-understanding the information-theoretic nature of language?