A while ago I mentioned how I’d set up some regexes in my browser to alert me to certain suspicious words that might be indicative of weak points in arguments.
I still have this running. It didn’t have the intended effect, but it is still slightly more useful than it is annoying. I keep on meaning to write a more sophisticated regex that can somehow distinguish the intended context of “rather” from unintended contexts. Natural language is annoying and irregular, etc., etc.
Just lately, I’ve been wondering if I could do this with more elaborate patterns of language. It’s recently come to my attention that expressions of the form “in saying [X] (s)he is [Y]” are often indicative of sketchy value-judgement attribution. They’re also very easy to capture with a regex. It’s gone on the list.
So, my question: what patterns of language are (a) indicative of sloppy thinking, weak arguments, etc., and (b) reliably captured by a regex?
(In the back of my mind, I am imagining some sort of sanity-equivalent of a spelling and grammar check that you can apply to something you’ve just written, or something you’re about to read. This is probably one of those projects I will start and then abandon, but for the time being it’s fun to think about.)
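The kind of watch-list described above can be sketched in a few lines. This is a minimal illustration using Python’s re module; the patterns, note strings, and sample sentence are my own illustrative guesses, not the actual browser regexes:

```python
import re

# Hypothetical watch-list: each entry pairs a compiled pattern with a note
# about the weakness it may signal. These are illustrative, not exhaustive.
WATCH_LIST = [
    (re.compile(r"\brather\b", re.IGNORECASE),
     "'rather' sometimes smuggles in a false dichotomy"),
    # "(s)he" is approximated with the alternation he|she|they.
    (re.compile(r"\bin saying\b.{1,80}?\b(?:he|she|they)\s+(?:is|are)\b",
                re.IGNORECASE),
     "possible sketchy value-judgement attribution"),
]

def flag_suspicious(text):
    """Return (matched_text, note) pairs for every watch-list hit."""
    hits = []
    for pattern, note in WATCH_LIST:
        for match in pattern.finditer(text):
            hits.append((match.group(0), note))
    return hits

sample = "In saying that we must act now, he is claiming moral authority."
for matched, note in flag_suspicious(sample):
    print(matched, "->", note)
```

A browser userscript would do the same thing in JavaScript, but the idea — a flat list of (pattern, warning) pairs scanned over the page text — carries over directly.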
The pair “tend to always” or “always tend to”. Sometimes they come off to me as a way to exploit the rhetorical force of “always” while committing only to a hedged “tend to”, in which case they can condense a two-step of terrific triviality into three words. There are likely other phrases that can provide plausibly deniable pseudo-certainty but I can’t think of any.
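Both orderings of the hedge-plus-certainty pair fit in a single alternation. A sketch, with one extra candidate phrase (“almost invariably”) added as my own guess at a similar construction:

```python
import re

# One alternation covers both orderings of the hedge/certainty pair;
# the phrase list is illustrative and easy to extend.
PSEUDO_CERTAINTY = re.compile(
    r"\b(?:tend to always|always tend to|almost invariably)\b",
    re.IGNORECASE,
)

def has_pseudo_certainty(sentence):
    """True if the sentence uses a plausibly-deniable certainty phrase."""
    return bool(PSEUDO_CERTAINTY.search(sentence))
```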
More generally, the Unix utility diction tries to pick out “frequently misused, bad or wordy diction”, which is a kinda related precedent.
When they come in the form of portentous pronouncements, Daniel Dennett calls these “deepities”: ambiguous expressions with one meaning that is trivially true but unimportant, and another that is obviously false but would be earth-shatteringly significant if it were true.
Also related in cold reading is the Rainbow Ruse.
“[...]may be the case[...]”

Sometimes this phrase is harmless, but sometimes it is part of an important enumeration of possible outcomes/counterarguments/whatever. If “the case” comes with neither a solid plan or argument nor an explanation of why it is unlikely or unimportant, then it is often there just to make the author and/or the audience feel that all the bases have been covered. E.g.,
We should implement plan X. It may be the case that [important weak point of X], but [unrelated benefit of X].
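Since whether the concession is actually addressed is a judgement call, a checker can only surface the hedged sentences for a human to review. A sketch, with an illustrative sentence splitter:

```python
import re

# Flag sentences that concede a point with "may be the case"; deciding
# whether the concession is then addressed is left to the human reader.
MAY_BE_THE_CASE = re.compile(r"\bmay be the case\b", re.IGNORECASE)

def concessions(text):
    """Return each sentence containing the hedge, for manual review."""
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if MAY_BE_THE_CASE.search(s)]
```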
I had the notion a while ago to try to write a linter that goes beyond code correctness, automatically detecting desired features in all sorts of objects. Kudos on actually doing it, and in a not hare-brained fashion.
As a former Natural Language Processing researcher, I can say the technology definitely exists. A general vocabulary combined with many (semi-manually generated) regexes should be able to pick out argumentative or weaselly sentences with decent accuracy. It could improve over time as you feed it exemplar sentences you come across.
Do you have a recommendation for a good language-agnostic text / reference resource on NLP?
ETA: my own background is a professional programmer with a reasonable (undergrad) background in statistics. I’ve dabbled with machine learning (I’m in the process of developing this as a skill set) and messed around with python’s nltk. I’d like a broader conceptual overview of NLP.
I’d recommend this book for a general overview: http://nlp.stanford.edu/fsnlp/ (Manning & Schütze, Foundations of Statistical Natural Language Processing)
However, full parsing is unnecessary for many tasks. A simple classifier on a sparse vector of word counts can be quite an effective starting point for classifying sentence or document content.
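The word-count approach can be sketched without any NLP library at all. Below is a minimal multinomial Naive Bayes over bag-of-words Counters; the training sentences and labels are invented for illustration, and real work would reach for something like scikit-learn instead:

```python
import math
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase whitespace split."""
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes on sparse word-count vectors (Counters)."""

    def __init__(self):
        self.class_word_counts = {}     # label -> Counter of word counts
        self.class_doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        counts = self.class_word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)
        self.class_doc_counts[label] += 1

    def classify(self, text):
        total_docs = sum(self.class_doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.class_word_counts.items():
            # log prior + Laplace-smoothed log likelihood of each word
            score = math.log(self.class_doc_counts[label] / total_docs)
            total = sum(counts.values())
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("this always tends to be obviously true", "weasel")
nb.train("it is always the case trust me", "weasel")
nb.train("the measured error was 0.03 plus or minus 0.01", "ok")
nb.train("we ran the experiment three times", "ok")
```

With a handful of labeled exemplar sentences like these, the classifier already separates the two styles, which is the "improve over time as you feed it exemplars" loop mentioned above.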