A friendly math post! I gave up after reading “...within-cluster sum of squared differences...” :-)
It is easy for a math-literate person to over-estimate how obvious certain jargon is to people. Like ‘sum of squared differences’, for example. Squared differences are just what is involved when you are calculating things like standard deviation. It’s what you use when looking at, say, a group of people and deciding whether they all have about the same height or if some are really tall but others are really short. How different they are.
For those who have never had to manually calculate the standard deviation and similar statistics the term would just be meaningless. (Which makes your example a good demonstration of your point!)
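To make the height example concrete, here is a minimal Python sketch of a sum of squared differences and its link to the standard deviation. The height values and the function name `sum_of_squared_differences` are made up for illustration; nothing here comes from the original post.

```python
from statistics import mean, pstdev

def sum_of_squared_differences(values):
    """Sum of squared differences between each value and the group mean."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)

# Hypothetical heights in centimetres, invented for this example.
similar_heights = [170, 171, 169, 170]   # everyone is about the same height
mixed_heights = [150, 190, 155, 185]     # some short, some tall

ssd_similar = sum_of_squared_differences(similar_heights)
ssd_mixed = sum_of_squared_differences(mixed_heights)

# The population standard deviation is sqrt(SSD / n), so a larger SSD
# for the same group size means a more spread-out group.
print(ssd_similar, pstdev(similar_heights))
print(ssd_mixed, pstdev(mixed_heights))
```

Both groups have the same average height, but the second has a far larger sum of squared differences, which is the sense in which its members are "more different" from one another.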
Never mind that; just parse the damn phrase! All you need to know is what a “difference” is, and what “to square” means.
Why, I wonder, do people assume that words lose their individual meanings when combined, so that something like “squared differences” registers as “[unknown vocabulary item]” rather than “differences that have been squared”?
Because quite often sophisticated people will punish you socially if you don’t take special care to pay homage to whatever extra meaning the combined phrase has taken on. Caution in such cases is a practical social move.
Good observation; I had been subliminally aware of it but nobody had ever pointed it out to me explicitly.
It’s also very helpful to know things like why someone might go around squaring differences and then summing them, and what kinds of situations that makes sense in. That way you can tell when you make errors of interpretation. For example, “differences pertaining to the squared” is a plausible but less likely interpretation of “squared differences”, but knowing that people commonly square differences and then sum them in order to calculate an L₂ norm, often because they are going to take the derivative of the result so as to solve for a local minimum, makes that a much less plausible interpretation.
And for a Bayesian to be rational in the colloquial sense, they must always remember to assign some substantial probability weight to “other”. For example, you can’t simply assume that words like “sum” and “differences” are being used with one of the meanings you’re familiar with; you must remember that there’s always the possibility that you’re encountering a new sense of the word.
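The remark above about squaring, summing, and then taking a derivative can be written out as a short worked equation. The symbols $x_1, \dots, x_n$ (the data points) and $c$ (a candidate centre) are introduced here purely for illustration:

```latex
S(c) = \sum_{i=1}^{n} (x_i - c)^2
\qquad\Longrightarrow\qquad
\frac{dS}{dc} = -2 \sum_{i=1}^{n} (x_i - c) = 0
\qquad\Longrightarrow\qquad
c = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Since $\frac{d^2 S}{dc^2} = 2n > 0$, this stationary point is a minimum: the mean is the single value that minimizes the sum of squared differences to the data.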
Really? I think I would have understood that sentence before the first time I tried to calculate a standard deviation manually. In general, there are many ways to arrive at an understanding of a concept. I’m very skeptical of statements of the form “you can’t understand X without doing Y first.”
I was being polite.
What do you mean? Are you saying that everyone with an average IQ is supposed to be able to understand what it means to minimize the within-cluster sum of squared differences, regardless of education? I don’t know what a standard deviation is either. I am able to read Wikipedia, understand what to do and use it. I know what squared means and I know what differences means. I just expected the sentence to mean more than the sum of its parts. Also I do not call the ability to use tools comprehension. What I value is to know when to use a particular tool, how to use it effectively and how it works.
You could teach stone-age people to drive a car. It would still seem like magic to them. Yet if you cloned them and exposed them to the right circumstances they might actually understand the internal combustion engine once grown up. Same IQ. In the same way, the server WolframAlpha is running on possesses a certain potential, yet what enables that potential is the five million lines of Mathematica.
I’d be really surprised if one was able to understand the sentence the first time with a self-taught 1-year educational background in mathematics. That doesn’t mean that there aren’t exceptions; I’m not a prodigy.
I think you’re right. “Sum of squared differences” makes sense as a normal thing to do with data points only if you’ve learned that it’s a measure of how spread apart they are, that it’s equivalent (up to dividing by the number of points) to the variance, and that making the variance small is a good way to ensure that a cluster is “well clumped.” There is a certain amount of intuition that’s built up from experience.
I also want to stress the point that I’m a bit biased(?) when it comes to understanding concepts. Surely I could accept any mathematical method or algorithm at face value. After all I’m also able to use WolframAlpha. But I feel that doesn’t count. At least I do not value such understanding. If you taught a prehistoric man to press some buttons he would be able to control a nuclear facility.
Many people are bothered by the counter-intuitive nature of probability. I have never been more confused by probability than by any other branch of mathematics. I believe that people regard probability as more difficult to understand because they learn about it much later than about other mathematical concepts. For me that is very different because it is all new to me. For me P(Y) ≥ P(X∧(X->Y)) is as intuitive as (actually more intuitive than) a^2 + b^2 = c^2. The first makes sense in and of itself; the second needs context and proof (at least regarding my gut feeling). I just don’t see how 2 + 2 = 4 is more obvious than Bayes’ theorem. You just learnt to accept that 2 + 2 = 4 because 1.) you encounter the problem very often, 2.) you can easily verify its solution, and 3.) you learn about it early on. But it is not self-evident.
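For readers who want to see why the inequality P(Y) ≥ P(X∧(X->Y)) holds, here is a brief derivation. It assumes that "->" is read as material implication, which is an assumption on my part and not something stated in the comment:

```latex
X \land (X \to Y) \;\equiv\; X \land (\lnot X \lor Y) \;\equiv\; (X \land \lnot X) \lor (X \land Y) \;\equiv\; X \land Y

P(Y) \;=\; P(X \land Y) + P(\lnot X \land Y) \;\ge\; P(X \land Y) \;=\; P\bigl(X \land (X \to Y)\bigr)
```

The second line uses only the fact that probabilities are non-negative, so $P(\lnot X \land Y) \ge 0$.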
This is something people have noticed and it influences their responses. Aggressive “not understanding” is often considered a sign of bad faith, for good reason.
What I noticed is that everyone seems to assume that my problem understanding the sentence “...within-cluster sum of squared differences...” concerned “sum of squared differences” and not “within-cluster”. I don’t know the definition of the concept of a mathematical cluster. What might add to the confusion is that I’m not even sure about the meaning of the English word “cluster”. After that I decided to postpone reading the post. I could make the effort to look everything up, of course, but thought it would be more effective to read it in the future.
Your post simply served as an example of how difficult it can be to read Less Wrong without a lot of background knowledge.
Not really. I actually wrote a basic explanation of the whole sentence concept by concept but trimmed it down to the part that best illustrated dependence on mathematical background. Saying “within cluster is basically a phrase in English that refers to the same thing that’s in the title of the post” wouldn’t have helped convey the point. :P
It does, however, illustrate a different point. There is a trait related not just to intelligence but also to openness to information and flexible thinking that makes some people more suited than others to picking up and following new topics and ideas based on what they already know and filling in the blanks with their best inference. Confidence is part of it but part of it is social competition strategy embodied at the cognitive level.
There isn’t an explicit mathematical concept of a cluster.
Here’s what K-means does. Say, K is 3.
You try all the possible ways to partition your data points into three groups. You pick the partition that minimizes the sum of squared differences within each group.
Then you iterate the procedure.
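The description above can be sketched literally in Python for a tiny data set: try every way of splitting the points into K groups and keep the split with the smallest within-cluster sum of squared differences. This is only an illustration under stated assumptions: the points are one-dimensional and invented for the example, the function names are hypothetical, and exhaustive search is feasible only for a handful of points. Practical K-means implementations typically use an iterative heuristic (Lloyd's algorithm) that alternates between assigning points to the nearest centre and recomputing the centres, and does not enumerate partitions.

```python
from itertools import product

def within_cluster_ss(groups):
    """Total of squared differences between each point and its group's mean."""
    total = 0.0
    for group in groups:
        centre = sum(group) / len(group)
        total += sum((x - centre) ** 2 for x in group)
    return total

def best_partition(points, k):
    """Exhaustively search all labelings of the points into k non-empty groups."""
    best_groups, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue  # skip labelings that leave some group empty
        groups = [[p for p, lab in zip(points, labels) if lab == j]
                  for j in range(k)]
        cost = within_cluster_ss(groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups, best_cost

# Hypothetical 1-D data with three visually obvious clumps; K is 3.
points = [1.0, 2.0, 10.0, 11.0, 20.0, 21.0]
groups, cost = best_partition(points, 3)
print(groups, cost)
```

With six points and K = 3 there are only 3^6 = 729 labelings to check, but the count grows exponentially with the number of points, which is why real implementations settle for an iterative approximation.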
No, approximately the opposite of that. Are you sure you didn’t intend this to be a reply to Peter? It seems to be quite an odd reply to me in the context.
You said that you were being polite in what you previously wrote. I parsed that to mean that you agree with Peter de Blanc but chose to communicate this in a way that lets the reader arrive at the conclusion without your stating it. In other words, I should have been able to understand the sentence.
I didn’t reply to Peter de Blanc because I don’t know him and he doesn’t know me and so his statement that he would have understood Y without X doesn’t give me much information regarding my own intelligence. But you have actually read a lot of my comments and addressed me directly in the discussion above.
Interestingly, I’m having a discussion (see my previous comments) with Roko about whether one should tell people directly that they are dumb or try to communicate such a truth differently.
Not polite enough to lie, but polite enough to leave off all the caveats and exceptions. Some here could understand the sentence even with no education in mathematics. Even so, the essentials of what I said were sincere. Piecing together that kind of jargon from the scraps of information available in the context is a far harder task than just understanding the article itself.