The basic problem with the Google hit count reported in search results, particularly for phrases and searches using “AND” or “OR” operators, is that it is an estimate. It’s not actually a count of anything at all. It’s the result of a calculation based solely upon the words that the query comprises, as Kevin Marks notes. Google explicitly states that it’s an estimate, although it is coy about what that estimate is actually based upon. To quote one unnamed Google employee, “these are all estimates, and we just haven’t tried that hard to make the estimates precise”. A named Google employee said much the same after this Frequently Given Answer had been around for some years.
For example: When Google Web reports 17,200 results for the string “de Boyne Pollard” (as it does at the time of writing this Frequently Given Answer), it hasn’t searched its entire database to count all of the pages that match that string. That would be very inefficient, considering that it only needs to find (by default) 10 matches in its database in order to return a result page, and that many people don’t go beyond the first few pages (or even the first page) of results. What it has done is taken the individual words “de”, “Boyne”, and “Pollard”, and, using the word frequency tables that the Google Web spider generates when it crawls the World Wide Web, produced, from the frequencies with which those three particular individual words occur, an estimate of the number of pages that probably would match.
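As a rough illustration of the kind of calculation described above, here is a minimal sketch in Python. It assumes the words occur independently of one another, which is a simplification chosen for the example and not a documented Google method. The function name, the index size, and the per-word frequencies are all made up.

```python
from math import prod


def estimate_hits(doc_freqs, total_docs):
    """Estimate how many documents contain every word, assuming independence.

    doc_freqs: number of documents containing each word (hypothetical values).
    total_docs: number of documents in the index (hypothetical value).
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    # Probability that a randomly chosen document contains each word.
    probabilities = [freq / total_docs for freq in doc_freqs]
    # Under the independence assumption, the joint probability is the product.
    return round(total_docs * prod(probabilities))


# Made-up numbers: a 1,000,000-page index and three per-word frequencies.
TOTAL_DOCS = 1_000_000
FREQS = {"de": 500_000, "Boyne": 2_000, "Pollard": 10_000}
ESTIMATE = estimate_hits(FREQS.values(), TOTAL_DOCS)
print(ESTIMATE)
```

With these made-up inputs the estimate comes out at 10 pages, and it would stay at 10 however many pages really contain the phrase, because no page is ever inspected.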
To demonstrate for yourself that these estimates are meaningless numbers, take a few searches and click on the “Next” button to bring up further pages of results until you reach the last page. You’ll see that the actual number of results, known once you reach the last page, will almost always be nothing like the estimated number of results that appeared on all of the prior pages.
Even the actual page count isn’t necessarily correct. In part this is because Google caps all queries at 1000 results, and in part it is because of several other problems with the Google hit count, both estimated and final.
But you reach the last page of results at page 55, which suggests there are around 550 users.
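To make that arithmetic explicit, here is a small sketch in Python. It assumes the default of 10 results per page and the 1000-result cap mentioned in the quoted article; the function name is made up for illustration.

```python
def shown_result_range(last_page, per_page=10, cap=1000):
    """Return (low, high, capped) for the number of results actually shown.

    last_page: number of the final results page reached by clicking "Next".
    per_page: results per page (10 is the default mentioned in the quote).
    cap: maximum results served per query (1000 according to the quote).
    """
    if last_page < 1:
        raise ValueError("last_page must be at least 1")
    # Fewest results: every earlier page is full and the last page holds one.
    low = min((last_page - 1) * per_page + 1, cap)
    # Most results: the last page is full too, but never more than the cap.
    high = min(last_page * per_page, cap)
    # Once the cap is reached, the shown count is only a lower bound.
    capped = last_page * per_page >= cap
    return low, high, capped


LOW, HIGH, CAPPED = shown_result_range(55)
print(LOW, HIGH, CAPPED)
```

For page 55 this gives between 541 and 550 shown results, which is below the 1000-result cap, so that cap by itself would not account for the list ending there.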
I love how you guys explore every aspect of a thing. (: That may be a limit that Google has to either save on resources or to prevent rival search engines from downloading their whole database (or, a limit put into place BECAUSE rival search engines were sucking down their database, and it was taking up a lot of resources). I’ve seen other companies figure out who their greediest users are and, upon realizing that the population that takes up the most resources brings the least return on investment, put limits on them. That’s what this looks like to me.
The “9000 results” is probably not a very accurate estimate—from “Google result counts are a meaningless metric”:
(The linked page has more sources for this)
Thanks! I’ve always wondered where those numbers came from, but never taken the time to find out.
If Google didn’t search its entire database, this supports my theory that there are probably “over 9,000 members”; I did clearly say that was on the low side. If Google totals only SOME of the results (until it’s clear that the user wants more results, or up to its limit for resource conservation), this also supports my assertion.
Search Term Interpretation:
As for the issues with word interpretation: I knew about that, so I restricted my search to a specific URL, not text within pages. The entire purpose of Google’s “site:” operator is to restrict the query to a particular website, not to treat those words as it would a text search. IF it’s breaking the URL up into separate words and checking what it’s got for those, then firstly, that would fail to restrict the search to a specific site and therefore make that functionality bugged, and secondly, even if it did that only for the counter, the word “user” would certainly return way more than 9,000 results. The term “user” gets 8 billion hits and “lesswrong” gets 51,700; if it were totaling site: searches that way, it would report billions of results, and it didn’t. So, assuming it’s not bugged, misinterpretation of the “site:lesswrong.com/user” term is N/A. And since every single user page contains the words “comments” and “submitted”, if it had broken my exact phrase exclusions into parts, I’d have gotten zero results. See for yourself by trying:
“site:lesswrong.com/user” -comments
It was not by accident that I used the query that I did.
Is my point unsupported?
IF I were trying to support some sort of important point with this user total, I would agree with the link that it is not scientific evidence and quit using it to support points. But that is N/A here: if you look closely, you’ll see that I am not using this to convince anybody of anything. My entire purpose was to verify to myself my perception that LessWrong isn’t just someone’s personal website with their buddies on it, and that a significant number of people have actually gathered around themes like rational thought. I was overjoyed when I discovered this and wanted to share. Maybe this post will get the attention of someone who has the ability to issue a count command to the database. That’s the only way we can know for sure. Though, of course, the user totals will change over time, so any count will become inaccurate quickly. Hopefully by increasing. (: