That count topped out around 8 billion pages before it was removed from the homepage. News broke recently through various SEO forums that Google had suddenly, over the previous few weeks, added another few billion pages to the index. This may sound like a cause for celebration, but this “accomplishment” would not reflect well on the search engine that achieved it.
What had the SEO community buzzing was the nature of those fresh few billion pages. They were blatant spam, containing Pay-Per-Click (PPC) ads and scraped content, and in many cases they were showing up well in the search results, pushing out far older, more established sites in doing so. A Google representative responded to the problem via forums by calling it a “bad data push,” something that met with assorted groans throughout the SEO community.
How did someone manage to dupe Google into indexing so many pages of spam in such a short period of time? I will offer a high-level overview of the process, but don’t get overly excited. Just as a diagram of a nuclear explosive is not going to teach you how to build the real thing, you are not likely to be able to run off and do it yourself after reading this article. Yet it makes for an interesting tale, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world’s most popular search engine.
A Dark and Stormy Night
Our story starts deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the Turks… His idea was to exploit the way Google handled subdomains, and not just a little bit, but in a big way.
The heart of the matter is that, currently, Google treats subdomains much the same way it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at some point later to perform a “deep crawl.” A deep crawl is simply the spider following links from the domain’s homepage deeper into the site until it finds everything, or gives up and comes back later for more.
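The deep crawl described above can be pictured as a breadth-first walk over a site’s links with a page budget. This is only a toy sketch on an in-memory link graph; the names `deep_crawl`, `TOY_SITE`, and `page_budget` are made up for illustration and say nothing about how GoogleBot is actually implemented.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
TOY_SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": [],
}


def deep_crawl(link_graph, homepage, page_budget):
    """Follow links outward from the homepage, breadth first.

    Stops when every reachable page has been visited or when the
    page budget is spent (the "gives up and comes back later" case).
    Returns (visited pages in order, pages still waiting in the queue).
    """
    seen = {homepage}
    queue = deque([homepage])
    visited = []
    while queue and len(visited) < page_budget:
        page = queue.popleft()
        visited.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited, list(queue)
```

With a generous budget the crawl finds the whole toy site; with a budget of 3 it stops early and leaves the two blog posts for a later visit.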
Briefly, a subdomain is a “third-level domain.” You’ve probably seen them before; they look something like this: subdomain.domain.com. Wikipedia, for instance, uses them for languages: the English version is “en.wikipedia.org” and the Dutch version is “nl.wikipedia.org.” Subdomains are one way to organize large sites, as opposed to multiple directories or even separate domain names altogether.
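To make the naming concrete, here is a minimal sketch that splits a hostname into its subdomain and parent domain, then groups hostnames by parent. It assumes, naively, that the parent domain is always the last two labels; that is wrong for suffixes such as “co.uk”, where a real tool would consult the Public Suffix List. The function names are hypothetical.

```python
from collections import defaultdict


def split_host(hostname):
    """Split a hostname into (subdomain, parent domain).

    Naive assumption: the parent domain is the last two labels.
    """
    labels = hostname.lower().rstrip(".").split(".")
    parent = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])  # empty string when there is none
    return subdomain, parent


def subdomains_by_parent(hostnames):
    """Group the distinct subdomains seen under each parent domain."""
    groups = defaultdict(set)
    for host in hostnames:
        subdomain, parent = split_host(host)
        if subdomain:
            groups[parent].add(subdomain)
    return dict(groups)
```

For example, `split_host("en.wikipedia.org")` gives `("en", "wikipedia.org")`. A parent domain whose set of subdomains grows into the millions would stand out immediately in a grouping like this.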
So, we have a kind of page Google will index virtually “no questions asked.” It is a wonder no one exploited this situation sooner. Some commentators believe the reason may be that this “quirk” was introduced with the recent “Big Daddy” update. Our Eastern European friend got together some servers, content scrapers, spambots, PPC accounts, and some all-important, very inspired scripts, and mixed them all together thusly…
Five Billion Served- And Counting…
To begin with, our hero crafted scripts for his servers that would, when GoogleBot dropped by, start generating an essentially endless number of subdomains, each with a single page containing keyword-rich scraped content, keyworded links, and PPC ads for those keywords. Spambots are then sent out to put GoogleBot on the scent via referral and comment spam on tens of thousands of blogs around the world. The spambots provide the broad setup, and it doesn’t take much to get the dominos to fall.
GoogleBot finds the spammed links and, as is its purpose in life, follows them into the network. Once GoogleBot is inside, the scripts running the servers simply keep generating pages, page after page, each with a unique subdomain, each with keywords, scraped content, and PPC ads. These pages get indexed, and suddenly you’ve got yourself a Google index 3-5 billion pages heavier in under 3 months.
Reports indicate that, at first, the PPC ads on these pages were from AdSense, Google’s own PPC service. The ultimate irony, then, is that Google benefits financially from all of the impressions being charged to AdSense advertisers as their ads appear across these billions of junk pages. The AdSense revenue from this endeavor was the point, after all: cram in so many pages that, by sheer force of numbers, people would find them and click on the ads, making the spammer a nice profit in a very short amount of time.
What is Broken?
Word of the achievement spread like wildfire from the DigitalPoint forums. The “general public” is, as of yet, out of the loop, and will likely remain so. A response by a Google engineer appeared on a Threadwatch thread about the subject, calling it a “bad data push”. Basically, the company line was that they have not, in fact, added 5 billion pages. Later claims include assurances that the problem will be fixed algorithmically. Those following the situation (by tracking the known domains the spammer was using) see only that Google is removing them from the index manually.
The tracking is done with the “site:” command, which, in theory, displays the total number of indexed pages from the site you specify after the colon. Google has already admitted there are problems with this command, and “5 billion pages”, they appear to be claiming, is merely another symptom of them. These problems extend beyond the site: command to the displayed number of results for many queries, which some believe is highly inaccurate and in some cases fluctuates wildly. Google admits it has indexed some of these spammy subdomains, but so far has not provided any alternate numbers to dispute the 3-5 billion shown initially through the site: command.
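For readers who have not used it, a “site:” query is just an ordinary search whose text starts with `site:` followed by a hostname. The sketch below only builds the query string and a search URL with Python’s standard library; it fetches nothing, and the URL layout (a `q` parameter on `https://www.google.com/search`) is an assumption for illustration, as is the example domain.

```python
from urllib.parse import urlencode


def site_query(hostname):
    """Build the text of a site: query for a hostname."""
    return "site:" + hostname.strip().lower()


def site_query_url(hostname, base="https://www.google.com/search"):
    """Build a search URL for the site: query (assumed URL layout)."""
    return base + "?" + urlencode({"q": site_query(hostname)})
```

Calling `site_query("example.com")` yields `site:example.com`. The result count shown for such a query is the number the trackers were watching, with all the accuracy caveats noted above.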
Over the past week, the number of spammy domains and subdomains indexed has steadily dwindled as Google personnel remove the listings manually. There has been no official statement that the “loophole” is closed. This poses the obvious problem that, since the way has been shown, there will be a number of copycats rushing to cash in before the algorithm is changed to deal with it.