Saturday, 20 June 2009

Breaking through Google's 1000 result limit

Sometimes we want to get a huge list of URLs from a search engine. For example, you might want to find all the pages linking back to you. However, Google won't return anything past 1000 results. See this search for "south": we're on page 99, with 10 results per page, and if you scroll to the bottom, you'll see that the page numbers run out. So how can we get more results?

One technique for measuring the properties of words and phrases in text is n-gram analysis. This counts the number of single-word (unigram), two-word (bigram), three-word (trigram) and longer phrases, up to n words, in a text.

E.g.: given the phrase "The cat sat on the mat", we have the following unigrams:
  • the - 2
  • cat - 1
  • sat - 1
  • on - 1
  • mat - 1
And the following bigrams:
  • The cat - 1
  • cat sat - 1
  • sat on - 1
  • on the - 1
  • the mat - 1
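
If you'd like to reproduce those counts yourself, here's a minimal Python sketch. The function name ngram_counts is my own, and it assumes that splitting on whitespace and lowercasing is good enough; real text would also need punctuation stripped.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every n-word phrase in text, ignoring case."""
    words = text.lower().split()
    # Slide a window of n words across the text, one word at a time.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


phrase = "The cat sat on the mat"
unigrams = ngram_counts(phrase, 1)
bigrams = ngram_counts(phrase, 2)

print(unigrams.most_common(1))  # [('the', 2)]
print(sorted(bigrams))          # the five bigrams, each seen once
```

Run over a large body of text instead of one sentence, the same counts give a ranked list of common words.
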
So how does this help us? Well, n-gram counts over large amounts of text tell us which words we're most likely to find. Once we ignore stopwords (the search engines will ignore them too), we get terms that we can add to a search query to split up the results. If we know that "fish" and "knee" are common words, we could run two queries:
  • knee
  • fish
This would return 2000 links. Of course, some of these pages will contain both "fish" and "knee", so there'll be some overlap, but we'll still get, say, 1700-1900 useful unique sites. Once we have a good list of terms to exploit, we can take the top 1000 results for our query combined with each of 40 different n-grams to get a good 25000-35000 results, way past the 1000 limit usually imposed.
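
To make the term-picking and the overlap a bit more concrete, here's a rough Python sketch. Everything in it is illustrative: the counts, the tiny stopword set, the example.com URLs and the helper names (pick_split_terms, build_queries, merge_unique) are made up for the example, not taken from any real data or library.

```python
def pick_split_terms(counts, stopwords, how_many):
    """Return the most frequent terms that are not stopwords."""
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    terms = [term for term, _ in ranked if term not in stopwords]
    return terms[:how_many]


def build_queries(base_query, terms):
    """Append each splitting term to the base query."""
    return [f"{base_query} {term}".strip() for term in terms]


def merge_unique(result_lists):
    """Merge several result lists, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up unigram counts and stopwords, purely for illustration.
counts = {"the": 900, "of": 700, "and": 650, "fish": 40, "knee": 35}
stopwords = {"the", "of", "and"}

terms = pick_split_terms(counts, stopwords, 2)
queries = build_queries("south", terms)

# Two pretend result lists that share one URL.
fish_results = ["http://example.com/a", "http://example.com/b"]
knee_results = ["http://example.com/b", "http://example.com/c"]
unique_urls = merge_unique([fish_results, knee_results])
```

Here the two pretend queries return four links between them but only three unique ones, which is the same overlap effect described above on a much smaller scale.
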

Implementing something like this would probably look like:

URL table - with unique URL field

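query = ""  # your base search query goes here
for ngram in ngrams
    # 100 pages of 10 results each = the 1000 results Google allows
    for page = 1, page <= 100, page ++
        offset = (page - 1) * 10
        results = getGoogleResults(query + " " + ngram, offset)
        for result in results
            # the unique URL field makes "insert ignore" skip duplicates
            sql("insert ignore into URL values(?)", result)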
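
For anyone who'd rather see that as something runnable, here's one possible Python version using SQLite. It's a sketch under some loud assumptions: fetch_results is a hypothetical function you'd have to write yourself (it is not a real Google API), fake_fetch is only a stand-in so the loop can be tried offline, and SQLite's "INSERT OR IGNORE" is standing in for "insert ignore".

```python
import sqlite3

RESULTS_PER_PAGE = 10
MAX_RESULTS = 1000  # the limit discussed in this post


def harvest(conn, fetch_results, base_query, ngrams):
    """Run one query per n-gram and store every unique URL found.

    fetch_results(query, offset) is a caller-supplied (hypothetical)
    function returning a list of URLs for that page of results.
    """
    # PRIMARY KEY gives us the unique URL field.
    conn.execute("CREATE TABLE IF NOT EXISTS url (url TEXT PRIMARY KEY)")
    for ngram in ngrams:
        query = f"{base_query} {ngram}".strip()
        for offset in range(0, MAX_RESULTS, RESULTS_PER_PAGE):
            results = fetch_results(query, offset)
            if not results:
                break  # no more pages for this query
            # Duplicates are silently skipped thanks to the primary key.
            conn.executemany(
                "INSERT OR IGNORE INTO url VALUES (?)",
                [(url,) for url in results],
            )
    conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM url").fetchone()
    return count


def fake_fetch(query, offset):
    """Offline stand-in: three pages per query, plus one shared URL."""
    if offset >= 30:
        return []
    term = query.split()[-1]
    page = [f"http://example.com/{term}/{offset + i}"
            for i in range(RESULTS_PER_PAGE)]
    return page + ["http://example.com/shared"]


conn = sqlite3.connect(":memory:")
total = harvest(conn, fake_fetch, "south", ["fish", "knee"])
print(total)  # 61: 30 URLs per term, plus the one shared URL
```

Passing the fetcher in as a parameter keeps the storage loop separate from however you end up getting results, so you can swap in a real fetcher without touching the rest.
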

Of course, this is massively open to optimisation; post in the comments if you have any questions.

To help you out, I've included a list of over 700 of the most common English unigrams, derived from a good web-based source. If you're interested in versions in other languages, or a longer list, let me know why and I'll see what I can do. Here's the link:
