Sunday, 19 July 2009

How to view HTTP headers

HTTP headers are the part of a webpage you don't (usually) see in your browser: the special data describing the page to your browser software. HTTP headers are where you'd put redirects, information about whether a file is a PNG / HTML page / RAR archive, and where you say whether the browser should display the file or present it as a download - as well as many other things. The full details are in RFC 2616.

I'm going to cover three methods of dealing with headers today - all quick, simple and powerful.

Perl's LWP "HEAD" command

This is a commonly-found *nix command line tool, very simple in its operation, and likely already on your system. To use it, you simply enter "HEAD" at the command prompt, followed by the full address (including the http: prefix) that you want to check.
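For example (example.com here is just a stand-in for whichever URL you want to check):

HEAD http://www.example.com/

This prints the response status line followed by the headers, one per line.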

If you don't have it, you can install it as root by whipping up a CPAN console (perl -MCPAN -e shell) and running install LWP::Simple - then just follow the prompts, and opt to install the GET/HEAD aliases.

Quick and simple - but it won't report on redirects, just the final page, and you need root to install it in most circumstances.

Tamper Data

You can use Firefox to examine headers, alter HTTP requests, and find out precisely what every page is doing with this masterpiece of a plugin. If you're using Live HTTP Headers, I suggest you exchange it for Tamper Data immediately - just as light, and much more powerful. To use it, enable Tamper Data in Firefox, click "Start tampering" in the new window, and then visit the page you're interested in; you don't want to go playing with the server just yet, so simply accept the first request and ignore the rest. Ta-da - more information than you'll ever need, including full request and response headers for everything on the page! This is also great for finding out FLV URLs and other things hidden by Flash apps.

Tamper Data has many additional functions, including tools for page load optimisation - far too much to cover here. Just check out this tutorial for a taster.

Command-line cURL header script

For a quick, minimal, to-the-point solution with verbose output, create a file called header somewhere on your *nix server, and fill it as follows (amending the PHP executable path if necessary):

#!/usr/bin/env php
<?php
// Print the URL's response headers, including each redirect hop along the way.
$url = $argv[1];

function url_header($url) {
    global $useragent;
    global $timeout;
    if ($useragent == "") { $useragent = "Mozilla 8.0 +http://seorant.blogspot.com"; }
    if ($timeout == "") { $timeout = 20; }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // include the headers in the output
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // send a HEAD request - don't fetch the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the result rather than printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects, keeping each set of headers
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040614 Firefox/0.8';

echo $url . "\n\n";
flush();
echo url_header($url);



Make the file executable (chmod u+x header) and run it with ./header followed by the URL you want to check, e.g. ./header http://www.example.com/

This will include the full details of redirects as and when they're performed, and should get a response more akin to the one a browser receives than the LWP method does, since the two use different useragent strings.

Saturday, 20 June 2009

Breaking through Google's 1000 result limit

Sometimes we want to get a huge list of URLs from a search engine. For example, you might want to find all the pages linking back to you. However, Google won't return anything past 1000 results; see this search for "south"; we're on page 99, with 10 results per page - and if you scroll to the bottom, you'll see that page numbers run out. So - how can we get more results?


One technique for measuring the properties of words and phrases in text is n-gram analysis. This counts the occurrences of single-word (unigram), two-word (bigram), three-word, and so on up to n-word phrases in a text; there's a small counting sketch after the example lists below.

E.g.: given the phrase "The cat sat on the mat", we have the following unigrams:
  • The - 2
  • Cat - 1
  • Sat - 1
  • On - 1
  • Mat - 1
And the following bigrams:
  • The cat - 1
  • cat sat - 1
  • sat on - 1
  • on the - 1
  • the mat - 1
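If you'd like to compute these counts yourself, here's a minimal PHP sketch (the function name and splitting rule are my own choices - it lowercases the text and splits on non-word characters, which is a simplification):

<?php
// Count the n-word phrases in $text and return them most-frequent-first.
function ngram_counts($text, $n) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        $counts[$gram] = isset($counts[$gram]) ? $counts[$gram] + 1 : 1;
    }
    arsort($counts);
    return $counts;
}

print_r(ngram_counts("The cat sat on the mat", 1)); // unigrams: the => 2, everything else => 1
print_r(ngram_counts("The cat sat on the mat", 2)); // bigrams: each of the five pairs => 1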
So how does this help us? Well, n-gram counts over large amounts of text tell us which words are most common. Once we ignore stopwords (the search engines will), we get terms that we can use as part of a search query to split up the results. If we know that "fish" and "knee" are common words, we could run two queries:
  • link:mysite.com knee
  • link:mysite.com fish
This would return 2000 links to mysite.com. Of course, some of these pages will have both "fish" and "knee" on them, so there'll be some overlap, but we'll still get say 1700-1900 useful unique sites. Once we have a good list of n-grams to exploit, we can take the top 1000 results for our query combined with each of 40 different n-grams and end up with a good 25000-35000 results - way past the 1000 limit usually imposed.

Implementing something like this would probably look like:

URL table - with unique URL field
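A minimal definition for that table might look like this (MySQL syntax; the 255-character limit is just an assumption - widen it if you expect longer URLs):

CREATE TABLE URL (
    url VARCHAR(255) NOT NULL,
    UNIQUE KEY (url)
);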


query = "link:competitor.com"
for ngram in ngrams:
    for page = 1, page <= 100, page++:    # 100 pages of 10 results = the top 1000
        offset = (page - 1) * 10
        results = getGoogleResults(query + " " + ngram, offset)
        for result in results:
            sql("insert ignore into URL values(?)", result)

Of course, this is massively open to optimisation; post in the comments if you have any questions.

To help you out, I've included a list of over 700 of the most common English unigrams, derived from a good web-based source. If you're interested in versions in other languages, or a longer list, let me know why and I'll see what I can do. Here's the link:

 