Friday, 29 June 2007

Teoma / Ask scraping code

Alas, Teoma's search API is down. If it ever returns, you can find great Teoma Search API documentation. In the meantime, here's code to do the scraping for you:


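// Fetches result URLs from Teoma (Ask) for the query $query.
// string $query     is the search query
// int    $querysize is the number of results needed (one page at most)
// int    $offset    says where the results should begin from
//                   (10 through 19 all select results 11-20)
// Returns an array of URLs; the array is empty if the request fails.

function fetchTeomaResults($query, $querysize, $offset) {

    // Assumes 10 results per page, with pages numbered from 1.
    $page = 1 + intval($offset / 10);
    $requestUrl = 'http://www.ask.com/web?q='.urlencode($query).'&page='.$page;

    // Swap in our user agent for this request, then restore the old one.
    $oldua = ini_set('user_agent', 'Please bring back http://xml.teoma.com/.');
    $response = file_get_contents($requestUrl);
    ini_set('user_agent', $oldua);

    // file_get_contents returns false when the request fails.
    if ($response === false) {
        return array();
    }

    // Result links look like <a id="r1_t" href="...">.
    preg_match_all('|<a id="r[0-9]+_t" href="(.+?)"|', $response, $matches);

    return array_slice($matches[1], 0, $querysize);
}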
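If you want to poke at the extraction step without hitting Ask at all, here's a rough sketch that splits the regex out into its own function. The name extractTeomaUrls and the sample markup are made up for illustration; the markup is shaped only to match the pattern used in fetchTeomaResults, not copied from a real results page. It also decodes HTML entities in the href values, which I'm assuming is wanted since attribute values are normally escaped in page source.

```php
<?php
// Hypothetical helper: pulls result URLs out of an already-downloaded page.
function extractTeomaUrls($html) {
    // Same pattern as fetchTeomaResults: result links carry ids like r1_t, r2_t, ...
    preg_match_all('|<a id="r[0-9]+_t" href="(.+?)"|', $html, $matches);
    // Decode &amp; and friends so the caller gets usable URLs.
    return array_map('html_entity_decode', $matches[1]);
}

// Made-up sample markup, shaped only to match the pattern above.
$sample = '<a id="r1_t" href="http://example.com/?a=1&amp;b=2">One</a>'
        . '<a id="nav" href="http://www.ask.com/about">About</a>'
        . '<a id="r2_t" href="http://example.org/">Two</a>';

$urls = extractTeomaUrls($sample);
print_r($urls);
```

The navigation link is skipped because its id doesn't fit the r<number>_t shape, which is the only thing the scraper relies on.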



It's dirty, nasty, and many other mean things. For example,

  • anyone sane wouldn't enable the URL fopen wrappers (allow_url_fopen), which file_get_contents needs here;

  • the user agent is a little non-standard;

  • there's no HTTP From: header;

  • it's quite possibly against Ask TOS;

  • $offset should be a multiple of ten, because I can't be bothered writing preference-setting code and the number of results per page can't be set via the URL (as far as I can see);

  • scraping is never a permanent solution


- and other things. Bring back xml.teoma.com!
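Two of the complaints above (the user agent juggling and the missing From: header) could probably be handled with a stream context rather than ini_set. This is only a sketch: teomaRequestContext is a name I invented, the address is a placeholder, and the timeout is an arbitrary choice. It still needs the URL fopen wrappers, so it does nothing for the first complaint.

```php
<?php
// Hypothetical helper: builds a per-request HTTP context instead of
// changing the global user_agent ini setting.
function teomaRequestContext($userAgent, $from) {
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            // From: carries a contact address for whoever runs the scraper.
            'header'     => "From: " . $from . "\r\n",
            'timeout'    => 10, // seconds; an arbitrary choice
        ),
    ));
}

// scraper@example.com is a placeholder address.
$context = teomaRequestContext(
    'Please bring back http://xml.teoma.com/.',
    'scraper@example.com'
);

// Usage (needs network access, so it is left commented out):
// $response = file_get_contents($requestUrl, false, $context);
```

Because the settings live in the context, nothing global has to be saved and restored around the request.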


fetchTeomaResults is usually called from a wrapper function that queries search engine results pages (SERPs) in general and aggregates the results. The function signature follows that wrapper's convention; otherwise we could just pass a page number instead of an offset.
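To give an idea of what such a wrapper might look like, here's a minimal sketch. It is not the wrapper I actually use: fetchSerpResults, fakeEngineA and fakeEngineB are hypothetical names, and the stub fetchers only exist so the example runs without a network connection. The one thing it relies on is that every fetcher takes ($query, $querysize, $offset) and returns an array of URLs.

```php
<?php
// Hypothetical aggregator: calls each engine's fetcher with the shared
// ($query, $querysize, $offset) signature and merges the URLs.
function fetchSerpResults($fetchers, $query, $querysize, $offset) {
    $merged = array();
    foreach ($fetchers as $engine => $fetcher) {
        $urls = call_user_func($fetcher, $query, $querysize, $offset);
        foreach ($urls as $url) {
            // Remember which engines returned each URL.
            $merged[$url][] = $engine;
        }
    }
    return $merged;
}

// Stub fetchers stand in for real scrapers so the sketch runs offline.
function fakeEngineA($query, $querysize, $offset) {
    return array('http://example.com/', 'http://example.org/');
}
function fakeEngineB($query, $querysize, $offset) {
    return array('http://example.org/', 'http://example.net/');
}

$results = fetchSerpResults(
    array('a' => 'fakeEngineA', 'b' => 'fakeEngineB'),
    'teoma', 10, 0
);
print_r($results);
```

In real use, 'fetchTeomaResults' would go into the $fetchers array in place of one of the stubs.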

Enjoy.
