Alas, Teoma's search API is down. If it ever returns, there is great Teoma Search API documentation to go with it. In the meantime, here's some code to do the scraping for you:
/////
// fetches URLs from Teoma results for the query $query
// string $query is the search query
// int $querysize is the number of results needed
// int $offset says where the results should begin from (put 11 to get results 11-20)
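function fetchTeomaResults($query, $querysize, $offset) {
    // Ask serves ten results per page, so map the offset onto a page number
    $page = 1 + intval($offset / 10);
    $requestUrl = 'http://www.ask.com/web?q='.urlencode($query).'&page='.$page;
    // swap in our user agent for this request only, then restore the old one
    $oldua = ini_set('user_agent', 'Please bring back http://xml.teoma.com/.');
    $response = file_get_contents($requestUrl);
    ini_set('user_agent', $oldua);
    // no results if the request failed (file_get_contents returns false)
    if ($response === false) {
        return array();
    }
    // each result title is an anchor with an id like r1_t; capture its href
    preg_match_all('|<a id="r[0-9]+_t" href="(.+?)"|', $response, $matches);
    return array_slice($matches[1], 0, $querysize);
}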
It's dirty, nasty, and many other mean things. For example,
- anyone sane wouldn't enable fopen wrappers (allow_url_fopen), which file_get_contents needs in order to fetch a URL;
- the user agent is a little non-standard;
- there's no HTTP From: header;
- it's quite possibly against Ask's TOS;
- $offset should be a multiple of ten (or one more, like the 11 in the comment above), because I can't be bothered writing preference-setting code and the number of results per page can't be set via the URL (as far as I can see);
- scraping is never a permanent solution;
- and other things. Bring back xml.teoma.com!
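Two of those complaints can be softened without much code. What follows is only a rough sketch, not a tested replacement: teomaPagePlan and teomaHttpOptions are hypothetical helpers, the ten-results-per-page figure is the same assumption fetchTeomaResults makes, and the user agent string and From address are placeholders. The first helper works out which pages cover an arbitrary 1-based offset; the second builds per-request HTTP options so the user agent and a From: header travel with one request instead of going through ini_set.

```php
<?php
// Hypothetical helper: which result pages cover results
// $offset .. $offset + $querysize - 1 (1-based offset).
// Assumes ten results per page, as fetchTeomaResults does.
function teomaPagePlan(int $offset, int $querysize, int $perPage = 10): array {
    $first = max(1, $offset);                 // treat 0 or less as "start at result 1"
    $last = $first + max(1, $querysize) - 1;  // always ask for at least one result
    $firstPage = intdiv($first - 1, $perPage) + 1;
    $lastPage = intdiv($last - 1, $perPage) + 1;
    return [
        'pages' => range($firstPage, $lastPage),
        // number of leading results to drop from the first fetched page
        'skip' => ($first - 1) % $perPage,
    ];
}

// Hypothetical helper: HTTP stream context options for a single request.
// The user agent string and the From address are placeholders.
function teomaHttpOptions(string $from): array {
    return ['http' => [
        'user_agent' => 'TeomaScraper/0.1 (Please bring back http://xml.teoma.com/.)',
        'header' => "From: $from\r\n",
        'timeout' => 10,
    ]];
}

// Results 15-24 span two pages; the first four results of the first page are dropped.
$plan = teomaPagePlan(15, 10);
$context = stream_context_create(teomaHttpOptions('webmaster@example.com'));
// file_get_contents($requestUrl, false, $context) would then send both headers.
```

The caller would fetch each page listed in $plan['pages'], concatenate the matched URLs, and array_slice from $plan['skip'] for $querysize entries. It still needs fopen wrappers and it's still scraping, so the rest of the list stands.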
Calls to fetchTeomaResults are usually wrapped in another function that accesses SERPs in general and aggregates the results. The function signature conforms to that wrapper; otherwise we could just specify a page instead of an offset.
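For the curious, such a wrapper might look roughly like the sketch below. This is not the actual wrapper, just an illustration under assumptions: fetchAggregatedResults is a hypothetical name, every fetcher is assumed to share the ($query, $querysize, $offset) signature, and stub fetchers stand in for real engines so it runs offline.

```php
<?php
// Hypothetical aggregator: $fetchers maps an engine label to any callable
// with the signature ($query, $querysize, $offset), e.g. 'fetchTeomaResults'.
function fetchAggregatedResults(array $fetchers, string $query, int $querysize, int $offset = 1): array {
    $seen = [];    // URLs already collected, used for de-duplication
    $merged = [];
    foreach ($fetchers as $engine => $fetch) {
        foreach ($fetch($query, $querysize, $offset) as $url) {
            if (!isset($seen[$url])) {
                $seen[$url] = true;
                // credit the first engine that returned this URL
                $merged[] = ['url' => $url, 'engine' => $engine];
            }
        }
    }
    return $merged;
}

// Stub fetchers so the sketch runs without touching the network.
$fetchers = [
    'teoma' => function ($query, $querysize, $offset) {
        return ['http://a.example/', 'http://b.example/'];
    },
    'other' => function ($query, $querysize, $offset) {
        return ['http://b.example/', 'http://c.example/'];
    },
];
$results = fetchAggregatedResults($fetchers, 'teoma api', 10, 1);
// $results holds a and b (credited to 'teoma') followed by c (credited to 'other').
```

Because every engine takes an offset rather than a page number, the wrapper doesn't need to know how many results each engine puts on a page.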
Enjoy.