Saturday 30 June 2007

1400+ PHP Link Directories, catalogued for you

So, in advance of a little scripting, here's a list of over 1400 installations of the PHP Link Directory. I'd like to give more - there are easily 20k out there - but sadly Yahoo! and Google's search APIs don't like queries past result number 1000. In fact, they positively hurl their dummies out of their respective cradles.

The spreadsheet containing the results has the following information:


  • Submit URL

  • Cost of submission

  • Reciprocal link code

  • Whether or not the site uses a captcha

phpld-20070630.xls

It's spartan, but functional, and certainly open to further use. I was surprised by how many of these directories are wide open; further code is coming.

There's no source with this post, as the abomination that created the data was truly awful and probably still will be next time round. I even spent time in Excel updating individual entries. Big to-do list entries include: add homepage PR, make the backlink scraping code more accurate, de-dupe by domain rather than hostname, add an express submission price column, add a flag to detect whether unique sessions are required for submission, and detect the phpLD version.
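As a taste of the "de-dupe by domain and not hostname" item, here's a rough sketch. The function names guessDomain and dedupeByDomain are made up for this post, and the list of two-label suffixes is a tiny illustrative sample; anything serious should consult the Public Suffix List instead of guessing.

```php
<?php
// Hypothetical helper: a crude guess at the registrable domain of a URL.
// ASSUMPTION: the suffix list below is an incomplete sample, not real data.
function guessDomain($url) {
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $labels = explode('.', $host);
    $twoLabelSuffixes = array('co.uk', 'org.uk', 'com.au');
    // Keep three labels for hosts under a two-label suffix, else two.
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, $twoLabelSuffixes) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

// Keeps the first URL seen for each guessed domain, in input order.
function dedupeByDomain(array $urls) {
    $seen = array();
    foreach ($urls as $url) {
        $domain = guessDomain($url);
        if ($domain !== '' && !isset($seen[$domain])) {
            $seen[$domain] = $url;
        }
    }
    return array_values($seen);
}
```

With this, www.example.com and dir.example.com collapse into a single entry, which hostname-based de-duping would have kept apart.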

This list's probably very abusable. For example, those lovely chaps at the PHP Link Directory could abuse it to check that everyone's bought a license. In fact, this list should only contain the cheapskates that haven't paid to remove the link to the software creators, but hey, who am I to judge.

Friday 29 June 2007

Yahoo! Stores - hard coded duplicate content

I read this post on the Yahoo! Stores blog. The Yahoo! Stores blog is there to scratch the surface of online marketing for merchants new to the scene; it's probably great for giving people an introduction to subjects, and leads to follow up, but for old dogs there's not a huge amount of new information. It certainly lets us see that Yahoo!'s helping its merchants, and that they're doing well from that help and the amazing Yahoo! store system. Anyway, in this SEO-oriented post, Karl Ribas brought up some valid points, including a little intro to duplicate content:

Duplicate content was a pretty big concern at SMX, as having non-unique content on your website is quickly becoming a bigger and bigger problem for online merchants. ... From a search engine’s point-of-view, their one and only goal is to serve a variety of quality results per query, not multiple versions of the same content


Fantastic advice!

Yahoo! stores have cleverly helped us out here. As we all know, visiting the root URL - / - of a domain should really show the homepage; no redirects, no frames, just a plain and easy HTTP 200 response with some good content. And, to their credit, Yahoo! have managed this millimetre-scale hurdle.

Now, we also know that in most circumstances, it's great to have a link to your homepage on every page in your site, right? After all, it's the most important page, and where people like to navigate from - so it's good to provide a link to it in case they get lost.

Yahoo! have cottoned on to this little nugget of wisdom, and kindly added a link named "home" to the homepage of a site on every one of its sub-pages. Well, kind of. In fact, it's a hard-coded link, using the link text "home" (also hard-coded - heaven forbid anyone decides that using all-lower-case looks awful, or would prefer slightly less homogeneous link text here):

Dear Zack,

I think you might've got your wires crossed. I'd like to change the
small H is my yahoo stores / store editor system, I don't really mind
about yahoo web hosting. There are options to change all the other
tabs, but the name for the "home" page seems kind of elusive, even
though intuitively I expected them to be in the same place. Could you
check and come back to me ?


Hello Leon,

Thank you for contacting us.

It's not possible to change the 'H' in the navigation bar because the
links are hard-coded into the store software.

We apologize for the inconvenience.


This locked-down and widely shown link points to some strange, new page that's mentioned nowhere else in the store - to "/index.html".

<ul id="nav-general"><li><a href="index.html">home</a></li>


"What's this new-fangled index.html?" I hear you cry. "Where's my homepage?". Well, don't worry! Yahoo!'s kindly duplicated your homepage content for you onto this new URL. So search engines can NOT ONLY get your stuff at the root page, as standard, but now you'll find your link weight directly split between links to / - added by you - and links to /index.html - forcibly inserted by Yahoo!.

What do Yahoo! think of this? Can we get it changed?

Hello Leon,

Thank you for writing to Yahoo! Store Support.

Although this feature is not currently available in the Yahoo! Store
software, we do consider your feedback regarding the features you'd like
to see a very important part of how our development team decides which
features to add to the Yahoo! Store software.

We do not currently have an estimated time for if or when this feature
or any other features may be released. However, we do release a regular
newsletter to all of our merchants at the following link:

http://www.insightsforum.com/

You can see previous copies of the newsletters at:

http://store.yahoo.com/vw/merchant-newsletter.html

We appreciate your feedback. We've forwarded your comments to our
development team for review.

We believe this solution should resolve your issue, if it still
persists, please call us at 1-866-800-8092.

Please do not hesitate to reply if you need further assistance.

Regards,

Andre


Thanks Andre! I'm not sure what led you to believe it should resolve my issue; I'm fairly sure you just told me that it wasn't resolvable. Have you tried visiting http://www.insightsforum.com/ ? I'll save you the trouble:

Bad Request (Invalid Hostname)



Well, maybe the archive mentioned has something useful. Let's take a look at the last post:

February 2005-
Note: Insights has switched formats. While you will continue to receive monthly newsletters, all articles are archived on the Insights Forum site rather than a single HTML file.


Thanks Yahoo!. That's pretty good.

Will you stop duplicating my content soon please?

Microsoft AdCenter $50 free clicks

For new users only; a $5 account creation deposit is required. Expires in about 36 hours, so good luck.

http://www.startadcenter.com/MulttipTrav/

Teoma / Ask scraping code

Alas, Teoma's search API is down. If it ever returns, you can find great Teoma Search API documentation. In the meantime, here's some code to do the scraping for you:


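/////
// fetches URLs from Teoma results for the query $query
// string $query is the search query
// int $querysize is the number of results needed (a page holds at most ten)
// int $offset says where the results should begin from (put 10 to get results 11-20)
// returns an array of result URLs, or an empty array if the request fails

function fetchTeomaResults($query, $querysize, $offset) {

    // Ask serves ten results per page, so map the offset onto a page number
    $page = 1 + intval($offset / 10);
    $requestUrl = 'http://www.ask.com/web?q='.urlencode($query).'&page='.$page;

    // temporarily swap the user agent, then put the old one back
    $oldua = ini_set('user_agent', 'Please bring back http://xml.teoma.com/.');
    $response = file_get_contents($requestUrl);
    ini_set('user_agent', $oldua);

    // file_get_contents returns false on failure
    if ($response === false) {
        return array();
    }

    // result links look like <a id="r1_t" href="...">
    preg_match_all('|<a id="r[0-9]+_t" href="(.+?)"|', $response, $matches);

    $results = array_slice($matches[1], 0, $querysize);

    return $results;
}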



It's dirty, nasty, and many other mean things. For example,

  • anyone sane wouldn't enable fopen wrappers;

  • the user agent is a little non-standard;

  • there's no HTTP From: header;

  • it's quite possibly against Ask TOS;

  • $offset should be a multiple of ten, because I can't be bothered writing preference setting code and the number of results per page isn't controllable via URL (as far as I can see);

  • scraping is never a permanent solution


- and other things. Bring back xml.teoma.com!
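A couple of those complaints (the odd user agent and the missing From: header) could be tidied up with a stream context rather than ini_set. This is only a sketch: politeContext is a made-up name, the agent string and address in the usage line are placeholders, and it still needs fopen wrappers enabled.

```php
<?php
// Hypothetical helper: builds an HTTP stream context that sends an explicit
// User-Agent and a From: header. The values passed in are the caller's own.
function politeContext($userAgent, $fromAddress) {
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            'header'     => 'From: ' . $fromAddress . "\r\n",
            'timeout'    => 10, // seconds; an arbitrary choice
        ),
    ));
}

// Usage (placeholder values, not a real scraper identity):
// $ctx = politeContext('ExampleScraper/0.1', 'webmaster@example.com');
// $response = file_get_contents($requestUrl, false, $ctx);
```

Passing the context per request also avoids fiddling with the global user_agent setting and restoring it afterwards.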


Use of fetchTeomaResults is usually wrapped up by another function, for accessing SERPs in general and aggregating results. The function signature conforms to this - else we could just specify a page instead of an offset.
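For the curious, a wrapper along those lines might look something like the sketch below. It isn't the real one: collectResults is a name invented for this post, and it takes the fetcher as a callable so that anything sharing the (query, querysize, offset) signature can be plugged in.

```php
<?php
// Hypothetical wrapper: gathers up to $wanted unique URLs by calling
// $fetcher($query, $pageSize, $offset) with offsets 0, 10, 20, ...
function collectResults($fetcher, $query, $wanted, $pageSize = 10) {
    $results = array();
    for ($offset = 0; count($results) < $wanted; $offset += $pageSize) {
        $page = call_user_func($fetcher, $query, $pageSize, $offset);
        if (empty($page)) {
            break; // no more results, or the fetch failed
        }
        $before = count($results);
        $results = array_values(array_unique(array_merge($results, $page)));
        if (count($results) === $before) {
            break; // the page held nothing new, so stop rather than loop forever
        }
    }
    return array_slice($results, 0, $wanted);
}

// Usage: $urls = collectResults('fetchTeomaResults', 'php link directory', 30);
```

A sleep between requests would be the polite thing to add before pointing this at a live search engine.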

Enjoy.

 