SEO Rant: June 2008

Unique content's a good thing, right? Not always.

To get well-written text, you need to invest some time, or pay someone skilled to do it for you. Both are expensive and slow exercises, so many SEOs choose to take a shortcut and "spin" articles. This generally involves taking one source text, and altering the language inside it to "create" a new article. The better spinning programs are aware of things like grammar rules, word frequencies in various languages, set phrases and idioms, and so forth. The worse - and predominant - spinners simply perform synonym replacement, which produces this kind of mapping:

"We walked to the large house" would be replaced by "We ambulated to the gigantic residence".

The latter looks ugly, is awkward to read, and generally doesn't fulfil any kind of quality standard - though it is different, thus helping create unique content, and the meaning remains pretty much the same. The massive failing here, though, that utterly defeats the goal of those using spinners, is how trivial the product is to detect.

In every language, there are common words, rare words, and everything between. The probability of individual words occurring in a piece of text is fairly constant, being skewed a bit depending on the document's type and domain (agricultural reports will be more likely to contain terms about farming, horticulture, and plant and chemical proper nouns, for example). The core set of terms, and their frequencies, will remain the same.

The direct synonym replacement used by typical spinning programs (spinners) has two problems. Firstly, one must bear context in mind when picking alternative words. For example, "junk" (used as a noun) can also be:

"boat, clutter, debris, discard, dope, dreck, dump, flotsam, garbage, jetsam, jettison, litter, refuse, rubbish, salvage, scrap, ship, trash, waste"

Depending on whether we're using this word to talk about a sailing vessel or a piece of rubbish, we can divide this set of alternatives into two distinct groups with different meanings. Simply using a thesaurus to pick a random replacement word will often change the meaning of a sentence. "I thought your product was a heap of junk" is not semantically equivalent to "I thought your product was a heap of boat". Note how simplistic substitution also makes the sentence grammatically incorrect in this case.
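To make the failure concrete, here is a minimal sketch of the kind of context-blind synonym replacement described above. The tiny THESAURUS dictionary and the naive_spin function are made up for illustration; they are not taken from any real spinning program.

```python
import random

# Hypothetical toy thesaurus; a real spinner would ship a far larger one.
# Note that the "junk" entry mixes the "rubbish" and "sailing vessel" senses.
THESAURUS = {
    "walked": ["ambulated", "strolled"],
    "large": ["gigantic", "sizeable"],
    "house": ["residence", "dwelling"],
    "junk": ["boat", "clutter", "debris", "garbage", "rubbish", "ship", "trash"],
}


def naive_spin(sentence, rng):
    """Swap every word found in the thesaurus for a randomly chosen synonym.

    No attention is paid to context, word sense or grammar, which is
    exactly the weakness discussed above.
    """
    spun = []
    for word in sentence.split():
        options = THESAURUS.get(word.lower())
        spun.append(rng.choice(options) if options else word)
    return " ".join(spun)


rng = random.Random()
print(naive_spin("I thought your product was a heap of junk", rng))
```

Run it a few times and "boat" or "ship" will turn up sooner or later, because nothing in the lookup knows which sense of "junk" was meant.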

The second problem with direct synonym replacement is that it doesn't care about the probability distribution of words in a language. This always leads to the inclusion of rarer words, and exclusion of more common ones. In the example above, we used "ambulate" instead of "walk"; the former is a comparatively rare word. Using it makes the sentence more awkward, and harder to read (protip: always use the simplest language that you can).

Just to show how easy spun pages are to spot, let's find one, and take it apart, then see how abnormal it is. We're going to first find how frequent words are in English, and then use them to compare a spun article to a previous post on my blog.

The reference frequency list we use to represent general English comes from the British National Corpus. This is in British English, so we'll make things fairer by Anglicising the spun document, making "color" into "colour", "center" into "centre", and "-ize" into "-ise".
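The Anglicising step could be sketched roughly as below. This is an assumption about how one might do it, not the procedure actually used for the post; WORD_MAP, KEEP and anglicise are illustrative names, and both lists are deliberately tiny.

```python
import re

# Hypothetical whole-word replacements; a real list would be much longer.
WORD_MAP = {"color": "colour", "center": "centre"}

# Words that end in "-ize" in British English too. Incomplete on purpose.
KEEP = {"size", "prize", "seize", "capsize"}


def anglicise(token):
    """Return a rough British English spelling of a single word."""
    token = token.lower()
    if token in WORD_MAP:
        return WORD_MAP[token]
    if token in KEEP:
        return token
    # Crude suffix rule: -ize, -izes, -ized, -izing become -ise, -ises, ...
    return re.sub(r"iz(e|es|ed|ing)$", r"is\1", token)


print([anglicise(w) for w in ["color", "optimize", "size", "walk"]])
```

A blanket suffix rule like this is blunt (inflected forms such as "sized" would still be mangled), so any serious version needs a proper word list.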

We'll mathematically compare both the spun and unspun text against this reference model of the English language. This can be done by first building a list of the words used in a document, and then counting how many times each one occurs in it. Dividing the count of a word by the total number of tokens gives the probability that any random token from the document will match that word. We'll also have a list of these probabilities from the British National Corpus (BNC). To compare the two, we'll take the absolute difference between the measured and reference probabilities for each term, and express that as a percentage of the reference probability. As these percentages get pretty high, and to reduce the impact of any anomalous data, we'll also take the log (base 10) of the difference measure.
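The measure just described can be sketched in a few lines of Python. The original was put together with some quick code and Excel, so this is a reconstruction under assumptions rather than the actual script: the tokenising regex, the function names and the decision to skip words with no reference figure are all mine.

```python
import math
import re
from collections import Counter


def word_probabilities(text):
    """Map each word to (occurrences of the word / total tokens in the text)."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def difference_table(text, reference):
    """Compare a document's word probabilities with reference probabilities.

    Returns rows of (word, prob, ref_prob, difference_percent, logdiff),
    sorted with the largest difference first. Words missing from the
    reference are skipped, since there is nothing to divide by.
    """
    rows = []
    for word, prob in word_probabilities(text).items():
        ref = reference.get(word)
        if ref is None:
            continue
        diff = abs(prob - ref) / ref
        # log10 is undefined at zero, so an exact match gets no log score.
        logdiff = math.log10(diff) if diff > 0 else None
        rows.append((word, prob, ref, diff * 100, logdiff))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Toy reference figures, invented purely for this example.
toy_reference = {"the": 0.25, "cat": 0.1}
for row in difference_table("the cat sat on the mat", toy_reference):
    print(row)
```

Note that the log is taken on the ratio itself, not on the percentage figure. That choice is inferred from the tables in this post: a probability of 0.0125 against a reference of 9.00E-09 gives a log10 of about 6.14 only when computed that way.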

Spun document

Taken from Good Articles Recommend Top Rank by SEO, an almost illegible and probably spun document (it may possibly be badly translated by someone with a newfound love for thesauri, though given the topic domain - SEO - this seems only minimally likely).

word	freq	prob	bncprob	difference	logdiff
seo	5	0.0125	9.00E-09	138888791.10%	6.142667198
spell-check	1	0.0025	1.80E-08	13888789.11%	5.142664384
overusing	1	0.0025	1.90E-08	13157794.88%	5.119183112
copywriting	2	0.005	5.70E-08	8771829.78%	4.943090196
scruffily	1	0.0025	5.90E-08	4237188.04%	4.627077738
overeat	1	0.0025	8.80E-08	2840809.03%	4.453442039
well-crafted	1	0.0025	1.38E-07	1811494.22%	4.258036952
proofread	1	0.0025	3.05E-07	819572.12%	3.913587174

Full dataset

We can see a few words sticking out here. Some give an indication of the document's topic (SEO, copywriting) while others are quite bizarre (scruffily, overeat). The difference column shows the magnitude of frequency variation from what's expected - a difference of zero means that a word occurs just as frequently in this text as it does in the British National Corpus; a difference of 100% means that the word occurs twice as often as expected (a word occurring half as often as expected would score 50%). Note how the words that stick out hugely aren't that congruent as a set - overeating and scruffiness have little to do with copywriting, spell checking and proofreading.

Differences measured this way will be skewed rapidly by any rarer words that come into a document, and every document that has something to say will have to cover some topics using rarer words that don't fit the curve perfectly - this would be expected. So, using this measure, a non-zero difference score is inevitable; logs have been taken to smooth differences in scale. What is significant is where and how much the differences are.

The mean difference from standard English for the spun document is 947349.85%, and the mean of the logs of the difference measure is ~1.46. These numbers show us how different the words in the spun document are from what would be expected in general language.
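The two summary figures can be computed from a column of per-word difference percentages along these lines. This is a sketch under assumptions: summarise is a made-up name, and skipping zero differences when averaging the logs is my choice, since the post doesn't say how exact matches were handled.

```python
import math


def summarise(differences_pct):
    """Return (mean difference in percent, mean of log10 of the ratios).

    differences_pct holds one difference score per word, as a percentage.
    Zero differences are left out of the log mean because log10(0) is
    undefined; they still count towards the plain mean.
    """
    mean_diff = sum(differences_pct) / len(differences_pct)
    logs = [math.log10(d / 100) for d in differences_pct if d > 0]
    mean_log = sum(logs) / len(logs) if logs else float("nan")
    return mean_diff, mean_log


# Invented scores, only to show the call.
print(summarise([100.0, 1000.0, 25.0]))
```

The plain mean is dominated by the handful of very rare words at the top of the table, which is why the log mean is the more stable of the two numbers to compare between documents.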

Unspun document

Taken from my overall vaguely positive Ubercart review.

word	freq	prob	bncprob	difference	logdiff
cron	2	0.002444988	9.00E-09	27166431.27%	5.434032591
www	2	0.002444988	1.90E-08	12868256.85%	5.109519721
php	1	0.001222494	2.90E-08	4215396.07%	4.624838387
firewall	1	0.001222494	3.80E-08	3216989.14%	4.507449594
uploading	1	0.001222494	3.90E-08	3134499.74%	4.496168238
todos	1	0.001222494	4.80E-08	2546762.24%	4.405988403
upload	1	0.001222494	5.60E-08	2182924.83%	4.33903878
metadata	1	0.001222494	5.80E-08	2107648.07%	4.323798095
poin	1	0.001222494	5.80E-08	2107648.07%	4.323798095

Full dataset

We can guess from the top differences here that the document is related to computing and fairly technical. The biggest differences in word probability are in the range of, say, 1e6 - 2.7e7, a lot less than the top four in the spun document, which ran from 8e6 all the way up to 1.4e8. The mean difference is roughly half that of the spun document (524475.55% against 947349.85%), and the mean log difference is again noticeably smaller (1.18 against ~1.46).

Comparison

For good measure, and to illustrate this point clearly, here's a graph. The red line is the spun document, the blue one the unspun one. For a document that completely followed average word frequency, you'd see a line at y=0 (i.e., a flatline).

Visual comparison of terms in a spun and unspun document

This shows that the spun document's use of English is consistently more unusual than that of the human-written (unspun) document; the red line is higher than the blue one, and the higher a point is, the more it varies from the British National Corpus' survey of English usage. For reference, that covers ~100 million words in around 4000 documents, so it's a fairly good source of comparison data.

We all know about term frequencies (TF); it seems fair to guess that search engines have models of these, and that it's not computationally intense for them to use TF as one tool to distinguish spam from useful content. When one can pick out spun content so easily (this system took ~40 minutes of coding and juggling in Excel to make it look pretty, for one guy), there's really no point bothering to add it to your site.

Of course, a sophisticated document spinner is definitely possible to construct. My point here is, the cheap and common ones only provide a massive bright flag that your site is spam. Avoid them.

Data

A full set of all the produced data, in a pretty and readable format, including the full keyword data, and a large graph, is available online here. The texts actually used for comparison are here (unspun) and here (spun).

Further information

If you feel like exploring English word frequencies and getting into that long tail, I can't recommend anything more highly than Wordcount.org.