
Archive for September, 2008

Case studies, blog postings, research papers, tools info and eBook notes

On the Unimportance Of Keyword Density

In short, not much importance should be attached to keyword density¹. The theory (myth?) that keyword density matters rests on the assumption that search engines judge the importance of a term from two quantities:

  • Keyword density (the local weight). The number of occurrences of a term in its containing document², divided by the total number of terms in that document.

  • Global weight. Derived from the number of documents within the whole document collection (in Google’s case, the entire web!) that contain one or more occurrences of the term.
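As a rough illustration of the keyword density definition above, here is a minimal sketch. The `tokenize` and `keyword_density` helpers and the sample text are hypothetical, and the simple regex tokenizer is an assumption; a real search engine's tokenization is certainly more involved.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(term, text):
    """Occurrences of `term` divided by the total number of terms in `text`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


# Hypothetical sample document: "books" occurs 3 times among 10 terms.
doc = "Cheap books online. Buy cheap books and rare books today."
density = keyword_density("books", doc)
print(density)
```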

The global weight is usually a log value (base 10 here), i.e.

  Number of documents      Number of those documents     Global weight
  in the collection (N)    that contain the term (ni)    log(N/ni)

  1000                     1                             3
  1000                     10                            2
  1000                     100                           1
  1000                     1000                          0

From the table above, we can see that a term that appears in every document has a global weight of zero. A term that appears in 1 document in a collection of a thousand is 50% ‘weightier’ than one that appears in 10 documents (3 versus 2), and three times as ‘weighty’ as one that appears in 100 documents (3 versus 1).
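The global weight values listed above can be reproduced with a few lines of code. This is only a sketch: the function name `global_weight` is hypothetical, and the base-10 logarithm is inferred from the figures given (1000 documents, 1 match, weight 3).

```python
import math


def global_weight(total_docs, docs_with_term):
    """Return log10(N / ni); zero when the term occurs in every document."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("docs_with_term must be between 1 and total_docs")
    return math.log10(total_docs / docs_with_term)


# Same collection size and document counts as in the figures above.
for n_i in (1, 10, 100, 1000):
    print(n_i, round(global_weight(1000, n_i), 6))
```

The ratio between two weights, not their difference, is what the ‘weightier’ comparison refers to: `global_weight(1000, 1) / global_weight(1000, 10)` is 1.5.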

The term weight (‘importance’) is a function of the local weight (the keyword density) multiplied by the global weight. A few conclusions can be drawn from this.

  • A term repeated n times in a document is not necessarily n times more relevant or meaningful. Google knows this.

  • Global weight is a measure of the specificity of a term over a document collection. Global weight, then, is not a relevancy measure; it is a rarity measure.

  • High search volume terms, or terms with a good conversion ratio, are important regardless of whether these are rare in terms of Global weight.

  • We can use vector calculations to measure the level of similarity between a search query and indexed documents.
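To make the last point concrete, here is a minimal sketch of the vector approach. It assumes the conventional scheme in which a term's weight is its local weight (density) times its global weight, log10(N/ni), and it compares vectors by cosine similarity. The function names and the three tiny documents are hypothetical, not anything a particular search engine is known to use.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, total_docs):
    """Map each term to density * log10(N / ni); unseen terms are skipped."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log10(total_docs / doc_freq[term])
        for term, count in counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-document collection, already tokenized.
docs = [
    "cheap books and rare books".split(),
    "rare coins and stamps".split(),
    "cheap flights and hotels".split(),
]
# Number of documents containing each term (ni).
doc_freq = Counter(term for doc in docs for term in set(doc))
vectors = [tf_idf_vector(doc, doc_freq, len(docs)) for doc in docs]

query = tf_idf_vector("rare books".split(), doc_freq, len(docs))
scores = [cosine_similarity(query, vec) for vec in vectors]
print(scores)
```

Note that "and" occurs in every document, so its weight is zero and it contributes nothing to any score. The ranking reflects term overlap and rarity only, which is exactly the limitation discussed next.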

The real question, however, is whether this is a good way of deriving a relevance measure.

  • Ranking by similarity in such a way cannot incorporate semantics – the meaningful relationships between words.

  • Ranking by similarity in such a way cannot incorporate information content (‘entropy’).

  • Ranking by similarity in such a way cannot incorporate authority.

More successful ways to calculate relevance

In reality, a search engine considers the following things about the content:

  • Existence: the presence of a keyword or key phrase in the text

  • Proximity: the relative distance between keywords or key phrases in the text

  • Positioning: the location of the keyword or key phrase in the text

  • Co-occurrence: the frequency with which terms occur with other terms

  • Subject: the main topic and sub-topics of the text, including related terms such as:

    • Synonyms: similar words e.g., book → tome

    • Hypernyms: more general words e.g., book → publication

    • Hyponyms: more specific words e.g., book → paperback

    • Meronyms: parts e.g., book → page

  • Entropy: the amount of information carried in the text

  • Authority: the trustworthiness of the source
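The first three signals in this list (existence, positioning and proximity) are simple enough to sketch over a tokenized text. This is a toy illustration only: the helper names and the sample sentence are hypothetical, and nothing here is claimed to match how any real search engine measures these signals.

```python
def positions(term, tokens):
    """Return the token offsets at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def min_distance(term_a, term_b, tokens):
    """Smallest gap, in tokens, between any occurrences of the two terms."""
    pos_a = positions(term_a, tokens)
    pos_b = positions(term_b, tokens)
    if not pos_a or not pos_b:
        return None  # at least one term is absent, so proximity is undefined
    return min(abs(a - b) for a in pos_a for b in pos_b)


# Hypothetical sample text, already tokenized.
tokens = "rare books for sale including rare first edition books".split()

exists = "books" in tokens                          # existence
first_position = positions("books", tokens)[0]      # positioning
gap = min_distance("first", "books", tokens)        # proximity
print(exists, first_position, gap)
```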

¹ Keyword density can, however, be useful as a way of filtering out content spam. Unusually high keyword density, extensive repetition or tortuous grammar can be used as evidence of attempts to game the system.

² Note that when we say ‘document’, we are in fact talking about an index of that document.

(Filed in Blog, September 2nd, 2008)