Will the system described in Google’s recent patent become a new ranking algorithm to augment the existing PageRank?
From the very beginning, Google’s distinctive feature was hyperlink-induced popularity ranking. Algorithms that use text content to evaluate the relevance of web documents played a much smaller role. The reasons for this disparity are purely pragmatic: authors of web documents have total control over their content and are free to modify it to deceive ranking algorithms and gain higher positions in search results. Hyperlinks, however, are much less subject to webmasters’ influence and provide a more reliable measure of authority (link weight) and relevance (link anchor).
Now Google has introduced a new way to evaluate the relevance of a web document based on its content, one that might prove resistant to manipulation attempts such as adjusting keyword density or automatically generating keyword-rich web pages. The new system could even become a remedy against MFA (Made for AdSense) sites, which display meaningless, scraped, keyword-rich content alongside paid contextual advertisements.
The new indexing and ranking system is based on the use of phrases. From a user’s point of view, search queries are in most cases phrases or ‘concepts’ rather than sets of keywords. Despite this, conventional indexing systems still rely on individual terms. Indexing of phrases is avoided because identifying all possible combinations of words would require immense computational and memory resources. For example, a lexicon of 200,000 unique words could yield approximately 3.2×10^26 five-word phrases (200,000^5), and no system can store that much data in memory or manipulate it efficiently.
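The size of that estimate can be checked with a few lines of arithmetic. This is only an illustration: the phrase length of five words is an assumption chosen because it reproduces the quoted figure, and the function name is hypothetical.

```python
VOCABULARY_SIZE = 200_000  # lexicon size quoted in the article
PHRASE_LENGTH = 5          # assumption: the length that reproduces 3.2 x 10^26


def ordered_word_sequences(vocabulary_size: int, length: int) -> int:
    """Count every ordered sequence of `length` words drawn from the lexicon."""
    return vocabulary_size ** length


total = ordered_word_sequences(VOCABULARY_SIZE, PHRASE_LENGTH)
print(f"{total:.1e}")  # 3.2e+26
```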
This problem is solved in the new system, which identifies phrases that are sufficiently frequent and distinctive in the crawled documents. By detecting such phrases and marking them as ‘valid’, the system can identify multi-word phrases without having to index every possible combination of words across phrases of varying length.
Another important feature is the ability of phrases to predict the presence of other phrases in a web page. For example, the phrase ‘President of the United States’ indicates that the document most likely also contains the phrase ‘White House’. For every phrase the system creates a corresponding list of related phrases, ordered according to their significance. This enables the system to detect spam pages based on an excessive number of related phrases.
So how does the system work?
The process of indexing includes the identification of phrases and related phrases. The system analyses sequences of words and marks them as ‘good’ or ‘bad’ phrases. ‘Good’ phrases are those that occur frequently across the indexed documents or have a distinguished appearance, e.g. are delimited by markup tags, punctuation or other markers. Another distinguishing feature is the ability of a ‘good’ phrase to predict a related phrase, as in the example above, where ‘President of the United States’ predicts ‘White House’. Some phrases, such as idioms (‘out of the blue’, ‘sitting ducks’, etc.), tend to appear alongside many different and unrelated phrases and cannot predict anything. Idioms and colloquialisms therefore do not count as ‘good’ phrases.
At the end of the indexing process the system produces a list of valid phrases along with a co-occurrence matrix that serves as a predictive measure. The list is estimated to contain about 650,000 phrases.
The list of good phrases, or posting list, has the following structure:
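To make the idea of a predictive co-occurrence measure concrete, here is a minimal sketch. The article does not give the formula, so the ratio of actual to expected co-occurrence used below is an assumption, and the function name and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(documents):
    """Map each co-occurring phrase pair to actual / expected co-occurrence.

    `documents` is a list of sets of phrases. The expected count assumes the
    two phrases occur independently (an assumption, not the patent's formula).
    """
    total_docs = len(documents)
    phrase_counts = Counter()
    pair_counts = Counter()
    for phrases in documents:
        phrase_counts.update(phrases)
        pair_counts.update(combinations(sorted(phrases), 2))
    ratios = {}
    for (a, b), actual in pair_counts.items():
        expected = phrase_counts[a] * phrase_counts[b] / total_docs
        ratios[(a, b)] = actual / expected
    return ratios


# Hypothetical toy collection of four documents.
docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house", "out of the blue"},
    {"out of the blue", "pizza recipe"},
    {"pizza recipe"},
]
ratios = cooccurrence_ratios(docs)
print(ratios[("president of the united states", "white house")])      # 2.0
print(ratios[("out of the blue", "president of the united states")])  # 1.0
```

In this toy data the idiom co-occurs with other phrases no more often than chance would suggest (ratio 1.0), while the two related phrases co-occur twice as often as expected.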
Phrase i: list:(document d, [list: related phrase count][related phrase information])
For each phrase i there is a list of documents d containing i. For each document there is the number of occurrences of the phrases related to i, and a bit vector containing the information about related phrases.
The bit vector consists of pairs of bits, one pair per related phrase. In each pair, a value of 1 in the first position indicates that the related phrase k is present in document d; otherwise the value is 0. The second position indicates whether a phrase l related to phrase k is present. The phrases l related to the related phrases k are called ‘secondary related phrases of i’. The bit vector is very important because it is used to determine the relevance of a document when search results are ranked.
Phrase i: document d: [related phrase counts:{3,4,3,0,0,2,1,1,0}]related phrase bit vector:={11 11 10 00 00 10 10 10 01}
In this example phrase i has 9 related phrases k. Now take a look at the bit vector. The first pair indicates that both the related phrase k1 and one of its related phrases l are present in the document. The fourth and fifth pairs show that neither k4 and k5 nor their related phrases l are found. The last pair shows that although phrase k9 does not occur, one of its related phrases l is present.
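The example entry can be decoded mechanically. The sketch below uses the counts and bit pairs from the example above; the string representation and the function name are assumptions made for illustration, not the storage format described in the patent.

```python
# Values copied from the example posting-list entry.
related_phrase_counts = [3, 4, 3, 0, 0, 2, 1, 1, 0]
related_phrase_bits = "11 11 10 00 00 10 10 10 01"


def decode_bit_vector(bits):
    """Return one (k_present, secondary_present) tuple per related phrase k."""
    return [(pair[0] == "1", pair[1] == "1") for pair in bits.split()]


decoded = decode_bit_vector(related_phrase_bits)
for index, ((k_present, l_present), count) in enumerate(
    zip(decoded, related_phrase_counts), start=1
):
    print(f"k{index}: count={count}, present={k_present}, secondary={l_present}")
```

Note that the first bit of each pair agrees with the counts: it is 1 exactly where the corresponding count is greater than zero.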
For each phrase i the documents d are sorted in descending order of the information-retrieval-type score assigned to them with respect to that phrase. This pre-ranking significantly improves the performance of the system. To calculate the ranking score, the system can employ a link-popularity algorithm such as PageRank.
Phrase Identification. For a detailed description of the process please refer to [1] (paragraphs 0026 – 0102)
The search system receives a query and identifies the phrases in it. Once the set Q of query phrases is created, the system retrieves the posting lists for the query phrases in Q. The posting lists are then intersected to determine which documents appear on more than one list.
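A minimal sketch of the intersection step follows. It assumes each posting list is reduced to a collection of document identifiers and keeps only documents that appear on every query phrase's list; the data and names are hypothetical.

```python
def intersect_posting_lists(posting_lists, query_phrases):
    """Return the documents that appear on the posting list of every query phrase."""
    doc_sets = [set(posting_lists.get(phrase, ())) for phrase in query_phrases]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


# Hypothetical posting lists: phrase -> documents containing it.
posting_lists = {
    "white house": ["d1", "d2", "d4"],
    "press briefing": ["d2", "d3", "d4"],
    "rose garden": ["d4", "d5"],
}

print(sorted(intersect_posting_lists(posting_lists, ["white house", "press briefing"])))
# ['d2', 'd4']
```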
Documents can be ranked according to their bit vector values. A document containing the most relevant phrases has the highest bit vector value and gets the highest ranking. Note that this approach uses the information about related phrases to rank search results, so even documents with low frequency of the query phrase q can get high rankings provided they have sufficiently high frequency of related phrases.
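One way to picture ranking by bit vector value is to read the whole vector as a binary number. The sketch below assumes that related phrases are stored in order of significance, so the leftmost bits weigh most; the three-pair vectors and document names are hypothetical.

```python
def bit_vector_value(bits: str) -> int:
    """Interpret a space-separated bit vector as one binary integer."""
    return int(bits.replace(" ", ""), 2)


# Hypothetical documents, each with bit pairs for three related phrases.
documents = {
    "doc_a": "11 10 00",  # 0b111000 = 56
    "doc_b": "10 11 01",  # 0b101101 = 45
    "doc_c": "00 11 11",  # 0b001111 = 15
}

ranked = sorted(documents, key=lambda doc: bit_vector_value(documents[doc]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

Under this reading, a document that contains the most significant related phrases outranks one that contains a larger number of less significant ones.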
To produce the final ranking score, the ‘body hit’ scores calculated above are combined with ‘anchor hit’ scores in the form of a linear function with adjustable weights, e.g.
Rank = (body hit score)*weight1 + (anchor hit score)*weight2.
For each phrase the indexing system also creates lists of documents in which the given phrase is an anchor in incoming and outgoing links. So the anchor hit score for document d can be calculated as a function of the related phrase bit vectors of the query phrases Q, where Q is an anchor term in a document that references document d.
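The linear combination of body hit and anchor hit scores can be written as a one-line function. The default weights below are placeholders, since the article only says the weights are adjustable.

```python
def final_rank(body_hit_score, anchor_hit_score, weight1=0.5, weight2=0.5):
    """Combine the two scores linearly; the default weights are arbitrary placeholders."""
    return body_hit_score * weight1 + anchor_hit_score * weight2


print(final_rank(10, 4, weight1=0.75, weight2=0.25))  # 8.5
```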
The new phrase-based approach enables the future indexing system to detect and penalize spam documents. A statistical analysis of the document collection shows that a web page normally contains 8 to 20 related phrases. A spam document that tries to deceive a search ranking system with inflated keyword density is expected to contain an excessive number of related phrases, 100 or more. Deviations from the expected number of related phrases can therefore be used to detect and combat spam in search results.
This system can also be applied to identify automatically generated content intended to be displayed along with paid contextual advertisements. This sort of content is often used on MFA (Made for AdSense) sites and is nothing more than a meaningless sequence of keyword-rich text blocks scraped from other websites, RSS feeds or search engine results pages. Although conventional indexing systems are already quite effective at keeping these sites out of search results for popular terms, they can still occasionally appear in results for long-tail terms.
The new indexing and ranking system proposed by Google uses page content (phrases) to rank search results in a way that is highly resistant to manipulation attempts. The properties of a web document used for ranking, i.e. phrases and the relations between them, are influenced by the properties of all the other documents in the index and are therefore outside webmasters’ control.
The phrase-based approach also enhances the ability of search engines to detect unnatural patterns in text content, such as inflated keyword density or scraped content. It also enables search engines to provide more topically focused results by culling documents that cover multiple topics.
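The spam check described here amounts to counting related phrases and comparing the count with what is typical. The sketch below takes the figures 20 and 100 from the text above; treating them as hard cut-offs, and the labels and function names, are assumptions made for illustration.

```python
def count_related_phrases(bits: str) -> int:
    """Count related phrases k marked present (first bit of each pair is 1)."""
    return sum(1 for pair in bits.split() if pair[0] == "1")


def classify_by_related_phrases(count, typical_max=20, spam_min=100):
    """Label a document by its related-phrase count; thresholds are assumptions."""
    if count >= spam_min:
        return "likely spam"
    if count > typical_max:
        return "unusual"
    return "normal"


print(count_related_phrases("11 11 10 00 00 10 10 10 01"))  # 6
print(classify_by_related_phrases(12))   # normal
print(classify_by_related_phrases(150))  # likely spam
```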
The new approach can be used as an augmentation to the existing link-popularity based ranking systems as an additional parameter in the final score formula. Link popularity values are also used to pre-rank documents in posting lists to improve the performance of the search system.
1. Patterson, A.L. “Detecting spam documents in a phrase based information retrieval system”, United States Patent Application, 12.28.2006
Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”?