Web PageRank Prediction with Markov Models

Publié par : Economist
Description : Web PageRank Prediction with Markov Models - Disponible sur l'archive ouverte pluridisciplinaire HAL.

Consulter un extrait ci-dessous
Author manuscript, published in "WWW, Beijing : China (2008)"
Web Page Rank Prediction with Markov Models
Michalis Vazirgiannis
Dimitris Drosos
Pierre Senellart
Akrivi Vlachou
INRIA Futurs
Athens Univ. of
INRIA Futurs &
Athens Univ. of
Orsay, France
Economics & Business
Univ. Paris-Sud
Economics & Business
mvazirg@aueb.gr
Athens, Greece
Orsay, France
Athens, Greece
dimdrosos@aueb.gr
pierre@senellart.com
avlachou@aueb.gr
ABSTRACT
on the similarity of the predicted ranked lists to the actual
ones. We focus on the top-k elements of ranked list of Web
In this paper we propose a method for predicting the rank-
pages, since these pages are more important for Web search.
ing position of a Web page. Assuming a set of successive
The problem of predicting PageRank is partly addressed
past top-k rankings, we study the evolution of Web pages in
terms of ranking trend sequences used for Markov Models
URL features. The authors report experiments for PageR-
training, which are in turn used to predict future rankings.
ank predictions using the extracted features using linear re-
The predictions are highly accurate for all experimental se-
gression; however, the complexity of this approach grows
tups and similarity measures.
linearly in proportion to the number of features used. An
approach that aims at approximating PageRank values with-
Categories and Subject Descriptors
out the need of performing the computations over the entire
H.2.8 [Database Management]: Database Applications -
graph is that of Chien et al. [1]. The authors propose an ef-
Data Mining
to PageRank, based on the evolution of the link structure
General Terms
estimates of cumulative PageRank scores.
Algorithms, Experimentation, Measurement
THE PREDICTIONS FRAMEWORK
Keywords
Rank Change Rate. In order to predict future rankings
Ranking prediction, Markov Models
of Web pages, we have adopted a measure (racer ) we intro-
INTRODUCTION
Given the huge size of the Web graph, computing rankings
normalized rank of page p at time ti.
Given a set of successive Web graph snapshots, for each
matrices whose size is of the order of the size of the Web
page we generate a sequence of rank change rates that indi-
(109 nodes). Moreover, the ranking algorithm should be ap-
cates the trends of this page among the previous snapshots.
inria-00260431, version 1 - 4 Mar 2008
plied on recent Web graph snapshots to guarantee fresh and
We use these sequences of previous snapshots of the Web
accurate results, which implies continuous crawling. This
graph as a training set for learning predictors to forecast
requirement poses a tough problem as it is practically im-
the trends of a Web pages, with the following phases:
possible to crawl the whole Web graph in reasonable time
a. Computing rank trends. We assume a set of succes-
intervals due to its huge size and dynamic nature.
Thus, given a high quality page ranking prediction mecha-
Each Web page is associated with a rank position, indicat-
nism, with known temporal robustness (i.e., prediction accu-
ing its importance in the particular snapshot. Thereafter,
racy deterioration with time), a search engine can optimize
b. Markov Model (MM) states learning. To ensure that
the prediction accuracy falls under a threshold.
Another
the accuracy and the coverage of the MM is high, distinct
important industry motivating rank predictions is advertis-
values appearing in rank change sequences are reduced to a
ing: pages with future high rank are attractive for placing
manageable size by mapping each value to a representative
advertisements.
equi-probable value. These are used as states of the MM .
In this paper we propose a framework that enables pre-
diction of the ranking of Web pages, based on previous rank
positions. We capitalize on trends of the Web pages through
the objective is to partition the data range [r1, rd] of R into n
non-overlapping adjacent partitions Ri each corresponding
the Web graph. We evaluate the prediction quality based
to a data range [rli, rui], with rli, rui the lower and the upper
value of the Ri data range. Our algorithm takes as an input
the set of the distinct racer values R and their frequencies
Copyright is held by the author/owner(s).
F and returns the set S of the states of the Markov Model.
ACM 978-1-60558-085-2/08/04.
(a) OSim for DBLP
(b) KSim for DBLP
(c) OSim for Internet Archive (d) KSim for Internet Archive
Figure 1: Prediction accuracy vs top-k list length: DBLP and Internet Archive datasets
for each pair of consecutive graph snapshots. Based on these
to a reasonable prediction accuracy. The MM is trained
based on previous snapshots of the Web graph and is used
Then for each page p we predict a ranking which is compared
to predict the future rank of a Web page, by matching its
to its actual ranking after using 10-fold cross validation.
current ranking change rate sequences to the MM paths.
In Figure 1, we present the prediction accuracy of the
framework for 1st- and 2nd-order MM s, compared to the
cessive crawls with respective snapshots.
static baseline scheme. The prediction quality is depicted
page, assuming it survives in all graph snapshots, a se-
by the respective OSim and KSim values. For each experi-
ment we calculate the similarity between the predicted and
sequences are used to construct an m-order Markov Model
the actual ranking and measure the prediction quality for
bilities for every path, using the generated racer sequences,
Figure 1 illustrate the predictive quality of the framework
the future racer values can be predicted using the chain
rule: we predict the most probable value X based on the
ond one with KSim. In both cases, our predictions outper-
form the baseline. Our predictions for the Internet Archive
dataset can be seen in the last two graphs of Figure 1. Again
the proposed framework outperforms the static baseline for
future racer values for each Web page pj and therefore the
both similarity measures. The 2nd-order MM performs sys-
future ranking position of pj .
tematically better than the 1st-order MM.
EXPERIMENTAL EVALUATION
CONCLUSION
In this paper, we propose a method for predicting the fu-
We performed experiments on two real-world datasets: a
ture rank of a Web page. We conducted experiments on real-
bibliometric citation graph derived from DBLP (http://
world datasets that yield very encouraging results, achieving
dblp.uni-trier.de/) and a subset of the European Internet
high prediction accuracy. Our predictions outperform the
Archive (http://www.europarchive.org/ukgov.php). In our
various baseline approaches we have adopted for all similar-
experiments, we evaluate the prediction quality in terms of
ity measures and MM orders, across all datasets used. This
similarities between the predicted top-k ranked lists and the
framework is thus a meaningful tool for rank predictions in
actual ones, with two classical similarity measures for com-
inria-00260431, version 1 - 4 Mar 2008
the Web graph. Further work will focus on the following
paring top-k lists [3]: OSim (that captures the degree of
issues: i. Conduct experiments for query-based ranked lists,
overlap between the lists) and KSim (that captures the de-
Elaborate on statistical learning methods for a priori
gree of agreement between the lists).
selection of the optimal MM parameters.
We built a graph structure from the DBLP dataset as fol-
lows: Nodes of the graph represent a publication and Edges
represent the citations between papers, creating thus twelve
REFERENCES
[1] S. Chien, C. Dwork, R. Kumar, D. R. Simon, and
ods. Note that the structure of the DBLP graph is highly
D. Sivakumar. Link evolution: Analysis and algorithms.
no links can appear for old papers to new papers. The In-
[2] J. V. Davis and I. S. Dhillon. Estimating the global
ternet Archive dataset comprises of approximately 500, 000
PageRank of Web communities. In Proc. KDD,
pages and refers to weekly collections of eleven UK gov-
Philadelphia, USA, Aug. 2006.
ernment websites. We obtained 24 graph snapshots evenly
[3] T. H. Haveliwala. Topic-sensitive PageRank. In Proc.
distributed in time between Mar. 2004 and Jan. 2006.
WWW, Honolulu, USA, May 2002.
We evaluate the prediction performance of the predic-
[4] M.-Y. Kan and H. O. N. Thi. Fast webpage
tion framework involving both datasets. We also consider
a baseline prediction schemes, static, that just returns the
Bremen, Germany, Oct. 2005.
top-k list of the previous snapshot (i.e., we consider that all
[5] A. Vlachou, K. Berberich, and M. Vazirgiannis.
pages remained at the same rank). We computed PageRank
Representing and quantifying rank-change for the Web
scores for each snapshot of the DBLP and Internet Archive
datasets, ordered the pages and calculated the racer values
Web PageRank Prediction with Markov Models
Publier sur Facebook Publier sur Twitter
Informations
Date :

03/02/2011


Langue :

Anglais


Pages :

2


Consultations :

8637


Note :  
Téléchargement Gratuit
  • Votre email n'est pas valide

    Vous devez valider les conditions d'utilisation

    J'accepte les conditions d'utilisation

Résumé

Auteur : Michalis Vazirgiannis, Dimitris Drosos, Pierre Senellart, Akrivi Vlachou


Tags : Article de recherche, Informatique
Sur le même thème
Vues : 5721
Description :
Les fusions-acquisitions et l'analyse économique du droit : approche comparée France-Etats-Unis - Disponible sur l'archive...
Vues : 2501
Description :
La sidérurgie française 1945-1979. L'histoire d'une faillite. Les solutions qui s'affrontent. La politique gouvernementale...
Vues : 2225
Description :
Le commerce en ligne des oeuvres d'art - Disponible sur l'archive ouverte pluridisciplinaire HAL.
Vues : 1102
Description :
Qu'est-ce qu'une firme (-réseau) ? - Disponible sur l'archive ouverte pluridisciplinaire HAL.
Vues : 933
Description :
Les effets des fluctuations du prix du pétrole sur les marchés boursiers dans les pays du Golfe. - Disponible sur l'archive...
Vues : 705
Description :
Les centres d'appel en France : mobilisation et mobilité des salariés face à un système hybride de travail - Disponible sur...
Du même contributeur
Vues : 3958
Description :
Tourisme et internationalisation : le cas du groupe Accor - Disponible sur l'archive ouverte pluridisciplinaire HAL.
Vues : 2203
Description :
L'intérim : un secteur dual, entre protection et précarité - Disponible sur l'archive ouverte pluridisciplinaire HAL.
Vues : 2171
Description :
Les principaux courants de pensée économique. http://creativecommons.org/licenses/by-sa/3.0/
Vues : 1884
Description :
Les évolutions du cadre juridique du droit de la formation professionnelle continue : un changement de paradigmes ? -...
Vues : 1624
Description :
Depuis 2008, le pétrole attire encore une fois l’attention de la communauté internationale en raison des fortes variations...
Vues : 1316
Description :
Les théories de la justice distributive. Comme dans les théories déjà présentées, un des problèmes est encore de savoir...
Commentaires
Aucun commentaire pour cette publication
Ajouter un commentaire
Envoyer
Pour envoyer la page de votre document, notez ici les emails destinataires de votre demande :
Séparez les emails par des virgules
Signaler un abus
Vous devez vous connecter ou vous inscrire pour noter un document.
Cliquez ici pour vous inscrire.
Vous devez vous connecter ou vous inscrire pour ajouter un commentaire.
Cliquez ici pour vous inscrire.
Vous devez vous connecter ou vous inscrire pour envoyer le document.
Cliquez ici pour vous inscrire.
Vous ne pouvez pas acheter de documents sur Needocs.
Vous pouvez vous référer aux conditions générales de vente et d'achat du portail pour connaître les modalités d'achat.