Author manuscript, published in "WWW, Beijing : China (2008)"
Web Page Rank Prediction with Markov Models
Athens Univ. of Economics & Business
INRIA Futurs & Athens Univ. of Economics & Business
ABSTRACT
In this paper we propose a method for predicting the ranking position of a Web page. Assuming a set of successive past top-k rankings, we study the evolution of Web pages in terms of ranking trend sequences, which are used to train Markov Models that in turn predict future rankings. The predictions are highly accurate for all experimental setups and similarity measures.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications

General Terms
Algorithms, Experimentation, Measurement

Keywords
Ranking prediction, Markov Models

INTRODUCTION
Given the huge size of the Web graph, computing rankings requires computations over matrices whose size is of the order of the size of the Web (10^9 nodes). Moreover, the ranking algorithm should be applied on recent Web graph snapshots to guarantee fresh and accurate results, which implies continuous crawling. This requirement poses a tough problem, as it is practically impossible to crawl the whole Web graph in reasonable time intervals due to its huge size and dynamic nature. Thus, given a high-quality page ranking prediction mechanism with known temporal robustness (i.e., prediction accuracy deterioration with time), a search engine can optimize its ranking computations, refreshing them only when the prediction accuracy falls under a threshold. Another important industry motivating rank predictions is advertising: pages with a future high rank are attractive for placing advertisements.

In this paper we propose a framework that enables prediction of the ranking of Web pages based on previous rank positions. We capitalize on the trends of Web pages through successive snapshots of the Web graph. We evaluate the prediction quality based on the similarity of the predicted ranked lists to the actual ones. We focus on the top-k elements of the ranked list of Web pages, since these pages are more important for Web search.

The problem of predicting PageRank is partly addressed by Kan and Thi [4], who exploit URL features. The authors report experiments on PageRank prediction using the extracted features and linear regression; however, the complexity of this approach grows linearly with the number of features used. An approach that aims at approximating PageRank values without the need of performing the computations over the entire graph is that of Chien et al. [1]. The authors propose an efficient algorithm to approximate PageRank, based on the evolution of the link structure, yielding estimates of cumulative PageRank scores.

THE PREDICTIONS FRAMEWORK
Rank Change Rate. In order to predict future rankings of Web pages, we have adopted a measure (racer) we introduced in [5], based on the normalized rank of page p at time ti. Given a set of successive Web graph snapshots, for each page we generate a sequence of rank change rates that indicates the trends of this page among the previous snapshots. We use these sequences over previous snapshots of the Web graph as a training set for learning predictors that forecast the trends of Web pages, with the following phases:

a. Computing rank trends. We assume a set of successive crawls with respective snapshots. Each Web page is associated with a rank position, indicating its importance in the particular snapshot. Thereafter, for each page, assuming it survives in all graph snapshots, a sequence of racer values is computed.

b. Markov Model (MM) states learning. To ensure that the accuracy and the coverage of the MM are high, the distinct values appearing in rank change sequences are reduced to a manageable size by mapping each value to a representative equi-probable value. These are used as the states of the MM. Given the set R of distinct racer values, the objective is to partition the data range [r1, rd] of R into n non-overlapping adjacent partitions Ri, each corresponding to a data range [rli, rui], with rli and rui the lower and upper values of the Ri data range. Our algorithm takes as input the set of distinct racer values R and their frequencies F, and returns the set S of the states of the Markov Model.
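As a concrete illustration, the following Python sketch implements the state-learning step and a simple m-order prediction on top of it. All function names are hypothetical, the racer values are assumed to be given (the exact racer formula of [5] is not reproduced here), and buckets are made roughly equi-probable by accumulating value frequencies:

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def learn_states(values, freqs, n):
    """Partition the range of distinct racer values into n adjacent
    buckets of roughly equal total frequency; returns the upper
    boundaries of the first n-1 buckets."""
    pairs = sorted(zip(values, freqs))
    target = sum(freqs) / n          # frequency mass per bucket
    boundaries, acc = [], 0.0
    for v, f in pairs:
        acc += f
        if acc >= target and len(boundaries) < n - 1:
            boundaries.append(v)     # close the current bucket at v
            acc = 0.0
    return boundaries

def to_state(boundaries, value):
    """Map a racer value to the index of its bucket (an MM state)."""
    return bisect_right(boundaries, value)

def train_mm(state_sequences, m):
    """Count m-order transitions: (last m states) -> next state."""
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for i in range(len(seq) - m):
            counts[tuple(seq[i:i + m])][seq[i + m]] += 1
    return counts

def predict(counts, history, m):
    """Most probable next state given the last m states, or None
    if the context was never observed during training."""
    ctx = tuple(history[-m:])
    if ctx not in counts:
        return None
    return counts[ctx].most_common(1)[0][0]
```

The frequency-based partitioning keeps each state roughly equi-probable, which is one way to realize the "representative equi-probable value" mapping described above; the real algorithm may choose boundaries differently.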
Figure 1: Prediction accuracy vs. top-k list length for the DBLP and Internet Archive datasets. Panels: (a) OSim for DBLP; (b) KSim for DBLP; (c) OSim for Internet Archive; (d) KSim for Internet Archive.
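Figure 1 reports prediction quality with OSim and KSim. As a minimal sketch of such top-k comparisons (the helper names are ours, and this is one common variant of KSim, in which each list is extended with the other's missing pages before counting pairwise agreements):

```python
def osim(a, b):
    """Fraction of pages shared by two top-k lists (overlap)."""
    k = len(a)
    return len(set(a) & set(b)) / k

def ksim(a, b):
    """Kendall-style agreement between two top-k lists: extend each
    list with the other's missing pages (appended after its own, in
    the order they hold in the other list), then count the fraction
    of page pairs ranked in the same relative order by both."""
    union = a + [x for x in b if x not in a]
    ra = {p: i for i, p in enumerate(a)}
    rb = {p: i for i, p in enumerate(b)}
    # pages missing from a list get ranks after that list's own pages
    ext_a = {p: ra.get(p, len(a) + rb[p]) for p in union}
    ext_b = {p: rb.get(p, len(b) + ra[p]) for p in union}
    agree = total = 0
    for i, u in enumerate(union):
        for v in union[i + 1:]:
            total += 1
            if (ext_a[u] < ext_a[v]) == (ext_b[u] < ext_b[v]):
                agree += 1
    return agree / total if total else 1.0
```

OSim only measures set overlap, while KSim also rewards agreement on ordering, which is why the two measures are reported separately in Figure 1.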
c. MM training and rank prediction. The racer sequences are used to construct an m-order Markov Model, computing the transition probabilities for every path from the generated racer sequences, with the order m chosen so as to lead to a reasonable prediction accuracy. The future racer values can then be predicted using the chain rule: we predict the most probable next value X based on the m most recent states. This yields future racer values for each Web page pj, and therefore the future ranking position of pj. Thus, the MM is trained on previous snapshots of the Web graph and used to predict the future rank of a Web page by matching its current ranking change rate sequence to the MM paths.

EXPERIMENTS
We performed experiments on two real-world datasets: a bibliometric citation graph derived from DBLP (http://dblp.uni-trier.de/) and a subset of the European Internet Archive (http://www.europarchive.org/ukgov.php). In our experiments, we evaluate the prediction quality in terms of similarities between the predicted top-k ranked lists and the actual ones, with two classical similarity measures for comparing top-k lists [3]: OSim, which captures the degree of overlap between the lists, and KSim, which captures the degree of agreement on their ordering.

We built a graph structure from the DBLP dataset as follows: nodes of the graph represent publications and edges represent the citations between papers, creating thus twelve graph snapshots based on yearly periods. Note that the structure of the DBLP graph is highly constrained: no links can appear from old papers to new papers. The Internet Archive dataset comprises approximately 500,000 pages and refers to weekly collections of eleven UK government websites. We obtained 24 graph snapshots evenly distributed in time between Mar. 2004 and Jan. 2006.

We evaluate the prediction performance of the prediction framework on both datasets. We also consider a baseline prediction scheme, static, that just returns the top-k list of the previous snapshot (i.e., we consider that all pages remained at the same rank). We computed PageRank scores for each snapshot of the DBLP and Internet Archive datasets, ordered the pages, and calculated the racer values for each pair of consecutive graph snapshots. Based on these sequences we trained the MMs. Then for each page p we predict a ranking, which is compared to its actual ranking using 10-fold cross validation.

In Figure 1, we present the prediction accuracy of the framework for 1st- and 2nd-order MMs, compared to the static baseline scheme. The prediction quality is depicted by the respective OSim and KSim values. For each experiment we calculate the similarity between the predicted and the actual ranking and measure the prediction quality for varying top-k list lengths. The first two graphs of Figure 1 illustrate the predictive quality of the framework on the DBLP dataset, the first with OSim and the second one with KSim. In both cases, our predictions outperform the baseline. Our predictions for the Internet Archive dataset can be seen in the last two graphs of Figure 1. Again the proposed framework outperforms the static baseline for both similarity measures. The 2nd-order MM performs systematically better than the 1st-order MM.

CONCLUSIONS
In this paper, we propose a method for predicting the future rank of a Web page. We conducted experiments on real-world datasets that yield very encouraging results, achieving high prediction accuracy. Our predictions outperform the baseline approaches we have adopted for all similarity measures and MM orders, across all datasets used. This framework is thus a meaningful tool for rank predictions in the Web graph. Further work will focus on the following issues: i. conduct experiments for query-based ranked lists; ii. elaborate on statistical learning methods for a priori selection of the optimal MM parameters.

REFERENCES
[1] S. Chien, C. Dwork, R. Kumar, D. R. Simon, and D. Sivakumar. Link evolution: Analysis and algorithms.
[2] J. V. Davis and I. S. Dhillon. Estimating the global PageRank of Web communities. In Proc. KDD, Philadelphia, USA, Aug. 2006.
[3] T. H. Haveliwala. Topic-sensitive PageRank. In Proc. WWW, Honolulu, USA, May 2002.
[4] M.-Y. Kan and H. O. N. Thi. Fast webpage classification using URL features. In Proc. CIKM, Bremen, Germany, Oct. 2005.
[5] A. Vlachou, K. Berberich, and M. Vazirgiannis. Representing and quantifying rank-change for the Web graph.