A method for improving the prediction of next page request of a web user.
Dinuca, Claudia Elena; Ciobanu, Dumitru; Istrate, Mihai et al.
Abstract: In this article we present a way to improve the prediction
of the next page request of a web user obtained with the Page Rank
algorithm. The idea is to apply the Page Rank algorithm only on the
subset of logs that contain the current page. To exemplify this method
we use a set of web logs from the NASA website, which is available
online. We obtain an increase in visitation probability of 12.7
percentage points, from 19.8% to 32.5%.
Key words: web logs, clickstream, page rank, prediction
1. INTRODUCTION
A web site represents a set of interconnected web pages on the Web
and is developed and maintained by a person or organization. While web
sites constitute a medium for communication, publicity and commerce, Web
Mining studies discover and analyze useful information from the Web
(Nong, 2003).
Nowadays, there are many commercial and freeware software packages
that provide basic statistics about web sites, including number of page
views, hits, traffic patterns by day-of-week or hour-of-day, etc. These
tools help ensure the correct operation of web sites (e.g., they may
identify page not found errors) and can aid in identifying basic trends,
such as traffic growth over time, or patterns such as differences
between weekday and weekend traffic (Clark et al., 2006).
With growing pressure to make e-commerce sites more profitable,
however, additional analyses are usually requested.
In this paper we present a method to help improve predictions of the
pages to be visited, in order to create a recommendation system for web
site users. We applied the Page Rank algorithm on the NASA log file in
order to obtain predictions for the next visited page. The method uses a
table of probabilities of the visited pages, which is updated from time
to time depending on the rate at which the website is visited. This
allows real-time calculation of the visiting probabilities of the
following pages, given the page where the user is at a specific time.
The method can be used to create a recommendation system for web sites
and to preload pages in order to speed up request responses. The idea
behind web caching is to maintain a small but highly efficient set of
retrieved results in a cache, so that system performance can be notably
improved because later user requests can be answered directly from the
cache. The recommendation should use the first three pages with the
highest probability of visitation returned by the program.
2. DATA PREPROCESSING
Log files are created by web servers and filled with information
about user requests on a particular Web site. They may contain
information about domains, subdomains and host names; the resources
requested by the user; the time of the request; the protocol used; the
errors returned by the server; and the page size for successful requests.
Because a successful analysis is based on accurate information and
quality data, preprocessing plays an important role. Data preparation
requires between 60 and 90% of the data analysis effort and contributes
75-90% of the success of the entire knowledge extraction process
(Nong, 2003).
User sessions are determined for each IP address or DNS host name. The
log files have entries like these:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
As can be noticed above, each record in the file contains an IP
address or host name, the date and time, the requested resource, the
protocol used, the status code and the number of bytes transferred. The
steps needed for data preprocessing were presented in detail in (Dinuca,
2011). For session identification it was first considered that a user
cannot stay on a page for more than 30 minutes; this value is used in
several previous studies, as can be seen in (Markov & Larose, 2007). The
current study intends to improve session identification by determining
an average page visiting time from the visit durations recorded in the
log files of the site. Thus, for each visited page the visit duration is
calculated as the difference between two consecutive timestamps of the
same user, who is identified by IP. The last page visited by a user (the
record with the highest timestamp) has no following request, so its
visit duration is assigned a predefined value of our choice, 20,000
seconds. We calculated the average visiting time of a page as the mean
of the time spent by different users on that page and used this mean to
better identify sessions. When calculating the average visiting time we
do not take into consideration pages with a visit time shorter than 2
seconds or longer than 20,000 seconds. For our analysis we selected only
those log records that refer to a web page, eliminating the images and
other files loaded along with it, since this information is not
important for the analysis. We kept only pages with a status code of
class 200, i.e. successfully loaded pages. We removed duplicate pages
from sessions and kept for analysis only sessions with more than one
viewed page.
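As an illustration of this preprocessing step, the following is a minimal Java sketch of the session identification described above. Class and method names (LogRecord, SessionBuilder) are our own illustrative choices, not those of the original program; the 30-minute timeout, the 20,000-second placeholder and the 2-second lower bound follow the values given in the text, and duplicate removal is sketched here as dropping consecutive repetitions of the same page.

import java.util.*;

// Illustrative sketch of the session identification step described in the text.
public class SessionBuilder {

    static class LogRecord {
        String ip;        // user identified by IP address or host name
        String page;      // requested web page (non-page resources already removed)
        long timestamp;   // request time, in seconds

        LogRecord(String ip, String page, long timestamp) {
            this.ip = ip; this.page = page; this.timestamp = timestamp;
        }
    }

    static final long SESSION_TIMEOUT = 30 * 60; // 30 minutes, in seconds
    static final long MIN_VISIT = 2;             // visits shorter than 2 s are ignored
    static final long MAX_VISIT = 20_000;        // visits longer than 20,000 s are ignored

    // Average visiting time of each page, computed from the differences between
    // consecutive timestamps of the same user.
    static Map<String, Double> averageVisitTime(List<LogRecord> records) {
        Map<String, List<Long>> durations = new HashMap<>();
        Map<String, List<LogRecord>> byUser = new HashMap<>();
        for (LogRecord r : records)
            byUser.computeIfAbsent(r.ip, k -> new ArrayList<>()).add(r);
        for (List<LogRecord> userRecords : byUser.values()) {
            userRecords.sort(Comparator.comparingLong((LogRecord r) -> r.timestamp));
            for (int i = 0; i + 1 < userRecords.size(); i++) {
                long d = userRecords.get(i + 1).timestamp - userRecords.get(i).timestamp;
                if (d >= MIN_VISIT && d <= MAX_VISIT)
                    durations.computeIfAbsent(userRecords.get(i).page,
                                              k -> new ArrayList<>()).add(d);
            }
        }
        Map<String, Double> avg = new HashMap<>();
        durations.forEach((page, list) ->
            avg.put(page, list.stream().mapToLong(Long::longValue).average().orElse(0)));
        return avg;
    }

    // Split one user's chronologically sorted records into sessions using the
    // 30-minute timeout; consecutive duplicate pages are dropped and sessions
    // with a single page are discarded.
    static List<List<String>> buildSessions(List<LogRecord> userRecords) {
        List<List<String>> sessions = new ArrayList<>();
        List<String> current = new ArrayList<>();
        LogRecord prev = null;
        for (LogRecord r : userRecords) {
            if (prev != null && r.timestamp - prev.timestamp > SESSION_TIMEOUT) {
                if (current.size() > 1) sessions.add(current);
                current = new ArrayList<>();
            }
            if (current.isEmpty() || !current.get(current.size() - 1).equals(r.page))
                current.add(r.page);
            prev = r;
        }
        if (current.size() > 1) sessions.add(current);
        return sessions;
    }
}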
3. METHODS AND RESULTS PRESENTATION
We used the Page Rank algorithm to predict the next visited page.
We consider the current session to be a session in progress and the
current page to be the page the user is on at that moment. To improve
the results we apply the method only on the sessions that contain the
current page. Of all the sessions, we use about 85% for calculating the
page visiting probabilities and we check the accuracy of the results on
the remaining sessions.
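A small sketch of this selection step is given below, assuming the sessions are already available as lists of page codes; the 85% split and the restriction to sessions containing the current page follow the text, while the method names are illustrative only.

import java.util.*;

// Illustrative sketch of the session selection: about 85% of the sessions are
// used for rank computation and, for a given current page, only the sessions
// that contain that page are kept.
public class SessionSelection {

    static List<List<String>> learningPart(List<List<String>> sessions) {
        return sessions.subList(0, (int) (sessions.size() * 0.85));
    }

    static List<List<String>> checkingPart(List<List<String>> sessions) {
        return sessions.subList((int) (sessions.size() * 0.85), sessions.size());
    }

    static List<List<String>> containingPage(List<List<String>> learning, String currentPage) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> session : learning)
            if (session.contains(currentPage)) result.add(session);
        return result;
    }
}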
For the first set of sessions we apply the Page Rank algorithm,
which provides the ranks of the pages of the website. For each page we
determine to which pages the user can navigate and, using the ranks of
those pages, we calculate the probability of visiting each of them by
dividing its rank by the sum of the ranks.
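The following sketch shows this rank-to-probability step; the maps of ranks and of reachable pages are assumed to have been built from the consecutive page pairs of the selected sessions, and all names are illustrative.

import java.util.*;
import java.util.stream.Collectors;

// Illustrative sketch: the visiting probability of each page reachable from the
// current page is its rank divided by the sum of the ranks of all reachable pages.
public class VisitProbabilities {

    static Map<String, Double> probabilities(String currentPage,
                                             Map<String, Set<String>> reachableFrom,
                                             Map<String, Double> ranks) {
        Set<String> next = reachableFrom.getOrDefault(currentPage, Collections.emptySet());
        double sum = next.stream().mapToDouble(p -> ranks.getOrDefault(p, 0.0)).sum();
        Map<String, Double> prob = new HashMap<>();
        for (String p : next)
            prob.put(p, sum > 0 ? ranks.getOrDefault(p, 0.0) / sum : 0.0);
        return prob;
    }

    // The three pages with the highest visiting probability, used for the recommendation.
    static List<String> topThree(Map<String, Double> prob) {
        return prob.entrySet().stream()
                   .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                   .limit(3)
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toList());
    }
}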
We implemented a program in Java using the NetBeans IDE. It receives
the log file in text format, writes the data for each session into a
table, codes the pages, calculates the visiting time of each page and
then the average visiting time of each page, identifies the sessions,
and applies Page Rank first on the whole chosen set of learning sessions
and then only on the learning sessions that contain the current page,
obtaining the visiting probabilities of the three most likely pages to
which the user can navigate from the current page.
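Since the paper does not list the implementation itself, the following is a generic Java sketch of Page Rank computed by power iteration over the graph whose edges are the consecutive page pairs of the selected sessions; the damping factor of 0.85 and the handling of dangling pages are the usual textbook choices and are assumptions, not details taken from the original program. A call such as pageRank(sessionsContainingCurrentPage, 0.85, 50) would produce ranks of the kind shown in Fig. 2.

import java.util.*;

// Generic power-iteration Page Rank over the graph of consecutive page pairs.
public class SessionPageRank {

    static Map<String, Double> pageRank(List<List<String>> sessions,
                                        double damping, int iterations) {
        // Build the directed graph of consecutive page pairs.
        Map<String, Set<String>> out = new HashMap<>();
        Set<String> pages = new HashSet<>();
        for (List<String> s : sessions) {
            pages.addAll(s);
            for (int i = 0; i + 1 < s.size(); i++)
                out.computeIfAbsent(s.get(i), k -> new HashSet<>()).add(s.get(i + 1));
        }
        int n = pages.size();
        Map<String, Double> rank = new HashMap<>();
        for (String p : pages) rank.put(p, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String p : pages) next.put(p, (1 - damping) / n);
            for (String p : pages) {
                Set<String> targets = out.getOrDefault(p, Collections.emptySet());
                if (targets.isEmpty()) {
                    // Dangling page: its rank is spread uniformly over all pages.
                    for (String q : pages)
                        next.merge(q, damping * rank.get(p) / n, Double::sum);
                } else {
                    double share = damping * rank.get(p) / targets.size();
                    for (String q : targets) next.merge(q, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }
}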
For the NASA data set we obtained 5138 sessions after preprocessing;
we used 4486 of them for computing the ranks and 652 for checking the
accuracy of the method. For each page we saved into a table the pages to
which it can navigate, i.e. the pages with the highest probability of
access obtained from the page ranks. In Fig. 1. the table presents a
part of the stored visitation probabilities. Thus, from page 1 the user
goes with probability PR1 to the page with code CP_PR1, with probability
PR2 to the page with code CP_PR2 and with probability PR3 to the page
with code CP_PR3, where PR denotes the probability obtained by applying
the Page Rank algorithm and CP stands for Page Code.
[FIGURE 1 OMITTED]
Next, for each page we calculate the ranks using only the sessions
that contain that page. Some of the ranks obtained can be seen in
Fig. 2.
[FIGURE 2 OMITTED]
The 652 sessions used to verify the results contain in total 3501
pairs of consecutive pages. Of these, as can be seen in Tab. 1., 292 are
matched by the highest-ranked page, 186 by the second-ranked page and
215 by the third-ranked page. The last two columns of the table give the
sum of the first two columns and the sum of the first three columns.
Applying the same check to the ranks computed only on the sessions
containing the current page, we obtained the data shown in Tab. 2.
The probability that the next visited page is among the three pages
indicated by the program was 19.8% when we used all the sessions and
32.5% when we used only the sessions containing the current page.
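This verification step can be sketched as follows: for every pair of consecutive pages in the test sessions, we count a hit when the actually visited next page is among the three pages with the highest visiting probability returned for the current page, and the hit rate is the number of hits divided by the number of pairs (e.g. 693/3501 = 19.8%). The recommender function passed in is illustrative.

import java.util.*;
import java.util.function.Function;

// Illustrative sketch of the verification: the hit rate is the fraction of
// consecutive page pairs in the test sessions for which the actually visited
// next page is among the three recommended pages.
public class AccuracyCheck {

    static double hitRate(List<List<String>> testSessions,
                          Function<String, List<String>> topThreeForPage) {
        int pairs = 0, hits = 0;
        for (List<String> session : testSessions) {
            for (int i = 0; i + 1 < session.size(); i++) {
                pairs++;
                List<String> recommended = topThreeForPage.apply(session.get(i));
                if (recommended.contains(session.get(i + 1))) hits++;
            }
        }
        return pairs == 0 ? 0.0 : (double) hits / pairs; // e.g. 693 / 3501 = 0.198
    }
}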
4. CONCLUSION
The presented method can be used online for prediction,
recommendation and preloading of pages, since the ranks are saved in
tables and can be easily accessed in real time. These tables only need
to be updated from time to time, depending on how intensively the site
is used.
In the performed analysis, time was used only for the identification
of sessions. For improved results, future research will take into
account the order in which the pages appear in a session and how long
the current user stayed on the pages visited before the current page.
5. REFERENCES
Clark, L.; Ting, I.; Kimble, C.; Wright, P. & Kudenko, D. (2006).
Combining Ethnographic and Clickstream Data to Identify User Web
Browsing Strategies, Information Research, Vol. 11, No. 2, paper 249
Cooley, R.; Mobasher, B. & Srivastava, J. (1997). Web Mining:
Information and Pattern Discovery on the World Wide Web, a survey paper,
Proceedings of ICTAI-97
Database NASA Kennedy Space Center Log, available online at
http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
Dinuca, C. E. (2011). The process of data preprocessing for Web
Usage Data Mining through a complete example, Annals of the Ovidius
University, Economic Sciences Series, Vol. XI, Issue 1/2011
Kohavi, R. & Parekh, R. (2003). Ten supplementary analyses to improve
e-commerce web sites, Proceedings of the Fifth WEBKDD Workshop
Liu, B. (2006). Web Data Mining: Exploring Hyperlinks, Contents and
Usage Data, Springer, Berlin Heidelberg New York
Markov, Z. & Larose, D. T. (2007). Data Mining the Web: Uncovering
Patterns in Web Content, Structure and Usage, John Wiley & Sons, USA
Nong, Y. (2003). The Handbook of Data Mining, Lawrence Erlbaum
Associates, Mahwah, New Jersey
Robu, R.; Hora, C. & Stoicu-Tivadar, V. (2010). Improving the
Classify User Interface in WEKA Explorer, Annals of DAAAM for 2010 &
Proceedings of the 21st International DAAAM Symposium, 20-23rd October
2010, Zadar, Croatia, ISSN 1726-9679, ISBN 978-3-901509-73-5, Katalinic,
B. (Ed.), pp. 0171-0172, Published by DAAAM International, Vienna
Tab. 1. The number of correct predictions obtained when using
the entire dataset
pr1    pr2    pr3    pr1+2    pr1+2+3
292    186    215    478      693
Tab. 2. The number of correct predictions obtained by applying
Page Rank only on sessions that contain the current page
pr1    pr2    pr3    pr1+2    pr1+2+3
516    320    303    836      1139