Journal: International Journal of Computer Science Research and Application
Print ISSN: 2012-9564
Electronic ISSN: 2012-9572
Publication year: 2012
Volume: 2
Issue: 3
Publisher: INREWI Publications
Abstract: The growth of the Internet and related tools has brought an exponential increase in online resources. Paradoxically, this growth has made finding, extracting and aggregating relevant information harder. Finding and browsing news is now one of the most important Internet activities. This paper presents a hybrid method for extracting the content of online news articles. The method combines RSS feeds with HTML Document Object Model (DOM) tree extraction. The approach is simple and effectively addresses the heterogeneous news layouts and changing content that trouble many existing methods. Experimental results on selected news sites show that the approach extracts news article content automatically, effectively and consistently. The proposed method can also be adopted for other news sites.
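
The abstract does not describe how the RSS and DOM steps are implemented, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes an RSS 2.0 feed supplies each article's title and link, and that the article body is the element whose direct `<p>` children hold the most text; that heuristic is an assumption made here, not a detail taken from the paper. All names (`rss_items`, `TreeBuilder`, `extract_article`, `extract_news`, the `fetch` callback) and the sample feed, page and URLs are hypothetical. Only the Python standard library is used, and the page is read from an in-memory dictionary rather than the network.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>First headline</title><link>http://news.example/a1</link></item>
  <item><title>Second headline</title><link>http://news.example/a2</link></item>
</channel></rss>"""

SAMPLE_HTML = """<html><body>
<div id="nav"><p>Home | World</p></div>
<div id="story">
  <h1>First headline</h1>
  <p>Officials <b>confirmed</b> the plan on Monday.</p>
  <p>Work is expected to begin next year.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>"""

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def rss_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title", default="").strip(),
             item.findtext("link", default="").strip())
            for item in root.iter("item")]


class Node:
    """A minimal DOM-like node; children mixes Node objects and text strings."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Build a simple element tree from HTML (assumed reasonably well-formed)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:  # nearest open match
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    """Concatenate all text under a node, in document order."""
    return "".join(c if isinstance(c, str) else text_of(c)
                   for c in node.children)


def walk(node):
    """Yield a node and all of its descendant elements."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def extract_article(html_text):
    """Return paragraphs of the element whose direct <p> children hold most text."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    best, best_len = [], 0
    for node in walk(builder.root):
        paras = [" ".join(text_of(c).split()) for c in node.children
                 if isinstance(c, Node) and c.tag == "p"]
        paras = [p for p in paras if p]
        total = sum(len(p) for p in paras)
        if total > best_len:
            best, best_len = paras, total
    return best


def extract_news(rss_text, fetch):
    """Combine both steps; fetch is a caller-supplied URL -> HTML function."""
    return [{"title": title, "link": link,
             "paragraphs": extract_article(fetch(link))}
            for title, link in rss_items(rss_text)]


# Stand-in for an HTTP client: unknown URLs yield an empty page.
PAGES = {"http://news.example/a1": SAMPLE_HTML}
articles = extract_news(SAMPLE_RSS, lambda url: PAGES.get(url, ""))
for article in articles:
    print(article["title"], article["paragraphs"])
```

In this sketch the feed gives a stable list of article links regardless of the site's front-page layout, and the tree step only has to locate the body within each linked page. A real deployment would replace the `fetch` stand-in with an HTTP client and would likely need a more tolerant HTML parser for malformed pages.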