Basic Article Information

  • Title: Cost-Sensitive Topical Data Acquisition from the Web
  • Authors: Mahdi Naghibi; Reza Anvari; Ali Forghani
  • Journal: International Journal of Data Mining & Knowledge Management Process
  • Print ISSN: 2231-007X
  • Electronic ISSN: 2230-9608
  • Publication year: 2019
  • Volume: 9
  • Issue: 2/3
  • Pages: 39-56
  • DOI: 10.5121/ijdkp.2019.9304
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: The cost of acquiring training data instances for the induction of data mining models is one of the main concerns in real-world problems. The web is a comprehensive source of many types of data that can be used for data mining tasks, but its distributed and dynamic nature dictates the use of solutions that can handle these characteristics. In this paper, we introduce an automatic method for topical data acquisition from the web. We propose a new type of topical crawler that uses a hybrid link context extraction method to acquire on-topic web pages with minimum bandwidth usage and at the lowest cost. The new link context extraction method, called Block Text Window (BTW), combines a text window method with a block-based method and overcomes the challenges of each method by using the advantages of the other. Experimental results show the superiority of BTW over state-of-the-art automatic topical web data acquisition methods on standard metrics.
  • Keywords: Cost-Sensitive Learning; Data Acquisition; Topical Crawler; Link Context; Web Data Mining