
Article information

  • Title: WEIDJ: Development of a new algorithm for semi-structured web data extraction
  • Authors: Ily Amalina Ahmad Sabri; Mustafa Man
  • Journal: TELKOMNIKA (Telecommunication Computing Electronics and Control)
  • Print ISSN: 2302-9293
  • Publication year: 2021
  • Volume: 19
  • Issue: 1
  • DOI: 10.12928/telkomnika.v19i1.16205
  • Language: English
  • Publisher: Universitas Ahmad Dahlan
  • Abstract: In the era of industrial digitalization, people are increasingly investing in solutions that support their processes for data collection, data analysis and performance improvement. This paper considers advancing web-scale knowledge extraction and alignment by integrating a few sources and exploring different methods of aggregation and attention, with a focus on image information. The main aim of data extraction from semi-structured data is to retrieve beneficial information from the web. Data from the deep web is retrievable, but only through a request made by form submission, because search engines cannot reach it. As HTML documents grow larger, the data extraction process has been plagued by lengthy processing times. In this research work, we propose an improved model, wrapper extraction of image using document object model (DOM) and JavaScript object notation (JSON) data (WEIDJ), in response to the promising results of mining a higher volume of images from various types of format. To observe the efficiency of WEIDJ, we compare its data extraction performance at different levels of page extraction with VIBS, MDR, DEPTA and VIDE. WEIDJ yielded the best results, with a Precision of 100, a Recall of 97.93103 and an F-measure of 98.9547.
  • Keywords: document object model; JavaScript object notation; web data extraction; wrapper extraction of image
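
The abstract describes collecting image information from a page's DOM into JSON and scoring the result with Precision, Recall and F-measure. The sketch below illustrates both ideas with the Python standard library only. It is not the WEIDJ algorithm, whose details the record does not give: the names `ImageCollector`, `extract_images_as_json` and `f_measure` are hypothetical, and the assumption that the reported scores are percentages combined by the usual harmonic-mean F-measure is ours.

```python
import json
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: walks the tags of an HTML page and keeps <img> records."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src"):  # skip images that carry no source URL
                self.images.append({"src": attr["src"], "alt": attr.get("alt") or ""})


def extract_images_as_json(html):
    """Return the page's image records as a JSON array string."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return json.dumps(collector.images)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    page = '<div><img src="a.png" alt="A"><p>text</p><img src="b.jpg"/></div>'
    print(extract_images_as_json(page))
    # Precision and Recall as reported in the abstract
    print(round(f_measure(100.0, 97.93103), 4))
```

Under that assumed definition, the Precision and Recall reported in the abstract reproduce the reported F-measure of 98.9547 to four decimal places.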
Copyright National Center for Philosophy and Social Sciences Documentation. All rights reserved.