
Article information

  • Title: Digital Library of Online PDF Sources: An ETL Approach
  • Authors: Gohar Zaman; Hairulnizam Mahdin; Khalid Hussain
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year of publication: 2020
  • Volume: 20
  • Issue: 10
  • Pages: 141-149
  • DOI: 10.22937/IJCSNS.2020.20.10.19
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: Day-to-day web usage shows that a huge number of PDF sources are uploaded on a daily basis. For example, several scientific societies and publishers, such as IEEE, ACM, Elsevier, and Springer, publish volumes of articles and periodicals. Most of these resources are unstructured or semi-structured, which makes it difficult to search them and retrieve information. In this paper, an effective model for digital library creation is proposed, originally motivated by an automated ontological information extraction framework (OFIE). The framework takes a published PDF paper and extracts its structural information, such as title, authors, abstract, funding information, table of contents, and references, with the help of a fuzzy rule-based system (FRBS) and a word sense disambiguation (WSD) approach. This extracted information is then converted to RDF triples. The proposed scheme takes the extracted information and converts it into a digital library stored in an MS-SQL database by means of an Extract, Transform and Load (ETL) process. This digital library can be an institute's library or the library of an individual scholar who is interested in organizing their downloaded PDF files for better search and retrieval. Moreover, through a front-end design based on SQL queries, the information can be searched, retrieved, and exported in the form of reports.
  • Keywords: Ontology; Digital Library; ETL; SQL; RDF; OFIE; FRBS
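
The ETL step described in the abstract (RDF triples in, relational tables out, SQL queries on top) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the triples arrive as plain (subject, predicate, object) tuples, uses Dublin Core style predicate names (`dc:title`, `dc:creator`, `dc:date`) as placeholders, invents a two-table schema (`paper`, `author`), and substitutes Python's built-in `sqlite3` for the MS-SQL database so that it runs without a server. All function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); the predicate
# names are assumptions, not the vocabulary used by the OFIE framework.
TRIPLES = [
    ("paper:1", "dc:title", "Digital Library of Online PDF Sources: An ETL Approach"),
    ("paper:1", "dc:creator", "Gohar Zaman"),
    ("paper:1", "dc:creator", "Hairulnizam Mahdin"),
    ("paper:1", "dc:creator", "Khalid Hussain"),
    ("paper:1", "dc:date", "2020"),
]


def extract(triples):
    """Extract: group the objects of each triple by subject and predicate."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    return grouped


def transform(grouped):
    """Transform: flatten grouped triples into paper rows and author rows."""
    papers, authors = [], []
    for subject, props in grouped.items():
        title = props.get("dc:title", [None])[0]
        year = props.get("dc:date", [None])[0]
        papers.append((subject, title, int(year) if year else None))
        authors.extend((subject, name) for name in props.get("dc:creator", []))
    return papers, authors


def load(conn, papers, authors):
    """Load: create the (assumed) schema and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paper "
        "(id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS author "
        "(paper_id TEXT REFERENCES paper(id), name TEXT)"
    )
    conn.executemany("INSERT INTO paper VALUES (?, ?, ?)", papers)
    conn.executemany("INSERT INTO author VALUES (?, ?)", authors)
    conn.commit()


def search_by_author(conn, name):
    """Front-end style query: titles and years of papers by a matching author."""
    rows = conn.execute(
        "SELECT p.title, p.year FROM paper p "
        "JOIN author a ON a.paper_id = p.id "
        "WHERE a.name LIKE ? ORDER BY p.year",
        (f"%{name}%",),
    )
    return rows.fetchall()


def build_library(triples):
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    papers, authors = transform(extract(triples))
    load(conn, papers, authors)
    return conn


if __name__ == "__main__":
    library = build_library(TRIPLES)
    print(search_by_author(library, "Mahdin"))
```

Keeping authors in a separate table lets one paper carry any number of creators, which matches the one-to-many shape of repeated `dc:creator` triples. Porting the sketch to MS-SQL would mainly mean swapping the connection object and adjusting the column types.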