
Article Information

  • Title: BAHASA INDONESIA TEXT CORPUS GENERATION USING WEB CORPORA APPROACHES
  • Authors: AMALIA AMALIA ; OPIM SALIM SITOMPUL ; ERNA BUDHIARTI NABABAN
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Publication year: 2019
  • Volume: 97
  • Issue: 24
  • Pages: 3809-3821
  • Publisher: Journal of Theoretical and Applied
  • Abstract: A text corpus is a collection of texts stored electronically to support research and investigation in Natural Language Processing (NLP). Bahasa Indonesia is the official language of Indonesia, used by around 230 million people. However, NLP for Bahasa Indonesia has not developed as quickly as NLP for English. One contributing factor is that Bahasa Indonesia still has limited linguistic resources such as corpora. This study aims to generate a general-purpose Bahasa Indonesia text corpus for NLP using a web corpora approach. We collected texts from the seven largest Indonesian online news sites, covering various categories across 52 URLs. The research stages comprise resource observation, web structure analysis, website crawling, scraping, and data cleaning. In the final step, the clean data, a collection of sentences, is arranged into a machine-readable format. In this study, the success rate for crawling content from the resources was 85.85%, or 569,456 news articles, with 219,392 distinct tokens. We conclude that web corpora approaches can produce a text corpus for Bahasa Indonesia.
  • Keywords: Natural Language Processing (NLP); Corpus; Bahasa Indonesia; Web Corpora; Scrapy
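
The cleaning and token-counting stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no code, and the crawling itself (the keywords mention Scrapy) is left out. The names `TextExtractor`, `clean_html`, `split_sentences`, and `distinct_tokens`, the naive punctuation-based sentence splitter, the lowercase alphabetic tokenizer, and the sample HTML are all assumptions made for illustration, using only the Python standard library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw_html):
    """Strip markup from a scraped page and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def distinct_tokens(sentences):
    """Return the set of lowercase alphabetic tokens across all sentences."""
    tokens = set()
    for sentence in sentences:
        tokens.update(re.findall(r"[a-z]+", sentence.lower()))
    return tokens


if __name__ == "__main__":
    # Hypothetical page content, not taken from the study's data.
    sample_html = (
        "<html><body><script>var x = 1;</script>"
        "<p>Saya membaca buku. Buku itu bagus!</p></body></html>"
    )
    sentences = split_sentences(clean_html(sample_html))
    for sentence in sentences:
        print(sentence)  # one sentence per line, a simple machine-readable layout
    print(len(distinct_tokens(sentences)))
```

A real pipeline would need site-specific selectors to isolate the article body and a more careful sentence splitter, since abbreviations and numbers written with periods would break the naive rule used here.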