Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2019
Volume: 97
Issue: 24
Pages: 3809-3821
Publisher: Journal of Theoretical and Applied
Abstract: A text corpus is a collection of texts stored electronically for research and investigation in Natural Language Processing (NLP). Bahasa Indonesia is the official language of Indonesia, used by around 230 million people. However, NLP for Bahasa Indonesia has not developed as quickly as NLP for English. One contributing factor is that Bahasa Indonesia still has limited linguistic resources such as corpora. This study aims to generate a general-purpose Bahasa Indonesia text corpus for NLP using a web corpora approach. We collected texts from the seven largest Indonesian online news sites, covering various categories across 52 URLs. The research stages consist of resource observation, web structure analysis, website crawling, scraping, and data cleaning. In the last step, the clean data, a collection of sentences, is arranged into a machine-readable format. In this study, the crawl success rate was 85.85%, or 569,456 news articles, with 219,392 distinct tokens. It can be concluded that web corpora approaches can produce a text corpus for Bahasa Indonesia.
Keywords: Natural Language Processing (NLP); Corpus; Bahasa Indonesia; Web Corpora; Scrapy
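
The data cleaning and sentence collection stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (clean_html, split_sentences, distinct_tokens), the naive sentence-splitting rule, and the lowercase a-z tokenizer are all assumptions made for illustration, and only the Python standard library is used rather than Scrapy.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw_html):
    """Hypothetical cleaning step: drop markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def distinct_tokens(sentences):
    """Assumed tokenizer: lowercase runs of the letters a-z."""
    tokens = set()
    for sentence in sentences:
        tokens.update(re.findall(r"[a-z]+", sentence.lower()))
    return tokens
```

Calling clean_html on a scraped page, then split_sentences on the result, gives a list of sentences that could be written out one per line as a simple machine-readable format; distinct_tokens over that list gives a vocabulary count comparable in spirit to the distinct-token figure reported above.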