Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2020
Volume: 98
Issue: 23
Pages: 3741-3756
Publisher: Journal of Theoretical and Applied
Abstract: In recent years, the emergence of various social network platforms has led to a massive amount of data being continuously generated and shared. Most of this data is unstructured and contains information that could be crucial and valuable if analyzed. Making effective use of unstructured data is a tedious and labor-intensive task. Information extraction is an ongoing research area that aims to extract potentially useful information from voluminous data. Several techniques and methods for information extraction have been proposed to understand the content and context of available unstructured data at the low-level structure. However, few studies have investigated the challenges of Named Entity Recognition and Classification (NERC), one of the main subtasks of information extraction, on unstructured Malay data. This paper therefore presents a comprehensive review of existing NERC techniques for processing unstructured Malay data, along with their limitations and challenges. The contributions of this paper are twofold. First, it presents an overview of prior studies on NERC techniques for unstructured Malay data. Second, it scrutinizes the limitations and challenges of these existing techniques arising from the volume, dimensionality, and heterogeneity of unstructured Malay data. The findings show that most previous studies using a machine learning-based approach produced more satisfactory results than those using a rule-based approach. Furthermore, challenges such as the morphological differences between Malay and resource-rich languages such as English, the limited availability of Malay corpora and annotated Malay text, and ambiguities in Malay text could affect the performance of Malay NERC systems and should be carefully considered during system design.
Keywords: Information Extraction; Malay Language; Named Entity Recognition and Classification; Unstructured Data