Journal: International Journal of Computer Science and Network Security
Print ISSN: 1738-7906
Year: 2020
Volume: 20
Issue: 5
Pages: 150-157
Publisher: International Journal of Computer Science and Network Security
Abstract: This paper proposes a model that serves as a framework for preprocessing Arabic text from Twitter for data analysis and information extraction. The model collects Arabic text from Twitter online and stores it in a structured database. The source data are then preprocessed to derive clean, meaningful Arabic text from which information can be extracted. The paper presents new methods and algorithms for preprocessing unstructured Arabic text from social media, and it offers solutions to the difficulties of working with such text, including noisy, informal, and dialectal language. The preprocessed Arabic text is stored in structured database tables, providing a data set to which information selection and data analysis algorithms can be applied. The implementation of the model yields a full-featured dataset in which the text is available as the source data, the cleaned data, and separate Arabic words with their stems, roots, and morphologies, among other forms. In addition, the model shows how information can be selected and extracted from this dataset.
Keywords: Information retrieval; Natural Language Processing; Database; Data Analysis; Text Mining; Arabic Text
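
The pipeline summarized in the abstract (keep the source text, derive cleaned text, split it into words, store everything in structured tables) can be illustrated with a minimal sketch. This is not the authors' algorithm: the cleaning rules, the function names `clean_arabic`, `create_schema`, and `store_tweet`, the table names `tweets` and `words`, and the use of SQLite are all assumptions made for illustration. Stemming, root extraction, and dialect handling, which the paper also covers, are left out because they would require an external morphological analyzer.

```python
import re
import sqlite3

# Illustrative cleaning rules (assumptions, not taken from the paper).
URL_OR_MENTION = re.compile(r"https?://\S+|@[A-Za-z0-9_]+")
DIACRITICS_AND_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")  # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")            # alef with madda or hamza
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")                 # anything but Arabic letters and whitespace


def clean_arabic(text):
    """Return a normalized, letters-only version of an Arabic tweet."""
    text = URL_OR_MENTION.sub(" ", text)          # drop links and @mentions
    text = DIACRITICS_AND_TATWEEL.sub("", text)   # drop vowel marks and elongation
    text = ALEF_VARIANTS.sub("\u0627", text)      # normalize alef forms to bare alef
    text = NON_ARABIC.sub(" ", text)              # drop punctuation, digits, Latin, emoji
    return " ".join(text.split())                 # collapse whitespace


def create_schema(conn):
    """Create hypothetical tables for source text, cleaned text, and words."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS tweets (
            id          INTEGER PRIMARY KEY,
            source_text TEXT NOT NULL,
            clean_text  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS words (
            tweet_id INTEGER NOT NULL REFERENCES tweets(id),
            position INTEGER NOT NULL,
            word     TEXT NOT NULL,
            PRIMARY KEY (tweet_id, position)
        );
        """
    )


def store_tweet(conn, tweet_id, source_text):
    """Store the raw tweet, its cleaned form, and one row per word."""
    clean = clean_arabic(source_text)
    conn.execute(
        "INSERT INTO tweets (id, source_text, clean_text) VALUES (?, ?, ?)",
        (tweet_id, source_text, clean),
    )
    conn.executemany(
        "INSERT INTO words (tweet_id, position, word) VALUES (?, ?, ?)",
        [(tweet_id, i, w) for i, w in enumerate(clean.split())],
    )
    conn.commit()
    return clean


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    store_tweet(conn, 1, "@user مَرْحَباً بالعـــالم! https://t.co/abc #أهلا")
    for row in conn.execute("SELECT position, word FROM words ORDER BY position"):
        print(row)
```

Keeping the source text next to the cleaned text, and the words in their own table, mirrors the layered dataset the abstract describes. In a fuller version, columns for stem, root, and morphology could be added to the hypothetical `words` table.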