文章基本信息

标题：A Character String-Based Stemming for Morphologically Derivative Languages
本地全文：下载
作者：Gvzelnur Imin ; Mijit Ablimit ; Hankiz Yilahun 等
期刊名称：Information
电子版ISSN：2078-2489
出版年度：2022
卷号：13
期号：4
页码：170
DOI：10.3390/info13040170
语种：English
出版社：MDPI Publishing
摘要：Morphologically derivative languages form words by fusing stems and suffixes, stems are important to be extracted in order to make cross lingual alignment and knowledge transfer. As there are phonetic harmony and disharmony when linguistic particles are combined, both phonetic and morphological changes need to be analyzed. This paper proposes a multilingual stemming method that learns morpho-phonetic changes automatically based on character based embedding and sequential modeling. Firstly, the character feature embedding at the sentence level is used as input, and the BiLSTM model is used to obtain the forward and reverse context sequence, and the attention mechanism is added to this model for weight learning, and the global feature information is extracted to capture the stem and affix boundaries; finally CRF model is used to learn more information from sequence features to describe context information more effectively. In order to verify the effectiveness of the above model, the model in this paper is compared with the traditional model on two different data sets of three derivative languages: Uyghur, Kazakh and Kirghiz. The experimental results show that the model in this paper has the best stemming effect on multilingual sentence-level datasets, which leads to more effective stemming. In addition, the proposed model outperforms other traditional models, and fully consider the data characteristics, and has certain advantages with less human intervention.