
Basic Article Information

  • Title: A Character String-Based Stemming for Morphologically Derivative Languages
  • Authors: Gvzelnur Imin ; Mijit Ablimit ; Hankiz Yilahun
  • Journal: Information
  • Electronic ISSN: 2078-2489
  • Publication year: 2022
  • Volume: 13
  • Issue: 4
  • Pages: 170
  • DOI: 10.3390/info13040170
  • Language: English
  • Publisher: MDPI Publishing
  • Abstract: Morphologically derivative languages form words by fusing stems and suffixes, so extracting stems is important for cross-lingual alignment and knowledge transfer. Because combining linguistic particles produces both phonetic harmony and disharmony, phonetic as well as morphological changes need to be analyzed. This paper proposes a multilingual stemming method that learns morpho-phonetic changes automatically, based on character-based embedding and sequential modeling. First, character feature embeddings at the sentence level are used as input. A BiLSTM model then produces the forward and backward context sequences, and an attention mechanism added to this model learns weights and extracts global feature information to capture the stem and affix boundaries. Finally, a CRF model learns further information from the sequence features so that context is described more effectively. To verify its effectiveness, the proposed model is compared with traditional models on two different datasets for three derivative languages: Uyghur, Kazakh, and Kirghiz. The experimental results show that the proposed model achieves the best stemming performance on the multilingual sentence-level datasets. It also outperforms the traditional models, takes the data characteristics fully into account, and requires less human intervention.
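
The abstract frames stemming as character-level sequence labeling, in which a tagger marks where the stem ends and the affixes begin. The sketch below illustrates only that framing, not the paper's model: the two-label scheme (`S` for stem characters, `A` for affix characters), the function names, and the Latin-script toy word are assumptions made for illustration, and the sketch ignores the morpho-phonetic changes that the proposed method is designed to learn.

```python
from typing import List


def word_to_tags(word: str, stem_len: int) -> List[str]:
    """Label each character 'S' (stem) or 'A' (affix).

    Hypothetical labeling scheme: the first `stem_len` characters are
    treated as the stem, the remainder as suffix material.
    """
    if not 0 < stem_len <= len(word):
        raise ValueError("stem_len must be between 1 and len(word)")
    return ["S"] * stem_len + ["A"] * (len(word) - stem_len)


def tags_to_stem(word: str, tags: List[str]) -> str:
    """Recover the stem as the leading run of 'S'-tagged characters."""
    if len(tags) != len(word):
        raise ValueError("word and tags must have the same length")
    stem_chars = []
    for ch, tag in zip(word, tags):
        if tag != "S":
            break  # first affix character marks the stem boundary
        stem_chars.append(ch)
    return "".join(stem_chars)


# Toy example (assumed segmentation): "kitablar" = "kitab" + "lar"
tags = word_to_tags("kitablar", 5)
stem = tags_to_stem("kitablar", tags)
```

In a trained system, the tags would come from the sequence model's predictions rather than from a known stem length; `tags_to_stem` shows how a predicted boundary would be turned back into a stem string.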