
Article Information

  • Title: A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings
  • Authors: Babić, Karlo ; Guerra, Francesco ; Martinčić-Ipšić, Sanda
  • Journal: Journal of Information and Organizational Sciences
  • Print ISSN: 1846-3312
  • Online ISSN: 1846-9418
  • Year of publication: 2020
  • Volume: 44
  • Issue: 2
  • Pages: 231-246
  • DOI: 10.31341/jios.44.2.2
  • Publisher: Faculty of Organization and Informatics, University of Zagreb
  • Abstract: Measuring the semantic similarity of texts plays a vital role in various natural language processing tasks. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We compare four models based on word embeddings: two variants of Word2Vec (one trained on a specific dataset and the other extending it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts from word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three of these methods as variations of two commonly used similarity measures: one is an extension of the cosine similarity based on centroids, and the other two are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of the Pearson and Spearman correlation. The results indicate that the extended methods perform better than the original ones in most cases.
  • Keywords: semantic similarity; short texts similarity; word embedding; Word2Vec; FastText; TF-IDF
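
The abstract refers to a cosine similarity based on centroids as one of the commonly used measures. A minimal sketch of that baseline idea is given here: each text is represented by the average of its word vectors, and two texts are compared by the cosine of the angle between their averages. This is not the authors' extended method, whose details are not given in this record. The function names, the whitespace tokenization, and the two-dimensional `TOY_EMBEDDINGS` table are assumptions made purely for illustration; real vectors would come from a trained Word2Vec or FastText model.

```python
import math

# Hypothetical toy vectors, for illustration only (not from the paper).
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "runs": [0.5, 0.5],
}


def centroid(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding.

    Returns None when no token is covered by the embedding table.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_similarity(text_a, text_b, embeddings):
    """Compare two short texts by the cosine of their word-vector centroids.

    Tokenization is a plain lowercase whitespace split (an assumption).
    Texts with no known words get a similarity of 0.0.
    """
    c_a = centroid(text_a.lower().split(), embeddings)
    c_b = centroid(text_b.lower().split(), embeddings)
    if c_a is None or c_b is None:
        return 0.0
    return cosine(c_a, c_b)


if __name__ == "__main__":
    print(centroid_similarity("cat runs", "dog runs", TOY_EMBEDDINGS))
    print(centroid_similarity("cat runs", "car runs", TOY_EMBEDDINGS))
```

With these toy vectors, "cat runs" scores closer to "dog runs" than to "car runs", because the vectors for "cat" and "dog" point in nearly the same direction. A plain average weights every word equally; weighting schemes such as TF-IDF or BM25 would change how much each word contributes to the text representation.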