
Article Information

  • Title: Improving Topic Coherence Using Entity Extraction Denoising
  • Authors: Ronald Cardenas; Kevin Bello; Alberto Coronado
  • Journal: The Prague Bulletin of Mathematical Linguistics
  • Print ISSN: 0032-6585
  • Online ISSN: 1804-0462
  • Publication year: 2018
  • Volume: 110
  • Issue: 1
  • Pages: 85-101
  • DOI: 10.2478/pralin-2018-0004
  • Language: English
  • Publisher: Walter de Gruyter GmbH
  • Abstract: Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method, and evaluating such models is an interesting problem in its own right. Topic interpretability measures have been developed in recent years as a more natural option for topic quality evaluation, emulating human perception of coherence with correlation scores over word sets. In this paper, we show experimental evidence that the topic coherence score improves when the training corpus is restricted to the relevant information in each document, as obtained by Entity Recognition. We experiment with job advertisement data and find that with this approach topic models improve interpretability by about 40 percentage points on average. Our analysis also reveals that, when the extracted text chunks are used, some redundant topics are merged while others are split into more skill-specific topics. Fine-grained topics observed in models using the whole text are preserved.
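
The two ideas in the abstract, scoring coherence through word co-occurrence and restricting each document to its extracted entities, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes NPMI over document-level co-occurrence as the coherence measure (the abstract does not name one), and `keep_entity_tokens` with its hand-written `entity_vocab` is a hypothetical stand-in for a real entity recognizer. The toy job ads are invented for the example.

```python
import math
from itertools import combinations


def keep_entity_tokens(doc, entity_vocab):
    """Hypothetical denoising step: keep only tokens marked as entities."""
    return [tok for tok in doc if tok in entity_vocab]


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words (document co-occurrence).

    Assumed measure for illustration; scores lie in [-1, 1].
    """
    if len(top_words) < 2:
        raise ValueError("need at least two words to score a topic")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # never co-occur: lower bound of NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # co-occur everywhere: avoid dividing by log(1)
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy job ads (invented), tokenized; entity_vocab is a hand-made assumption.
ads = [
    ["apply", "now", "python", "sql"],
    ["python", "sql", "benefits"],
    ["java", "team", "apply"],
    ["java", "sql", "now"],
]
entity_vocab = {"python", "sql", "java"}
denoised = [keep_entity_tokens(ad, entity_vocab) for ad in ads]
print(npmi_coherence(["python", "sql"], denoised))
```

In the setting the abstract describes, the topic model itself would be trained on the denoised documents and its top words then scored; the sketch only shows the filtering and scoring steps on a fixed word set.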