Article Information

  • Title: Vectorization of Text Documents for Identifying Unifiable News Articles
  • Authors: Anita Kumari Singh; Mogalla Shashi
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Electronic ISSN: 2156-5570
  • Publication year: 2019
  • Volume: 10
  • Issue: 7
  • Pages: 305-310
  • DOI: 10.14569/IJACSA.2019.0100742
  • Publisher: Science and Information Society (SAI)
  • Abstract: Vectorization is essential for processing textual data in natural language processing applications: it enables machines to interpret textual content by converting it into meaningful numerical representations. The proposed work aims to identify unifiable news articles for multi-document summarization. A framework is introduced for identifying news articles related to top trending topics/hashtags and for multi-document summarization of unifiable news articles based on those topics, in order to capture opinion diversity. Text clustering is applied to the corpus of news articles related to each trending topic to obtain smaller unifiable groups. The effectiveness of several text vectorization methods, namely bag-of-words representations with tf-idf scores, word embeddings, and document embeddings, is investigated for clustering news articles with k-means. The paper presents a comparative analysis of these vectorization methods, in terms of purity, on documents from the DUC 2004 benchmark dataset.
  • Keywords: Vectorization; news articles; tf-idf; word embeddings; document embeddings; text clustering
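
The abstract describes a pipeline of tf-idf vectorization, k-means clustering, and evaluation by purity. The following is a minimal sketch of that pipeline using only the Python standard library. It is not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, the random centroid initialization, and the function names (`tfidf_vectors`, `kmeans`, `purity`) are all assumptions made for illustration, and the toy documents in the demo are invented rather than drawn from DUC 2004.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised sparse tf-idf vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed idf: an illustrative choice, not taken from the paper.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the old centroid if a cluster empties out
            total = Counter()
            for v in members:
                total.update(v)
            centroids[j] = {t: w / len(members) for t, w in total.items()}
    return labels


def purity(cluster_labels, true_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(true_labels)


if __name__ == "__main__":
    # Invented toy corpus with two topics; real experiments would use DUC 2004.
    docs = [
        "storm floods coastal town", "floods damage town homes",
        "team wins league title", "league title race heats up",
    ]
    topics = ["weather", "weather", "sport", "sport"]
    labels = kmeans(tfidf_vectors(docs), k=2)
    print("purity:", purity(labels, topics))
```

Purity ranges from the share of the largest class up to 1.0, and it rises trivially as the number of clusters grows, so comparisons between vectorization methods are only meaningful at a fixed k.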