Basic article information

Title: Exploratory analysis of semantic categories: comparing data-driven and human similarity judgments
Authors: Tiina Lindh-Knuutila; Timo Honkela et al.
Journal: Computational Cognitive Science
Electronic ISSN: 2195-3961
Publication year: 2015
Volume: 1
Issue: 1
Pages: 1-25
DOI: 10.1186/s40469-015-0001-1
Language: English
Publisher: Springer
Abstract: Background: In this article, automatically generated and manually crafted semantic representations are compared. The comparison takes place under the assumption that neither of these has a primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality or improving the coverage of hand-created semantic resources. Methods: We apply two unsupervised learning methods, Independent Component Analysis (ICA) and a probabilistic topic model at the word level using Latent Dirichlet Allocation (LDA), to create semantic representations from a large text corpus. We further compare the obtained results to two semantically labeled dictionaries. In addition, we use the Self-Organizing Map to visualize the obtained representations. Results: We show that both methods find a considerable amount of category information in an unsupervised way. Rather than only finding groups of similar words, they can automatically find a number of features that characterize words. The unsupervised methods are also used in exploration, where they provide findings that go beyond the manually predefined label sets. In addition, we demonstrate how the Self-Organizing Map visualization can be used in exploration and further analysis. Conclusion: This article compares unsupervised learning methods and semantically labeled dictionaries. We show that these methods are able to find categorical information. In addition, they can further be used in an exploratory analysis. In general, information-theoretically motivated and probabilistic methods provide results that are at a comparable level. Moreover, the automatic methods and human classifications give access to semantic categorizations that complement each other. Data-driven methods can furthermore be cost-effective and adapt to a particular domain through an appropriate choice of data sets.
Keywords: Text mining; Semantic modeling; Machine learning; Lexical meaning; Semantic similarity; Independent component analysis; Latent Dirichlet Allocation