Basic article information

Title: Semantic Tagging for Documents Using 'Short Text' Information
Authors: Ayush Singhal; Jaideep Srivastava
Journal: Computer Science & Information Technology
Electronic ISSN: 2231-5403
Publication year: 2014
Volume: 4
Issue: 5
Pages: 337-350
DOI: 10.5121/csit.2014.4534
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: Tagging documents with relevant and comprehensive keywords offers invaluable assistance to readers who need a quick overview of a document. With the ever-increasing volume and variety of documents published on the internet, interest in developing new and successful techniques for annotating (tagging) documents is also increasing. However, an interesting challenge in document tagging arises when the full content of the document is not readily accessible. In such a scenario, techniques that use “short text”, e.g., a document title or a news article headline, to annotate the entire article are particularly useful. In this paper, we propose a novel approach to automatically tag documents with relevant tags or key-phrases using only “short text” information from the documents. We employ crowd-sourced knowledge from Wikipedia, Dbpedia, Freebase, Yago and similar open source knowledge bases to generate semantically relevant tags for the document. Using the intelligence from the open web, we prune out tags that create ambiguity in or “topic drift” from the main topic of our query document. We used a real-world dataset from a corpus of research articles to annotate 50 research articles. As a baseline, we used the full text information from the document to generate tags. The proposed and baseline approaches were compared using the author-assigned keywords for the documents as the ground truth information. We found that the tags generated using the proposed approach are better than those generated using the baseline in terms of overlap with the ground truth tags, measured via the Jaccard index (0.058 vs. 0.044). In terms of computational efficiency, the proposed approach is at least 3 times faster than the baseline approach. Finally, we qualitatively analyse the quality of the predicted tags for a few samples in the test corpus. The evaluation shows the effectiveness of the proposed approach both in terms of the quality of the tags generated and the computational time.
Keywords: Semantic annotation; open source knowledge; wisdom of crowds; tagging.
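
The abstract scores predicted tags against author-assigned keywords with the Jaccard index, i.e. the size of the intersection of the two tag sets divided by the size of their union. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `jaccard_index`, the case and whitespace normalisation, and the `predicted` tag list are assumptions made for illustration; only the author keywords come from this record.

```python
def jaccard_index(predicted_tags, ground_truth_tags):
    """Return |A ∩ B| / |A ∪ B| for two collections of tags."""
    # Assumption: tags are compared case-insensitively after trimming
    # whitespace. The abstract does not say how tags were normalised.
    a = {tag.strip().lower() for tag in predicted_tags}
    b = {tag.strip().lower() for tag in ground_truth_tags}
    union = a | b
    if not union:
        # Convention chosen here: two empty tag sets score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical predicted tags, invented for this example.
predicted = ["semantic annotation", "tagging", "wikipedia", "text mining"]

# Author-assigned keywords listed in this record.
author_keywords = [
    "Semantic annotation",
    "open source knowledge",
    "wisdom of crowds",
    "tagging",
]

score = jaccard_index(predicted, author_keywords)
print(f"Jaccard index: {score:.3f}")
```

Exact string matching of this kind is strict: a predicted tag such as "crowd wisdom" would not count as overlapping with "wisdom of crowds". That strictness may be one reason the reported averages (0.058 vs. 0.044) are small in absolute terms, although the abstract does not state the matching rule used.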