Journal: International Journal of Advanced Research In Computer Science and Software Engineering
Print ISSN: 2277-6451
Electronic ISSN: 2277-128X
Publication year: 2012
Volume: 2
Issue: 8
Publisher: S.S. Mishra
Abstract: This work introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied to different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and we achieved significant results.
Keywords: Web mining; hidden topics; classification; sparse data