Journal: International Journal of Electrical and Computer Engineering
Electronic ISSN: 2088-8708
Publication year: 2015
Volume: 5
Issue: 3
Pages: 409-420
DOI: 10.11591/ijece.v5i3.pp409-420
Language: English
Publisher: Institute of Advanced Engineering and Science (IAES)
Abstract: Text documents are an important source of information and knowledge. Most of the knowledge needed in various domains for different purposes is in the form of implicit content. The content of a text is represented by keyphrases, which consist of one or more meaningful words. Keyphrases can be extracted from text through several processing steps, including text preprocessing. An Annotated Suffix Tree (AST) built from the document collection itself is used to extract keyphrases after basic text preprocessing, which includes stop-word removal and stemming, has been applied. A combination of four preprocessing variations is used. The extracted two-word (bi-word) and three-word phrases form a list of candidate keyphrases that can help users who need keyphrase information to understand the content of the documents. The candidate keyphrases can be processed further by a learning process, with manual validation, to classify them as keyphrases or non-keyphrases for the text domain. Experiments on a simulation corpus, whose keyphrases are determined from the corpus itself, show that more than 90% of two- and three-word keyphrases can be extracted; on a real corpus from the economy domain, about 70% of keyphrases or meaningful phrases can be extracted. The proposed method can be an effective way to find candidate keyphrases in a collection of text documents: it can reduce the number of non-keyphrases or non-meaningful phrases in the candidate list, and it can detect keyphrases that are separated by stop words.
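The extraction idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word-level suffix tree limited to depth three, annotates each node with the number of times its word sequence occurs, and omits stemming. The stop-word list, the `min_count` threshold, and all function names are hypothetical choices made for illustration.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "of", "a", "an", "is", "in", "and", "to", "for"}


def preprocess(text, remove_stop_words=True):
    """Lowercase the text, keep alphabetic tokens, optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


class Node:
    """Tree node annotated with the frequency of the word sequence leading to it."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ast(token_lists, max_depth=3):
    """Insert every word-level suffix (truncated to max_depth) of every document."""
    root = Node()
    for tokens in token_lists:
        for start in range(len(tokens)):
            node = root
            for word in tokens[start:start + max_depth]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root


def candidates(root, length, min_count=2):
    """Return (phrase, count) pairs of exactly `length` words seen at least min_count times."""
    found = []

    def walk(node, path):
        if len(path) == length:
            if node.count >= min_count:
                found.append((" ".join(path), node.count))
            return
        for word, child in node.children.items():
            walk(child, path + [word])

    walk(root, [])
    # Most frequent first, ties broken alphabetically.
    return sorted(found, key=lambda item: (-item[1], item[0]))


docs = [
    "The rate of inflation is high",
    "Inflation rate in the economy",
    "Rate of inflation and the economy",
]
tree = build_ast([preprocess(d) for d in docs])
print(candidates(tree, 2))               # two-word candidates
print(candidates(tree, 3, min_count=1))  # three-word candidates
```

Because stop words are removed before the tree is built, "rate of inflation" and "rate inflation" both contribute to the same two-word candidate, which is one way to read the abstract's claim that keyphrases separated by stop words can be detected.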