Journal: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)
Publisher: Shri Pannalal Research Institute of Technology
Abstract: Unstructured documents are documents containing information that either has no predefined data model or is not organized in a predefined manner, i.e., informal descriptions. They are text-heavy but also contain dates, numbers, and facts. Common techniques for structuring text involve manual tagging with metadata. Manual annotation is difficult for several reasons: the annotator must be familiar with the domain of the document of interest, preliminary training and guidelines are necessary for each annotation task, and the process is time-consuming and error-prone. Examples of unstructured data are books, journals, documents, the body of an e-mail, notes, and data from technical surveys. This paper describes some of the difficulties of working with unstructured text collections and methods to overcome them. To gain a competitive advantage in processing unstructured data, attributes have to be generated not only for single terms but also for combined terms.
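The closing point of the abstract, generating attributes for combined terms as well as single terms, can be illustrated with a minimal sketch. It assumes that a "combined term" can be approximated by a word n-gram (here, pairs of adjacent words) and that an "attribute" is a simple occurrence count. The function name `term_attributes`, the `max_n` parameter, and the tokenization rule are hypothetical choices made for illustration, not the method described in the paper.

```python
import re
from collections import Counter


def term_attributes(text, max_n=2):
    """Count single terms and combined terms (word n-grams up to max_n).

    Assumption: combined terms are approximated by adjacent-word n-grams;
    the paper itself may define them differently.
    """
    # Lowercase and keep alphanumeric runs as tokens (a deliberately simple tokenizer).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text and count each n-gram.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = term_attributes(
    "Unstructured data needs structure; unstructured data is everywhere."
)
```

In this sketch, `features["data"]` counts the single term, while `features["unstructured data"]` counts the combined term, so both kinds of attribute end up in the same table and can be fed to a downstream step.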