Abstract: Most traditional scientific papers are unstructured documents and therefore cannot readily support structured retrieval, statistical classification, association analysis, and other high-level applications. Extracting and analyzing the structured information in such papers remains a challenging problem. This paper proposes a structured information extraction algorithm for unstructured and semi-structured machine-readable documents. After analyzing the basic structure and format features of traditional scientific papers, the algorithm learns extraction rules from these features and applies them to extract the title, authors, abstract, keywords, body text, and other elements from unstructured documents such as Word files. It then exports the resulting structured text in the format required by multi-dimensional scientific papers, which meets the requirements of structured retrieval, statistical classification, and other high-level applications of scientific papers.
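As an illustration of the rule-based extraction step described above, the following is a minimal sketch only. It assumes the document has already been converted to a list of plain-text lines, and it replaces the learned rules (which the abstract does not specify) with a few hand-written, hypothetical ones: the first non-empty line is the title, the second is the author list, and the abstract and keywords are found by their labels. The function name `extract_structure` and the output field names are assumptions made for this example.

```python
import re

# Hypothetical label patterns standing in for the learned extraction rules,
# which the abstract does not specify.
ABSTRACT_RE = re.compile(r"^\s*(abstract|摘要)\s*[::]\s*", re.IGNORECASE)
KEYWORDS_RE = re.compile(r"^\s*(key\s*words|关键词)\s*[::]\s*", re.IGNORECASE)


def extract_structure(lines):
    """Split the plain-text lines of a paper into labelled fields."""
    # Drop blank lines and surrounding whitespace.
    rows = [ln.strip() for ln in lines if ln.strip()]
    record = {"title": "", "authors": [], "abstract": "", "keywords": [], "body": []}
    if not rows:
        return record

    # Assumed rule: the first non-empty line is the title.
    record["title"] = rows[0]

    for i, row in enumerate(rows[1:], start=1):
        if ABSTRACT_RE.match(row):
            # Remove the label and keep the rest of the line as the abstract.
            record["abstract"] = ABSTRACT_RE.sub("", row, count=1)
        elif KEYWORDS_RE.match(row):
            # Keywords are assumed to be separated by commas or semicolons.
            parts = re.split(r"[;,;,]", KEYWORDS_RE.sub("", row, count=1))
            record["keywords"] = [p.strip() for p in parts if p.strip()]
        elif i == 1:
            # Assumed rule: the line right after the title lists the authors.
            record["authors"] = [a.strip() for a in re.split(r"[,;]", row) if a.strip()]
        else:
            # Everything else is treated as body text.
            record["body"].append(row)
    return record


sample = [
    "A Study of X",
    "Li Ming, Wang Fang",
    "",
    "Abstract: We study X.",
    "Keywords: extraction; structure",
    "1 Introduction",
    "Text here.",
]
result = extract_structure(sample)
```

The returned dictionary is one possible structured form; a real system would map these fields to whatever schema the target multi-dimensional paper format requires, and would rely on format features (font, position, style) rather than labels alone.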