摘要:One way to increase the understanding of texts by machines is through adding semantic information to lexical items by including metadata tags, a process also called semantic annotation. There are several semantic aspects that can be added to the words, among them the information about the nature of the concept denoted through the association with a category of an ontology. The application of ontologies in the annotation task can span multiple domains. However, this particular research focused its approach on top-level ontologies due to its generalizing characteristic. Considering that annotation is an arduous task that demands time and specialized personnel to perform it, much is done on ways to implement the semantic annotation automatically. The use of machine learning techniques are the most effective approaches in the annotation process. Another factor of great importance for the success of the training process of the supervised learning algorithms is the use of a sufficiently large corpus and able to condense the linguistic variance of the natural language. In this sense, this article aims to present an automatic approach to enrich documents from the American English corpus through a CRF model for semantic annotation of ontologies from Schema.org top-level. The research uses two approaches of the model obtaining promising results for the development of semantic annotation based on top-level ontologies. Although it is a new line of research, the use of top-level ontologies for automatic semantic enrichment of texts can contribute significantly to the improvement of text interpretation by machines.
关键词:information extraction; semantic annotation; ontology; condition random fields information extraction ; semantic annotation ; ontology ; condition random fields