Article information

Title: Bootstrapping Lexical Knowledge from Unsegmented Text using Graph Kernels
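Authors: Masato HAGIWARA ; Yasuhiro OGAWA ; Katsuhiko TOYAMA et al.
Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
Print ISSN: 1346-0714
Electronic ISSN: 1346-8030
Publication year: 2011
Volume: 26
Issue: 3
Pages: 440-450
DOI: 10.1527/tjsai.26.440
Publisher: The Japanese Society for Artificial Intelligence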
Abstract: Extraction of named entity classes and their relationships from large corpora often involves morphological analysis of target sentences and tends to suffer from out-of-vocabulary words. In this paper we propose a semantic category extraction algorithm called Monaka and its graph-based extension g-Monaka, both of which use character n-gram based patterns as context to directly extract semantically related instances from unsegmented Japanese text. These algorithms also use "bidirectional adjacent constraints," which state that reliable instances should be placed between reliable left and right context patterns, in order to improve proper segmentation. The Monaka algorithms use iterative induction of instances and patterns, similarly to the bootstrapping algorithm Espresso. The g-Monaka algorithm further formalizes the adjacency relation of character n-grams as a directed graph and applies the von Neumann kernel and the Laplacian kernel so that the negative effect of semantic drift, i.e., the phenomenon of semantically unrelated general instances being extracted, is reduced. The experiments show that g-Monaka substantially increases the performance of semantic category acquisition compared to conventional methods, including distributional similarity, bootstrapping-based Espresso, and its graph-based extension g-Espresso, in terms of F-value on the NE category task from unsegmented Japanese newspaper articles.
Keywords: bootstrapping ; named entity extraction ; semantic category ; unsegmented text ; link analysis ; graph kernel
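
As a rough illustration of the two graph kernels named in the abstract, the following Python sketch computes a von Neumann kernel and a regularized Laplacian kernel on a small toy graph and uses them to score nodes against a seed. The adjacency matrix `A`, the diffusion parameter `beta`, the seed vector, and both function names are assumptions made for illustration. The formulas are the commonly used definitions of these kernels; this record does not confirm that the paper uses exactly these forms, and the toy graph is undirected, unlike the directed n-gram graph the abstract describes.

```python
import numpy as np

def von_neumann_kernel(A, beta):
    """K = A * sum_{n>=0} (beta*A)^n = A (I - beta*A)^{-1}.

    The series converges only if beta < 1 / spectral_radius(A).
    """
    A = np.asarray(A, dtype=float)
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    if rho > 0 and beta >= 1.0 / rho:
        raise ValueError("beta must be smaller than 1 / spectral radius of A")
    identity = np.eye(A.shape[0])
    return A @ np.linalg.inv(identity - beta * A)

def regularized_laplacian_kernel(A, beta):
    """R = (I + beta*L)^{-1}, with L = D - A the combinatorial graph Laplacian."""
    A = np.asarray(A, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(A.shape[0]) + beta * laplacian)

# Hypothetical toy graph: 4 nodes, symmetric adjacency (not taken from the paper).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
beta = 0.1                      # assumed diffusion parameter
seed = np.array([1, 0, 0, 0])   # node 0 plays the role of a seed instance

scores_vn = von_neumann_kernel(A, beta) @ seed
scores_lap = regularized_laplacian_kernel(A, beta) @ seed
print("von Neumann scores:", scores_vn)
print("Laplacian scores:  ", scores_lap)
```

The Laplacian variant subtracts each node's degree on the diagonal, which penalizes highly connected nodes; in graph-based bootstrapping this is the usual argument for why it can dampen the pull of generic, high-frequency items.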