
Article Information

  • Title: Introducing syntactic features into a Bayesian classifier.
  • Authors: Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan
  • Journal: Annals of DAAAM & Proceedings
  • Print ISSN: 1726-9679
  • Year: 2008
  • Issue: January
  • Language: English
  • Publisher: DAAAM International Vienna
  • Abstract: Semantic classification can be considered the next level in what we like to call the future "Semantic Web", where information should be better classified and easier to access. Hence, the need for semantic processing technologies is of increasing importance for the future of the internet. A key challenge in the semantic web is the mapping between different concepts. Many techniques for such mapping exist, but most of them induce a one-to-one mapping, which does not seem to correspond to real-world problems. This project proposes a new approach, which tries to use the power of machine learning, and in particular classification algorithms, to solve the mapping task. Several techniques have been proposed over time, starting from Bayesian filters, KNN algorithms, Neural Networks, CNG (Contextual Network Graphs) (Ceglowski et al., 2003), LSI (Latent Semantic Indexing) (Deerwester et al., 2001), TF-IDF (Term Frequency-Inverse Document Frequency), NLP (Natural Language Processing) and many others, yet few use other information in their decision process besides statistics, synonymy and other relations between documents and word frequencies.

Introducing syntactic features into a Bayesian classifier.


Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan


1. INTRODUCTION

Semantic classification can be considered the next level in what we like to call the future "Semantic Web", where information should be better classified and easier to access. Hence, the need for semantic processing technologies is of increasing importance for the future of the internet. A key challenge in the semantic web is the mapping between different concepts. Many techniques for such mapping exist, but most of them induce a one-to-one mapping, which does not seem to correspond to real-world problems. This project proposes a new approach, which tries to use the power of machine learning, and in particular classification algorithms, to solve the mapping task. Several techniques have been proposed over time, starting from Bayesian filters, KNN algorithms, Neural Networks, CNG (Contextual Network Graphs) (Ceglowski et al., 2003), LSI (Latent Semantic Indexing) (Deerwester et al., 2001), TF-IDF (Term Frequency-Inverse Document Frequency), NLP (Natural Language Processing) and many others, yet few use other information in their decision process besides statistics, synonymy and other relations between documents and word frequencies.

Latent semantic indexing (LSI) is a vector-space technique that has shown great promise in addressing the problems of polysemy and synonymy in text collections, offering improved recall over standard keyword searches. One of the most striking features of LSI (Landauer et al., 2000) is the ability to return relevant results for queries even when there is no exact keyword match. Because the LSI vector model is a purely mathematical representation, the indexing technique is not limited to any particular language. (Jensen & Vikner, 1994)

A standard step in LSI is the creation of a term-document matrix (TDM), which is essentially a weighted lookup table of term frequency data for the entire document collection. In LSI, this matrix is interpreted as a high-dimensional vector space. An alternative interpretation of this matrix is possible, however, with the TDM representing a bipartite graph of term and document nodes where each non-zero value in the TDM corresponds to an edge connecting a term node to a document node. In this model, every term is connected to all of the documents in which the term appears, and every document has a link to each term contained in that document. The weighted frequency values in the TDM correspond to weights placed on the edges of the graph. We call this construct a contextual network graph. (Ceglowski et al., 2003)
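To make the two readings of the TDM concrete, the following minimal sketch (our illustration, with a toy corpus and raw term frequencies as placeholder weights, not the data or weighting used in this paper) builds a term-document matrix and then lists the same non-zero cells as edges of a bipartite term-document graph.

# Minimal sketch (not the paper's code): the same term-document matrix read
# first as document vectors (the LSI view) and then as a bipartite graph of
# term and document nodes (the contextual network graph view).
from collections import Counter

docs = {
    "d1": "the dog has gone",
    "d2": "quantum physics and classical mechanics",
    "d3": "the physics of mechanics",
}

# Term-document matrix: tdm[term][doc] = raw term frequency (placeholder weighting)
tdm = {}
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        tdm.setdefault(term, {})[doc_id] = freq

# LSI view: each document is a vector over the whole vocabulary.
vocabulary = sorted(tdm)
doc_vectors = {d: [tdm[t].get(d, 0) for t in vocabulary] for d in docs}

# Contextual network graph view: every non-zero cell is a weighted edge
# connecting a term node to a document node.
edges = [(term, doc_id, w) for term, row in tdm.items() for doc_id, w in row.items()]
print(len(vocabulary), len(edges))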

Even though these two methods have shown great results so far, we believe that introducing syntactic information as a feature in the decision process will significantly improve their accuracy. In this paper we experiment with semantic classification using a multi-category Bayesian filter that also includes syntactic information about the analyzed documents.

2. PROPOSED METHOD

The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

If we think of words as blocks with connectors coming out, then there are different types of connectors; connectors may also point to the right or to the left. A left-pointing connector connects with a right-pointing connector of the same type on another word. The two connectors together form a "link". Right-pointing connectors are labeled "+", left-pointing connectors are labeled "-". Words have rules about how their connectors can be connected up, that is, rules about what would constitute a valid use of that word. A valid sentence is one in which all the words present are used in a way which is valid according to their rules, and which also satisfies certain global rules. (Sleator & Temperley, 1991)
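As a purely illustrative toy example (our own sketch with a hypothetical lexicon, not the Link Grammar Parser's actual data structures or API), the connector mechanism can be pictured like this: each word carries a list of typed connectors, and a "+" connector on one word may link with a "-" connector of the same type on a later word.

# Toy illustration of link-grammar style connectors (hypothetical lexicon,
# not the real parser): "X+" on an earlier word links with "X-" on a later word.
lexicon = {
    "the":  ["D+"],          # determiner offers a D link to the right
    "dog":  ["D-", "S+"],    # noun needs a determiner on its left, offers a subject link
    "has":  ["S-", "PP+"],   # verb needs a subject on its left, offers a past-participle link
    "gone": ["PP-"],         # participle needs a PP link on its left
}

def can_link(right_connector, left_connector):
    """A right-pointing connector (X+) matches a left-pointing one (X-) of the same type."""
    return (right_connector.endswith("+")
            and left_connector.endswith("-")
            and right_connector[:-1] == left_connector[:-1])

print(can_link("D+", "D-"))   # True: "the" can attach to "dog" with a D link
print(can_link("S+", "PP-"))  # False: different link types do not match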

The structure assigned to a sentence by a link grammar is rather unlike that of any other grammatical system we know of (although it is certainly related to dependency grammar). Rather than thinking in terms of syntactic functions (like subject or object) or constituents (like "verb phrase"), one must think in terms of relationships between pairs of words. In the sentence "The dog has gone", for example, there is an "S" (subject) relation between "dog" and "has"; a "PP" (past-participle) relation between "has" and "gone"; and a "D" (determiner) relation between "the" and "dog". Parts of speech, syntactic functions, and constituents may nevertheless be recovered from a link structure rather easily. For example, whatever word is on the left end of an "S" link is the subject of a clause (or the head word of the subject phrase); whatever is on the right end is the finite verb; whatever is on the left end of a D link is a determiner; etc. (Grinberg et al., 1995)

The syntactic information can be represented as a prefix to each word included in a naive Bayesian dictionary. As an example, we use "sub_John" to represent that the word "John" is the subject of the phrase in which it appears (Lafferty et al., 1992). A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model" (Zhang, 2004).
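A minimal sketch of this prefixing scheme is given below. The syntactic_role() helper, the training texts and the category labels are hypothetical, and scikit-learn's multinomial naive Bayes is used here only as a stand-in for the paper's own Bayesian module.

# Sketch (assumptions, not the paper's implementation): tokens are prefixed with
# their syntactic role, e.g. "sub_John", and fed to a multinomial naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def syntactic_role(word, role):
    # Hypothetical stand-in for the Link Grammar pre-processing step.
    return f"{role}_{word}"

train_docs = [
    " ".join([syntactic_role("force", "sub"), syntactic_role("motion", "obj")]),
    " ".join([syntactic_role("electron", "sub"), syntactic_role("spin", "obj")]),
]
train_labels = ["mechanics", "quantum_physics"]

vectorizer = CountVectorizer(token_pattern=r"\S+")  # keep role_word tokens intact
X = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X, train_labels)

test = vectorizer.transform([syntactic_role("electron", "sub")])
print(model.predict(test))  # expected: ['quantum_physics']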

[FIGURE 1 OMITTED]

[FIGURE 2 OMITTED]

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.
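For reference, the textbook formulation the paragraph above alludes to (the standard multinomial naive Bayes decision rule with maximum-likelihood estimates, not a formula reproduced from this paper) is:

\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{P}(w \mid c) \;=\; \frac{\operatorname{count}(w, c)}{\sum_{w'} \operatorname{count}(w', c)}

where w_1, ..., w_n are the (possibly syntax-prefixed) tokens of a document, c ranges over the categories, and the conditional word probabilities are estimated by maximum likelihood from the per-category token counts.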

Statistical classification is probably the most popular text classification method used by semantic classifiers. Compared with a rule-based filter, it has the advantage of a learning algorithm (the more it classifies, the better its results become). Of course, being one of the pioneering text classification methods, different methods against this technology have been devised, but its mathematical grounds (e.g. the fact that the semantics of a word can be statistically established from the frequency of the documents in which it appears) still offer good results (Ceglowski et al., 2003). Our purpose was to add extra information to this classifier. Even though it is able to determine the semantics of a word statistically, we believe that this determination would be more accurate if we also knew the syntactic function of the analyzed word.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in many complex real-world situations than one might expect. Recently, careful analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix. (Zhang 2004)

It has also been shown that the Bayesian classifier performs quite well on semantically distinct categories, but lacks accuracy on similar ones. Our results show that introducing syntactic features increases accuracy and decreases misclassification on all documents. Basically, when using semantically distinct categories, a Bayesian module obtains very good results, but when a large part of these categories have common parts (like physics, quantum physics and mechanics) there is a large risk of misclassification. In other words, semantically equivalent classes will cause a Bayesian module to perform inadequately.

3. RESULTS

Our experiment consisted of the following setup:

* A naive Bayesian module

* 25 scientific domains (among which some were branches and some were sub-branches)

* A large data corpus consisting of publications and books used in the faculty

In an initial training and testing phase (without the link grammar module), results showed poor classification. Although on distinct categories there was only a 0.5% misclassification rate, on similar categories this percentage rose to almost 25% on the training corpus and 33% on a test corpus that was not present in the training set of documents.

The next step was to repeat these tests after inserting the syntactic information into the Bayesian module, using the canonical form presented in the proposed method section. This meant that new words had to be inserted into the dictionary (which approximately doubled its size). Training also took longer, since a pre-processing module was run before the Bayesian module. On distinct categories, the misclassification rate dramatically decreased to 0.01%, while on similar categories it decreased to 15% on the training corpus and 27% on the untrained corpus.

4. CONCLUSIONS

Although the presented results might seem low and inconclusive, we are confident that using syntactic information in a semantic classifier would increase its accuracy by at least 10% in the Bayesian case, and considerably more with better semantic classifiers such as LSI, CNG and so on. We also had a mixed commission, consisting of both professors and PhD students, which evaluated and performed a manual classification of the documents in the similar categories. Although it might seem surprising, the results provided by our classifier were better than the manual classification.

Acknowledgements

This work was entirely supported by the BitDefender AntiSpam Laboratory. Web: www.bitdefender.com

5. REFERENCES

Ceglowski M., Coburn A., Cuadrado J., (2003) Semantic Search of Unstructured Data Using Contextual Network Graphs.

Deerwester S., Dumais S., Furnas G., Landauer T., Harshman R. (2001), Indexing by latent semantic analysis, Journal of the American Society for Information Science

Grinberg D., Lafferty J., Sleator D. (1995) A Robust Parsing Algorithm For Link Grammars, Computer Science--Computation and Language

Jensen P., Vikner C. (1994) Lexical knowledge and the semantic analysis of Danish genitive constructions, Topics in Knowledge-based NLP Systems

Lafferty J., Sleator D., Temperley D. (1992), Grammatical trigrams: A probabilistic model of link grammar, Proc. of the 1992 AAAI Fall Symp. on Probabilistic Approaches to Natural Language

Landauer T., Foltz P., Laham D. (2000), An introduction to Latent Semantic Indexing, Department of Philosophy, University of Colorado at Boulder

Sleator D. K.; Temperley D. (1991). Parsing English with a Link Grammar, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Zhang H. (2004). The Optimality of Naive Bayes, American Association for Artificial Intelligence (www.aaai.org)