Introducing syntactic features into a Bayesian classifier.
Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan et al.
1. INTRODUCTION
Semantic classification can be considered the next level of what is now
called the "Semantic Web", in which information should be better classified
and easier to access. Hence, semantic processing technologies are of
increasing importance to the future of the Internet. A key challenge in the
semantic web is the mapping between different concepts. Many techniques for
such mapping exist, but most of them induce a one-to-one mapping, which does
not seem to correspond to real-world problems. This project proposes a new
approach, which tries to use the power of machine learning, and in
particular classification algorithms, to solve the mapping task. Several
techniques have been proposed over time, among them Bayesian filters,
k-nearest-neighbour (KNN) algorithms, neural networks, Contextual Network
Graphs (CNG) (Ceglowski et al., 2003), Latent Semantic Indexing (LSI)
(Deerwester et al., 2001), Term Frequency-Inverse Document Frequency
(TF-IDF), Natural Language Processing (NLP) and many others, yet few use
information other than statistics, synonymy and other relations between
documents and word frequencies in their decision process.
Latent semantic indexing (LSI) is a vector-space technique that has
shown great promise in addressing the problems of polysemy and synonymy
in text collections, offering improved recall over standard keyword
searches. One of the most striking features of LSI (Landauer et al.,
2000) is the ability to return relevant results for queries even when
there is no exact keyword match. Because the LSI vector model is a
purely mathematical representation, the indexing technique is not
limited to any particular language. (Jensen & Vikner, 1994)
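To make the keyword-free matching concrete, the short sketch below reduces a
tiny document collection with a truncated SVD, the core operation of LSI.
The toy corpus, the NumPy-based code and the chosen dimensionality are our
own illustrative assumptions, not the setup used in the cited works.

  # A minimal LSI sketch: truncated SVD over a tiny term-document matrix.
  # Toy documents and k = 2 latent dimensions are illustrative assumptions.
  import numpy as np

  docs = ["the dog has gone", "the cat has gone", "quantum physics lecture"]
  vocab = sorted({w for d in docs for w in d.split()})

  # Term-document matrix: rows are terms, columns are documents (raw counts).
  tdm = np.zeros((len(vocab), len(docs)))
  for j, d in enumerate(docs):
      for w in d.split():
          tdm[vocab.index(w), j] += 1

  # Truncated SVD keeps only the k strongest latent dimensions.
  k = 2
  u, s, vt = np.linalg.svd(tdm, full_matrices=False)
  doc_vectors = (np.diag(s[:k]) @ vt[:k]).T   # one k-dimensional vector per document

  # Similarity is measured in the reduced latent space, not on raw keywords.
  def cos(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  print(cos(doc_vectors[0], doc_vectors[1]))  # related documents score near 1.0
  print(cos(doc_vectors[0], doc_vectors[2]))  # unrelated document scores near 0.0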
A standard step in LSI is the creation of a term-document matrix
(TDM), which is essentially a weighted lookup table of term frequency
data for the entire document collection. In LSI, this matrix is
interpreted as a high-dimensional vector space. An alternative
interpretation of this matrix is possible, however, with the TDM
representing a bipartite graph of term and document nodes where each
non-zero value in the TDM corresponds to an edge connecting a term node
to a document node. In this model, every term is connected to all of the
documents in which the term appears, and every document has a link to
each term contained in that document. The weighted frequency values in
the TDM correspond to weights placed on the edges of the graph. We call
this construct a contextual network graph. (Ceglowski et al., 2003)
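A minimal sketch of this bipartite reading of the TDM is given below, with a
single spreading step from a term node to its related terms. The
adjacency-list representation and the toy documents are our own simplified
illustration of the idea, not the implementation from Ceglowski et al. (2003).

  # The TDM viewed as a bipartite term-document graph: every non-zero entry
  # becomes a weighted edge between a term node and a document node.
  from collections import defaultdict

  docs = {"d1": "the dog has gone", "d2": "the cat has gone"}

  edges = defaultdict(dict)               # node -> {neighbour: edge weight}
  for doc_id, text in docs.items():
      for term in text.split():
          weight = edges[doc_id].get(term, 0) + 1   # weighted term frequency
          edges[doc_id][term] = weight
          edges[term][doc_id] = weight

  # Querying the graph: start at a term node, spread to the documents that
  # contain it, then on to the other terms those documents contain.
  def related_terms(term):
      scores = defaultdict(float)
      for doc_id, w in edges[term].items():
          for other, w2 in edges[doc_id].items():
              if other != term:
                  scores[other] += w * w2
      return sorted(scores.items(), key=lambda kv: -kv[1])

  print(related_terms("dog"))   # terms co-occurring with "dog", weighted by edge products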
Even though these two methods have shown great results so far, we
believe that introducing syntactic information as a feature in the
decision process will significantly improve their accuracy. In this
paper we experiment with semantic classification using a multi-category
Bayesian filter that also includes syntactic information about the analyzed
documents.
2. PROPOSED METHOD
The Link Grammar Parser is a syntactic parser of English, based on
link grammar, an original theory of English syntax. Given a sentence,
the system assigns to it a syntactic structure, which consists of a set
of labeled links connecting pairs of words. The parser also produces a
"constituent" representation of a sentence (showing noun
phrases, verb phrases, etc.).
If we think of words as blocks with connectors coming out, then
there are different types of connectors; connectors may also point to
the right or to the left. A left-pointing connector connects with a
right-pointing connector of the same type on another word. The two
connectors together form a "link". Right-pointing connectors
are labeled "+", left-pointing connectors are labeled
"-". Words have rules about how their connectors can be
connected up, that is, rules about what would constitute a valid use of
that word. A valid sentence is one in which all the words present are
used in a way which is valid according to their rules, and which also
satisfies certain global rules (Sleator & Temperley, 1991).
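As a rough illustration of the connector idea (not the actual Link Grammar
Parser, whose lexicon and interface are far richer), the sketch below encodes
a few words with "+" and "-" connectors and checks whether a proposed set of
links uses every connector with matching types and directions. The
mini-lexicon and the link format are assumptions made for this sketch only.

  # Toy link-grammar-style connectors; the lexicon below is an assumption.
  LEXICON = {
      "the":  ["D+"],            # determiner: links rightward to a noun
      "dog":  ["D-", "S+"],      # noun: takes a determiner on its left, is a subject to its right
      "has":  ["S-", "PP+"],     # auxiliary: takes a subject on its left, a past participle on its right
      "gone": ["PP-"],           # past participle: links leftward to the auxiliary
  }

  def links_valid(words, links):
      """Each link is (left word index, right word index, type). A valid linkage
      uses every connector of every word exactly once, with matching types."""
      remaining = {i: list(LEXICON[w]) for i, w in enumerate(words)}
      for left, right, label in links:
          if left >= right:
              return False
          if label + "+" in remaining[left] and label + "-" in remaining[right]:
              remaining[left].remove(label + "+")
              remaining[right].remove(label + "-")
          else:
              return False
      return all(not conns for conns in remaining.values())

  print(links_valid(["the", "dog", "has", "gone"],
                    [(0, 1, "D"), (1, 2, "S"), (2, 3, "PP")]))   # True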
The structure assigned to a sentence by a link grammar is rather
unlike any other grammatical system that we know of (although it is
certainly related to dependency grammar). Rather than thinking in terms
of syntactic functions (like subject or object) or constituents (like
"verb phrase"), one must think in terms of relationships
between pairs of words. In the sentence "The dog has gone", for example, there is an
"S" ("subject") relation between "dog" and
"has"; a "PP" (past-participle) relationship between
"has" and "gone"; and a "D" (determiner)
relation between "the" and "dog". It may be seen,
however, that parts of speech, syntactic functions, and constituents may
be recovered from a link structure rather easily. For example, whatever
word is on the left end of an "S" link is the subject of a
clause (or the head word of the subject phrase); whatever is on the
right end is the finite verb; whatever is on the left end of a D link is
a determiner; etc. (Grinberg et al., 1995).
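Reading functions off a link structure can be sketched as follows, reusing
the toy link triples from the illustration above; the triple format and the
role names are assumptions of this sketch.

  # Recover syntactic functions from labeled links: the left end of an S link
  # is the subject, its right end the finite verb; the left end of a D link
  # is a determiner.
  links = [("the", "dog", "D"), ("dog", "has", "S"), ("has", "gone", "PP")]

  def roles_from_links(links):
      roles = {}
      for left, right, label in links:
          if label == "S":
              roles[left] = "sub"
              roles[right] = "verb"
          elif label == "D":
              roles[left] = "det"
      return roles

  print(roles_from_links(links))   # {'the': 'det', 'dog': 'sub', 'has': 'verb'}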
The syntactic information can be represented as a prefix to each word
included in a naive Bayesian dictionary. As an example, we will use
"sub_John" to represent that the word "John" is the subject of the phrase in
which it appears (Lafferty et al., 1992). A naive Bayes classifier is a
simple probabilistic classifier based on applying Bayes' theorem with strong
(naive) independence assumptions. A more descriptive term for the underlying
probability model would be "independent feature model" (Zhang, 2004).
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
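A minimal sketch of this prefixed representation is shown below, assuming a
roles mapping like the one in the illustrative sketch above; the "det_" and
"verb_" tags are our own extension of the "sub_" style prefix from the text.

  # Prefix each token with its syntactic role before it enters the Bayesian
  # dictionary; words with no known role stay plain.
  def tag_tokens(words, roles):
      return [f"{roles[w]}_{w}" if w in roles else w for w in words]

  words = ["the", "dog", "has", "gone"]
  roles = {"the": "det", "dog": "sub", "has": "verb"}
  print(tag_tokens(words, roles))   # ['det_the', 'sub_dog', 'verb_has', 'gone']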
Depending on the precise nature of the probability model, naive
Bayes classifiers can be trained very efficiently in a supervised
learning setting. In many practical applications, parameter estimation
for naive Bayes models uses the method of maximum likelihood; in other
words, one can work with the naive Bayes model without believing in
Bayesian probability or using any Bayesian methods.
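A minimal multinomial naive Bayes over such prefixed tokens might look like
the sketch below, using maximum-likelihood-style counts with add-one
smoothing for unseen words; the toy training data is an assumption, not our
actual corpus.

  # Multinomial naive Bayes over syntactically prefixed tokens (toy data).
  import math
  from collections import Counter, defaultdict

  train = [
      (["sub_electron", "verb_orbits", "nucleus"], "physics"),
      (["sub_force", "verb_equals", "mass", "acceleration"], "mechanics"),
      (["sub_photon", "verb_has", "spin"], "physics"),
  ]

  class_counts = Counter()
  word_counts = defaultdict(Counter)
  vocab = set()
  for tokens, label in train:
      class_counts[label] += 1
      word_counts[label].update(tokens)
      vocab.update(tokens)

  def classify(tokens):
      best_label, best_score = None, float("-inf")
      for label in class_counts:
          # log P(class) + sum of log P(token | class), Laplace-smoothed
          score = math.log(class_counts[label] / sum(class_counts.values()))
          total = sum(word_counts[label].values())
          for t in tokens:
              score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
          if score > best_score:
              best_label, best_score = label, score
      return best_label

  print(classify(["sub_electron", "verb_has", "spin"]))   # expected: "physics"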
Statistical classification is probably the most popular text
classification method used by semantic classifiers. Compared with a
rule-based filter, it has the advantage of a learning algorithm (i.e. the
more it classifies, the better its results become). Of course, since it is
one of the pioneers among text classification methods, various techniques
against this technology have been found, but its mathematical grounds (e.g.
the fact that the semantics of a word can be statistically established by
the frequency of the documents in which it appears) still offer good results
(Ceglowski et al., 2003). Our purpose was to add extra information to this
classifier. Even though it is able to determine the semantics of a word
statistically, we believe that this determination would be more accurate if
we also knew the syntactic function of the analyzed word.
In spite of their naive design and apparently over-simplified
assumptions, naive Bayes classifiers often work much better in many
complex real-world situations than one might expect. Recently, careful
analysis of the Bayesian classification problem has shown that there are
some theoretical reasons for the apparently unreasonable efficacy of
naive Bayes classifiers. An advantage of the naive Bayes classifier is
that it requires a small amount of training data to estimate the
parameters (means and variances of the variables) necessary for
classification. Because independent variables are assumed, only the
variances of the variables for each class need to be determined and not
the entire covariance matrix (Zhang, 2004).
Also, it has been shown that on semantically distinct categories
the Bayesian classifier performs quite well, but on similar ones, it
lacks accuracy. Our results show that introducing syntactic features
will increase accuracy and decrease misclassification on all documents.
Basically, when using semantically distinct categories, a Bayesian
module obtains very good results, but when a large part of these categories
have content in common (like physics, quantum physics and mechanics) there
is a large risk of misclassification. In other words, semantically
equivalent classes will cause a Bayesian module to perform inadequately.
3. RESULTS
Our experiment used the following data:
* A naive Bayesian module
* 25 scientific domains (among which some were branches and some were
sub-branches)
* A large corpus consisting of publications and books used in the faculty
In an initial training and testing phase (without inserting the link
grammar module), results showed poor classification. Although on distinct
categories there was only a 0.5% misclassification rate, on similar
categories this percentage rose to almost 25% on the training corpus and 33%
on a test corpus that was not present in the training list of documents.
The next step was to repeat these tests after inserting syntactic
information into the Bayesian module, using the canonical form presented in
the proposed method section. This meant that new words had to be inserted in
the dictionary (which approximately doubled its size). Training also took
longer, since a pre-processing module was run before the Bayesian module. On
distinct categories, the misclassification rate dropped dramatically to
0.01%, while on similar categories it decreased to 15% on the training
corpus and 27% on the untrained corpus.
4. CONCLUSIONS
Although the presented results might seem low and inconclusive, we are
confident that using syntactic information in a semantic classifier would
increase its accuracy by at least 10% in the Bayesian case, and considerably
more in better semantic classifiers like LSI, CNG and so on. We also had a
mixed commission, consisting of both professors and PhD students, which
evaluated and performed a manual classification of the documents in the
similar categories. Although it might seem surprising, the results provided
by our classifier were better than those of the manual classification.
Acknowledgements
This work was entirely supported by BitDefender--AntiSpam
Laboratory. Web: www.bitdefender.com
5. REFERENCES
Ceglowski M., Coburn A., Cuadrado J. (2003), Semantic Search of
Unstructured Data Using Contextual Network Graphs
Deerwester S., Dumais S., Furnas G., Landauer T., Harshman R.
(2001), Indexing by latent semantic analysis, Journal of the American
Society for Information Science
Grinberg D., Lafferty J., Sleator D. (1995) A Robust Parsing
Algorithm For Link Grammars, Computer Science--Computation and Language
Jensen P., Vikner C. (1994) Lexical knowledge and the semantic
analysis of Danish genitive constructions, Topics in Knowledge-based NLP Systems
Lafferty J., Sleator D., Temperley D. (1992), Grammatical trigrams: A
probabilistic model of link grammar, Proc. of the 1992 AAAI Fall Symp.
on Probabilistic Approaches to Natural Language
Landauer T., Foltz P., Laham D., (2000) An introduction to Latent
Semantic Indexing, Department of philosophy, University of Colorado at
Boulder
Sleator D. K., Temperley D. (1991), Parsing English with a Link
Grammar, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Zhang H. (2004), The Optimality of Naive Bayes, American Association
for Artificial Intelligence (www.aaai.org)