Semantic graph classifier for HTML documents.
Dichiu, Daniel; Sgarciu, Valentin
1. INTRODUCTION
Our research focused on automating text classification based on the
semantic relationships between words, without any training performed by
the user. The algorithm can thus be used out of the box and does not
rely on machine learning. The next step, in which the algorithm trains
itself to perform better classifications, will be taken in future
research.
A basic requirement for good software is processing speed. In this
respect, the algorithm can be customised to meet both reasonable speed
and reasonable accuracy demands, as will be shown in this paper.
Web-based text is generally written in natural language, though not
only in English (exceptions include, for instance, chunks of source
code). This makes automated text classification difficult, because
machines are not very good at interpreting natural language. One way to
address the problem is to associate keywords with the text (Golder &
Huberman, 2005). Graphs can then be used to represent the different
keywords and the semantic relations between them.
Unlike computers, human cognition uses semantic and syntactic
relations between words to give significance to natural language. Our
method tries to emulate part of this human cognition by investigating
semantic relationships between words in a given text.
2. TEXT CLASSIFICATION METHODS
Common text classification problems arise from polysemy (the same
word has different meanings) and synonymy (two different words have the
same or closely related meanings, so they can be used interchangeably in
a text without changing its meaning).
The most notable text classification methods relate to searching
large data collections (Latent Semantic Indexing, LSI; Contextual
Network Graphs, CNG) or to filtering data (the naive Bayes classifier)
(Ceglowski et al., 2004; Kelleher, 2004).
LSI uses a term-document matrix (TDM) that has to be recomputed each
time a new term or a new document is added, which is a disadvantage.
Another downside of LSI is the processing time needed to return a
result, caused by the singular value decomposition that has to be
executed each time LSI is used. The main advantage of the method is its
ability to return relevant results for queries even when there is no
exact term match, thus solving the synonymy problem.
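To make the LSI mechanics concrete, the following minimal sketch (not
from the original paper) performs latent indexing with a truncated
singular value decomposition; the matrix values, the rank k and the
query are illustrative assumptions:

import numpy as np

# Rows = terms, columns = documents (raw term frequencies).
tdm = np.array([[2., 0., 1.],
                [1., 1., 0.],
                [0., 3., 1.]])

# Truncated SVD: keep only the k largest singular values.
u, s, vt = np.linalg.svd(tdm, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ vt[:k]).T   # documents in the latent space

# Fold a query (term-frequency vector) into the same latent space.
query = np.array([1., 0., 1.])
q_vec = query @ u[:, :k] / s[:k]

# Rank documents by cosine similarity to the query.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q_vec))
print(np.argsort(-sims))                 # best-matching documents first

Because documents and the query are compared in the reduced latent
space rather than by exact term overlap, documents sharing no query
term can still rank highly, which is how LSI sidesteps synonymy.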
CNG is an adaptation of LSI in which the term-document matrix is
transformed into a graph whose nodes are terms or documents and whose
edges carry the weighted frequency values from the TDM. Each document is
linked to all the terms it contains, and each term is connected to all
the documents in which it appears. To perform a search, all nodes
corresponding to the query terms are given an energy level of 1. The
algorithm then spreads this energy to all connected nodes by multiplying
the source-node energy by the edge value (edges have values between 0
and 1, so the energy eventually stops propagating). The advantage of
this method is that it is scalable and can easily be parallelised to
increase processing speed.
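A rough sketch of this spreading-activation search follows; the graph
topology, edge weights and cutoff are illustrative assumptions (a real
TDM-derived graph would be far larger):

# Bipartite term/document graph; every node appears as a key.
graph = {
    "poker":  {"doc1": 0.8, "doc2": 0.3},
    "casino": {"doc1": 0.6},
    "doc1":   {"poker": 0.8, "casino": 0.6},
    "doc2":   {"poker": 0.3},
}

def spread(graph, query_terms, cutoff=0.05):
    # Query-term nodes start with an energy of 1; energy then travels
    # along edges, decaying because every edge weight is below 1.
    energy = {node: 0.0 for node in graph}
    frontier = [(term, 1.0) for term in query_terms if term in graph]
    while frontier:
        node, incoming = frontier.pop()
        energy[node] += incoming
        for neighbour, weight in graph[node].items():
            if incoming * weight > cutoff:   # stop once energy is tiny
                frontier.append((neighbour, incoming * weight))
    return energy

print(spread(graph, ["poker"]))  # documents ranked by accumulated energy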
3. SEMANTIC GRAPH CLASSIFIER
Our classification algorithm is based on Contextual Network Graphs,
but in our graphs each node represents a single word.
To decide whether a web page belongs to a certain domain, we follow
these steps. First, we gather a relatively short list of words
associated with the field of interest; this short list contains the
words most frequently used in that domain. Then, two further lists of
words are created from the words on the short list: one with terms
related to the field of interest and one with terms outside it. From
these two lists we construct two distinct graphs (each node contains a
word, and related nodes/words are connected by a weighted edge) that
have only the shortlist words in common. For each document to be
classified, the value of each node is initialised to 0, while each
weighted edge has a value between 0 and 1, depending on the relationship
between the words.
The next step is to look up the terms on the short list in the
document to be classified. For each term encountered in the document,
the energy of the associated node in both graphs is set to 1. Then,
energy is propagated from each node with a value above 0 using the
following equation:
energy(dn) = energy(sn) * edge(sn, dn) + energy(dn) (1)
where energy(dn) is the energy of the destination node, energy(sn)
is the energy of the source node, and edge(sn, dn) is the value of the
edge between the source node and the destination node. At this stage,
the semantic content of the document is captured in the values of the
edges and in the newly activated nodes (i.e. those with an energy above
a predetermined threshold). After this step, all activated terms are
inserted into two new lists characteristic of each document: one with
terms associated with the field of interest and one with terms not
associated with it. The final step is to classify the document based on
the number of elements in the two new lists.
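A minimal sketch of this propagation step follows. The example graph,
weights and threshold are illustrative assumptions, and the per-node
update cap (taken from the implementation in Section 4) guarantees
termination even across synonym edges of weight 1:

# Propagation by equation (1) over one word graph.
def propagate(graph, seed_terms, threshold=0.4, max_updates=10):
    # graph: {word: {neighbour: edge_weight}}, weights in (0, 1],
    # with weight 1 reserved for synonyms; every node appears as a key.
    energy = {word: 0.0 for word in graph}
    updates = {word: 0 for word in graph}
    queue = []
    for term in seed_terms:       # shortlist terms found in the document
        if term in graph:
            energy[term] = 1.0
            queue.append((term, 1.0))
    while queue:
        source, delta = queue.pop()
        for dest, weight in graph[source].items():
            if updates[dest] < max_updates:  # cap updates per node
                gain = delta * weight        # equation (1)
                energy[dest] += gain
                updates[dest] += 1
                queue.append((dest, gain))
    # Activated terms: energy above the predetermined threshold.
    return {word for word, e in energy.items() if e > threshold}

sports = {"football": {"team": 0.8, "soccer": 1.0},
          "team": {"football": 0.8},
          "soccer": {"football": 1.0}}
print(propagate(sports, ["football"]))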
For example, to classify a web page as a sports or non-sports
document, we would make a short list with words like "sport",
"football", "player", "field", "team", "play", "coach", "game", etc. As
one can see, some of these words are polysemous and some are synonymous.
In a semantic graph classifier, the synonymy problem is solved by
creating an edge with a value of 1 between synonymous words. The
polysemy problem is solved by using the second graph, in which terms are
connected to other terms outside the field of interest (i.e. sports).
Because edge values are not changed during execution, special attention
must be paid to the choice of words and of the weights between them.
[Figure 1: Flowchart of the algorithm (omitted)]
An advantage of this semantic classifier approach is its robustness
to the synonymy and polysemy problems.
4. ALGORITHM IMPLEMENTATION
The algorithm, implemented in Python, takes an HTML web page as
input and outputs its category. Figure 1 is a flowchart of the
algorithm. In this implementation we chose speed over accuracy: we
remove only stop-words and common words from each document before
applying the semantic classifier, thus doing minimal pre-processing.
After the document is downloaded, it is stripped of some HTML tags
(e.g. script, style, span). Then the keywords on the shortlist are
searched for in the text of the document. If found, the corresponding
nodes are activated (they are given a value of 1) in both graphs.
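As an illustration, the content of script and style elements can be
dropped with the Python standard-library HTML parser; this sketch is an
assumption about how the stripping step could work, not the paper's
actual code:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Markup is discarded everywhere; for these elements the enclosed
    # content is discarded as well.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(c for c in self.chunks if c)

parser = TextExtractor()
parser.feed("<html><script>var x = 1;</script>"
            "<p>free <span>casino</span> games</p></html>")
print(parser.text())   # -> free casino games

Tags such as span lose only their markup here, while their text is
kept, matching the behaviour described above.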
The next step is to propagate the energy from the activated nodes,
using equation (1), until each node has been updated at most 10 times.
At the end of this step, all nodes with a value above a predefined
threshold are returned by the graph. These new terms are then searched
for in the document and, if found, inserted into a list (at this stage,
there is one list per graph).
The classification criterion is the number of words that the two
graphs return. Documents that return more words from one graph are
classified accordingly. Documents that return the same number of words
from both graphs are left unclassified.
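Putting the pieces together, the decision step might read as follows.
This is a sketch under stated assumptions: propagate is the function
outlined in Section 3 and the whitespace tokenisation is deliberately
naive:

def classify(text, gambling_graph, other_graph, shortlist, propagate):
    # Words actually present in the (already stripped) document text.
    words = set(text.lower().split())
    seeds = [term for term in shortlist if term in words]
    # Terms each graph activates that also occur in the document.
    from_gambling = {t for t in propagate(gambling_graph, seeds)
                     if t in words}
    from_other = {t for t in propagate(other_graph, seeds)
                  if t in words}
    if len(from_gambling) > len(from_other):
        return "gambling"
    if len(from_other) > len(from_gambling):
        return "non-gambling"
    return None   # equal counts: the document is left unclassified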
5. RESULTS
To test the algorithm, after several adjustments, the following
parameters were used:
* A short list of 24 gambling-related words was compiled;
* A list of 47 words semantically related to the words on the short
list and to gambling was created;
* A list of 103 words semantically related to the words on the
short list, but not to gambling, was created;
* Two graphs (one with terms related to gambling and one with terms
not related to gambling) were built; the values of the edges between
nodes were assigned based on the frequency of terms encountered in a
corpus of 100 documents (a possible weighting scheme is sketched after
this list);
* The threshold for node activation (above which the associated
term was returned by the graph) was set to 0.4.
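The paper does not give the exact weighting formula, so the following
is only one plausible scheme: weight an edge by how often its two words
co-occur in a corpus document, scaled into (0, 1]:

from collections import Counter
from itertools import combinations

def edge_weights(documents, vocabulary):
    # Count how often each pair of vocabulary words co-occurs in a
    # document, then scale by the most frequent pair. A real
    # implementation might cap non-synonym weights below 1, since
    # weight 1 is reserved for synonyms.
    pairs = Counter()
    for doc in documents:
        present = sorted(set(doc.lower().split()) & vocabulary)
        pairs.update(combinations(present, 2))
    top = max(pairs.values(), default=1)
    return {pair: count / top for pair, count in pairs.items()}

docs = ["free casino poker games", "casino poker chips"]
print(edge_weights(docs, {"casino", "poker", "chips"}))
# e.g. {('casino', 'poker'): 1.0, ('casino', 'chips'): 0.5,
#       ('chips', 'poker'): 0.5}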
The algorithm was tested against a corpus of 2276 web pages
gathered using the Google search engine with the following keywords:
"free casino", "online slots", "texas hold'em", "poker", "free poker
games", "casino poker". Of these documents, a total of 1258 were
classified as gambling documents, with only 4 being misclassified. Of
the remaining 1018 documents, 179 were non-gambling and classified
accordingly, while 839 were gambling-related but classified as
non-gambling.
This behaviour can be explained by:
* The low number of words on the short list, which did not cover
the gambling domain well enough;
* The fact that the algorithm does not handle multi-word expressions
(here, valuable expressions would have been "texas hold'em" or "slot
machine"); because such expressions are less polysemous than single
words, their absence hurts the accuracy of the method;
* The fact that some web pages included text only as images.
Each web page was processed in an average of 600 milliseconds on a
computer with a 3 GHz Pentium 4 processor and 1 GB of RAM, running
Windows XP SP2.
6. FUTURE RESEARCH
One way of solving the polysemy problem is to use a part-of-speech
tagger (Bird et al., 2009), which tags each word as a noun, verb,
adjective, etc. We plan to introduce this extra information into our
semantic classifier.
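For instance, with the NLTK toolkit that accompanies Bird et al.
(2009), tagging could look like the sketch below; the model names in
the comment are those of current NLTK releases and must be downloaded
once:

import nltk
# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The coach fields a strong team")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('coach', 'NN'), ('fields', 'VBZ'),
#       ('a', 'DT'), ('strong', 'JJ'), ('team', 'NN')]
# "fields" is tagged as a verb here, disambiguating it from the
# sports sense of the noun "field".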
Another improvement to the current method would be to allow
multi-word expressions in the nodes of the semantic graph.
We would also like to find an artificial intelligence algorithm
that changes the graph structure depending on the documents already
classified (thus using unsupervised learning).
7. CONCLUSION
These results show that this kind of semantic classifier has
considerable potential, given the improvements we intend to make. The
classification method is well suited to domains whose terms are far more
likely to carry one particular meaning than any other (e.g. poker and
blackjack are almost certainly related to gambling and not to any domain
outside it). Counter-examples are documents that need to be classified
as violence-related or tabloid-related.
This method for web page classification can be used in parental
control software.
8. REFERENCES
Bird, S.; Klein, E. & Loper, E. (2009). Natural Language Processing
with Python, O'Reilly Media, ISBN 978-0-596-51649-9
Ceglowski, M.; Coburn, A. & Cuadrado, J. (2004). Semantic Search of
Unstructured Data using Contextual Network Graphs, Available from:
http://www.knowledgesearch.org/papers/Contextual_Network_Graphs.pdf
Accessed: 2009-04-02
Golder, S.A. & Huberman, B.A. (2005). The Structure of Collaborative
Tagging Systems, Available from:
http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf Accessed:
2009-04-12
Kelleher, D. (2004). Spam filtering using contextual network graphs,
Available from: https://www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
Accessed: 2009-04-02