Semantic graph classifier for HTML documents.
Dichiu, Daniel; Sgarciu, Valentin
1. INTRODUCTION
Our research focused on automating text classification based on the
semantic relationships between words, without any training performed by
the user. The algorithm can thus be used out of the box and does not
rely on machine learning. The next step, in which the algorithm trains
itself to perform better classifications, will be taken in future
research.
A basic requirement for good software is processing speed. In this
respect, the algorithm can be customised to meet both reasonable speed
and reasonable accuracy demands, as will be shown in this paper.
Web-based text is generally written in natural language, though not
only in English (exceptions include, for instance, chunks of source
code). This makes automated text classification difficult, because
machines are not very good at interpreting natural language. One way to
address the problem is to associate keywords with the text (Golder &
Huberman, 2005). Graphs can then be used to represent the different
keywords and the semantic relations between them.
Unlike computers, human cognition uses semantic and syntactic
relations between words to give significance to natural language. Our
method tries to emulate part of this human cognition by investigating
semantic relationships between words in a given text.
2. TEXT CLASSIFICATION METHODS
Common text classification problems arise from polysemy (the same
word has different meanings) and synonymy (two different words have the
same or closely related meanings, so they can be used interchangeably in
a text without changing its meaning).
The most notable text classification methods relate to searching
large data collections (Latent Semantic Indexing, LSI; Contextual
Network Graphs, CNG) or to filtering data (the naive Bayes classifier)
(Ceglowski et al., 2004; Kelleher, 2004).
LSI uses a term-document matrix (TDM) that has to be recomputed each
time a new term or a new document is added, which is a disadvantage.
Another downside of LSI is the processing time needed to return a
result, caused by the singular value decomposition that has to be
executed each time LSI is used. The main advantage of the method is its
ability to return relevant results for queries even when there is no
exact term match, thus solving the synonymy problem.
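To make the LSI mechanics concrete, the following minimal sketch (not
from the original paper) performs latent indexing with a truncated
singular value decomposition; the matrix values, the rank k and the
query are illustrative assumptions:

import numpy as np

# Rows = terms, columns = documents (raw term frequencies).
tdm = np.array([[2., 0., 1.],
                [1., 1., 0.],
                [0., 3., 1.]])

# Truncated SVD: keep only the k largest singular values.
u, s, vt = np.linalg.svd(tdm, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ vt[:k]).T   # documents in the latent space

# Fold a query (term-frequency vector) into the same latent space.
query = np.array([1., 0., 1.])
q_vec = query @ u[:, :k] / s[:k]

# Rank documents by cosine similarity to the query.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q_vec))
print(np.argsort(-sims))                 # best-matching documents first

Because documents and the query are compared in the reduced latent
space rather than by exact term overlap, documents sharing no query
term can still rank highly, which is how LSI sidesteps synonymy.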
CNG is an adaptation of LSI in which the term-document matrix is
transformed into a graph whose nodes are terms or documents and whose
edges carry the weighted frequency values from the TDM. Each document is
linked to all the terms it contains, and each term is connected to all
the documents in which it appears. To perform a search, all nodes
corresponding to the query terms are given an energy level of 1. The
algorithm then spreads this energy to all connected nodes by multiplying
the source-node energy by the edge value (edges have values between 0
and 1, so the energy eventually stops propagating). The advantage of
this method is that it is scalable and can easily be parallelised to
increase processing speed.
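A rough sketch of this spreading-activation search follows; the graph
topology, edge weights and cutoff are illustrative assumptions (a real
TDM-derived graph would be far larger):

# Bipartite term/document graph; every node appears as a key.
graph = {
    "poker":  {"doc1": 0.8, "doc2": 0.3},
    "casino": {"doc1": 0.6},
    "doc1":   {"poker": 0.8, "casino": 0.6},
    "doc2":   {"poker": 0.3},
}

def spread(graph, query_terms, cutoff=0.05):
    # Query-term nodes start with an energy of 1; energy then travels
    # along edges, decaying because every edge weight is below 1.
    energy = {node: 0.0 for node in graph}
    frontier = [(term, 1.0) for term in query_terms if term in graph]
    while frontier:
        node, incoming = frontier.pop()
        energy[node] += incoming
        for neighbour, weight in graph[node].items():
            if incoming * weight > cutoff:   # stop once energy is tiny
                frontier.append((neighbour, incoming * weight))
    return energy

print(spread(graph, ["poker"]))  # documents ranked by accumulated energy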
3. SEMANTIC GRAPH CLASSIFIER
Our classification algorithm is based on Contextual Network Graphs,
but in our graphs each node represents a single word.
To decide whether a web page belongs to a certain domain, we follow
these steps. First, we gather a relatively short list of words
associated with the field of interest; this short list contains the
words most frequently used in that domain. Then, two further lists of
words are created from the words on the short list: one with terms
related to the field of interest and one with terms outside it. From
these two lists we construct two distinct graphs (each node contains a
word, and related nodes/words are connected by a weighted edge) that
have only the shortlist words in common. For each document to be
classified, the value of each node is initialised to 0, while each
weighted edge has a value between 0 and 1, depending on the relationship
between the words.
The next step is to look up the terms on the short list in the
document to be classified. For each term encountered in the document,
the energy of the associated node in both graphs is set to 1. Then,
energy is propagated from each node with a value above 0 using the
following equation:
energy(dn) = energy(sn) * edge(sn, dn) + energy(dn) (1)
where energy(dn) is the energy of the destination node, energy(sn)
is the energy of the source node, and edge(sn, dn) is the value of the
edge between the source node and the destination node. At this stage,
the semantic content of the document is captured in the values of the
edges and in the newly activated nodes (i.e. those with an energy above
a predetermined threshold). After this step, all activated terms are
inserted into two new lists characteristic of each document: one with
terms associated with the field of interest and one with terms not
associated with it. The final step is to classify the document based on
the number of elements in the two new lists.
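A minimal sketch of this propagation step follows. The example graph,
weights and threshold are illustrative assumptions, and the per-node
update cap (taken from the implementation in Section 4) guarantees
termination even across synonym edges of weight 1:

# Propagation by equation (1) over one word graph.
def propagate(graph, seed_terms, threshold=0.4, max_updates=10):
    # graph: {word: {neighbour: edge_weight}}, weights in (0, 1],
    # with weight 1 reserved for synonyms; every node appears as a key.
    energy = {word: 0.0 for word in graph}
    updates = {word: 0 for word in graph}
    queue = []
    for term in seed_terms:       # shortlist terms found in the document
        if term in graph:
            energy[term] = 1.0
            queue.append((term, 1.0))
    while queue:
        source, delta = queue.pop()
        for dest, weight in graph[source].items():
            if updates[dest] < max_updates:  # cap updates per node
                gain = delta * weight        # equation (1)
                energy[dest] += gain
                updates[dest] += 1
                queue.append((dest, gain))
    # Activated terms: energy above the predetermined threshold.
    return {word for word, e in energy.items() if e > threshold}

sports = {"football": {"team": 0.8, "soccer": 1.0},
          "team": {"football": 0.8},
          "soccer": {"football": 1.0}}
print(propagate(sports, ["football"]))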
For example, to classify a web page as a sports or non-sports
document, we would make a short list with words like "sport",
"football", "player", "field", "team", "play", "coach", "game", etc. As
one can see, some of these words are polysemous and some are synonymous.
In a semantic graph classifier, the synonymy problem is solved by
creating an edge with a value of 1 between synonymous words. The
polysemy problem is solved by using the second graph, in which terms are
connected to other terms outside the field of interest (i.e. sports).
Because edge values are not changed during execution, special attention
must be paid to the choice of words and of the weights between them.
[Figure 1: Flowchart of the algorithm (omitted)]
An advantage of this semantic classifier approach is its robustness
to the synonymy and polysemy problems.
4. ALGORITHM IMPLEMENTATION
The algorithm, implemented in Python, takes an HTML web page as
input and outputs its category. Figure 1 is a flowchart of the
algorithm. In this implementation we chose speed over accuracy: we
remove only stop-words and common words from each document before
applying the semantic classifier, thus doing minimal pre-processing.
After the document is downloaded, it is stripped of some HTML tags
(e.g. script, style, span). Then the keywords on the shortlist are
searched for in the text of the document. If found, the corresponding
nodes are activated (they are given a value of 1) in both graphs.
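As an illustration, the content of script and style elements can be
dropped with the Python standard-library HTML parser; this sketch is an
assumption about how the stripping step could work, not the paper's
actual code:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Markup is discarded everywhere; for these elements the enclosed
    # content is discarded as well.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(c for c in self.chunks if c)

parser = TextExtractor()
parser.feed("<html><script>var x = 1;</script>"
            "<p>free <span>casino</span> games</p></html>")
print(parser.text())   # -> free casino games

Tags such as span lose only their markup here, while their text is
kept, matching the behaviour described above.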
The next step is to propagate the energy from the activated nodes,
using equation (1), until each node has been updated at most 10 times.
At the end of this step, all nodes with a value above a predefined
threshold are returned by the graph. These new terms are then searched
for in the document and, if found, inserted into a list (at this stage,
there is one list per graph).
The classification criterion is the number of words that the two
graphs return. Documents that return more words from one graph are
classified accordingly. Documents that return the same number of words
from both graphs are left unclassified.
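Putting the pieces together, the decision step might read as follows.
This is a sketch under stated assumptions: propagate is the function
outlined in Section 3 and the whitespace tokenisation is deliberately
naive:

def classify(text, gambling_graph, other_graph, shortlist, propagate):
    # Words actually present in the (already stripped) document text.
    words = set(text.lower().split())
    seeds = [term for term in shortlist if term in words]
    # Terms each graph activates that also occur in the document.
    from_gambling = {t for t in propagate(gambling_graph, seeds)
                     if t in words}
    from_other = {t for t in propagate(other_graph, seeds)
                  if t in words}
    if len(from_gambling) > len(from_other):
        return "gambling"
    if len(from_other) > len(from_gambling):
        return "non-gambling"
    return None   # equal counts: the document is left unclassified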
5. RESULTS
To test the algorithm, after several adjustments, the following
parameters were used:
* A short list of 24 gambling-related words was compiled;
* A list of 47 words semantically related to the words on the short
list and to gambling was created;
* A list of 103 words semantically related to the words on the
short list, but not to gambling, was created;
* Two graphs (one with terms related to gambling and one with terms
not related to gambling) were built; the values of the edges between
nodes were assigned based on the frequency of terms encountered in a
corpus of 100 documents (a possible weighting scheme is sketched after
this list);
* The threshold for node activation (above which the associated
term was returned by the graph) was set to 0.4.
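The paper does not give the exact weighting formula, so the following
is only one plausible scheme: weight an edge by how often its two words
co-occur in a corpus document, scaled into (0, 1]:

from collections import Counter
from itertools import combinations

def edge_weights(documents, vocabulary):
    # Count how often each pair of vocabulary words co-occurs in a
    # document, then scale by the most frequent pair. A real
    # implementation might cap non-synonym weights below 1, since
    # weight 1 is reserved for synonyms.
    pairs = Counter()
    for doc in documents:
        present = sorted(set(doc.lower().split()) & vocabulary)
        pairs.update(combinations(present, 2))
    top = max(pairs.values(), default=1)
    return {pair: count / top for pair, count in pairs.items()}

docs = ["free casino poker games", "casino poker chips"]
print(edge_weights(docs, {"casino", "poker", "chips"}))
# e.g. {('casino', 'poker'): 1.0, ('casino', 'chips'): 0.5,
#       ('chips', 'poker'): 0.5}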
The algorithm was tested against a corpus of 2276 web pages
gathered using the Google search engine with the following keywords:
"free casino", "online slots", "texas hold'em", "poker", "free poker
games", "casino poker". Of these documents, a total of 1258 were
classified as gambling documents, with only 4 being misclassified. Of
the remaining 1018 documents, 179 were non-gambling and classified
accordingly, while 839 were gambling-related but classified as
non-gambling.
This behaviour can be explained by:
* The low number of words on the short list, which did not cover
the gambling domain well enough;
* The fact that the algorithm does not handle multi-word expressions
(here, valuable expressions would have been "texas hold'em" or "slot
machine"); because such expressions are less polysemous than single
words, their absence hurts the accuracy of the method;
* The fact that some web pages included text only as images.
Each web page was processed in an average of 600 milliseconds on a
computer with a 3 GHz Pentium 4 processor and 1 GB of RAM, running
Windows XP SP2.
6. FUTURE RESEARCH
One way of solving the polysemy problem is to use a part-of-speech
tagger (Bird et al., 2009), which tags each word as a noun, verb,
adjective, etc. We plan to introduce this extra information into our
semantic classifier.
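For instance, with the NLTK toolkit that accompanies Bird et al.
(2009), tagging could look like the sketch below; the model names in
the comment are those of current NLTK releases and must be downloaded
once:

import nltk
# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The coach fields a strong team")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('coach', 'NN'), ('fields', 'VBZ'),
#       ('a', 'DT'), ('strong', 'JJ'), ('team', 'NN')]
# "fields" is tagged as a verb here, disambiguating it from the
# sports sense of the noun "field".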
Another improvement to the current method would be to allow
multi-word expressions in the nodes of the semantic graph.
We would also like to find an artificial intelligence algorithm
that changes the graph structure depending on the documents already
classified (thus using unsupervised learning).
7. CONCLUSION
These results show that this kind of semantic classifier has
considerable potential, given the improvements we intend to make. The
classification method is well suited to domains whose terms are far more
likely to carry one particular meaning than any other (e.g. poker and
blackjack are almost certainly related to gambling and not to any domain
outside it). Counter-examples are documents that need to be classified
as violence-related or tabloid-related.
This method for web page classification can be used in parental
control software.
8. REFERENCES
Bird, S.; Klein, E. & Loper, E. (2009). Natural Language Processing
with Python, O'Reilly Media, ISBN 978-0-596-51649-9
Ceglowski, M.; Coburn, A. & Cuadrado, J. (2004). Semantic Search of
Unstructured Data using Contextual Network Graphs, Available from:
http://www.knowledgesearch.org/papers/Contextual_Network_Graphs.pdf
Accessed: 2009-04-02
Golder, S.A. & Huberman, B.A. (2005). The Structure of Collaborative
Tagging Systems, Available from:
http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf Accessed:
2009-04-12
Kelleher, D. (2004). Spam filtering using contextual network graphs,
Available from: https://www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
Accessed: 2009-04-02