
So you want to implement automatic categorization?

Lubbes, R Kirk

TechTrends

Automatic categorization can be a powerful tool despite its limitations, but it is still important to test and evaluate before making a commitment to using it.

The massive amounts of data available through the Internet, extranets, and internal corporate databases have created the need for new techniques to organize information. An excellent example is automatic categorization, an information management tool designed to assist enterprises in filing and retrieving the vast numbers of electronic records that they generate or use today. Automatic categorization attempts to assign electronic records to either predefined file structures or to self-defined categories through computer-based processes.

Successful use of automatic categorization requires the melding of technical and records management perspectives. Insight into the theory behind various vendor implementations, the benefits and limitations of each, and an understanding of what is "under the covers" will aid records and information managers in making intelligent decisions in selecting and implementing automatic categorization.

Automatic categorization technology has two principal approaches: pattern matching and rule-based systems. Pattern matching systems use word patterns and concepts within the electronic record to associate the record with a predefined file structure. Pattern matching systems can be further divided based on the technique used to associate patterns with a given category. The four principal pattern matching techniques used by vendors today are k-nearest neighbor, Bayesian, neural networks, and support vector machines.

Rule-based systems depend on a user-specified set of rules to associate the occurrence (or exclusion) of names, phrases, or concepts contained in documents with specific file plan subject headings. The computer parses the document, identifies the user-specified entities, and assigns the document to the appropriate category based upon the rule set.

Pattern Matching Approaches

Pattern matching categorization requires providing the system with representative sample documents for each subject heading in the file plan. Using the sample documents, the categorization software generates an internal representation for each subject heading. The software compares any new documents entering the system to internal subject heading representations and assigns the new document to the subject where it fits best. There are two phases to this process: the training phase, which consists of providing sample documents, and the classification phase, in which new documents are assigned.

The training phase requires the records manager to identify document sets that represent each subject heading in the file plan. This is the training set. Identifying a training set is an empirical problem, one in which the records manager's knowledge of existing records and the current file structure is critical to automatic categorization's success. The manner in which the training set is used differentiates each of the four pattern recognition techniques.

In the classification phase, new documents entering the system are assigned to one or more categories using algorithms fine-tuned during the training phase. This category assignment is equivalent to the records manager's indexing the document in order to assign it to a specific subject heading.

Software developers currently use four primary methods to assign documents to subject headings (categories). The methods are drawn from various mathematics and computer science disciplines. The k-nearest neighbor algorithm is based upon algebra and geometry. Bayesian modeling uses probability theory. Neural networks are an outgrowth of the computer science field of artificial intelligence. Support vector machines (SVM) are founded on machine learning theory.

K-nearest Neighbor

K-nearest neighbor is the easiest categorization approach to understand because the mathematics it uses has a physical analog in the real world.

In k-nearest neighbor, the records manager constructs a training set. The categorization software produces an internal representation in which each document in the training set is a point on a graph. The training set clusters graphic below shows the three-dimensional graph of a training set as produced by the product SERprivateBrain Learnset Viewer. The viewer allows humans to visualize a training set the same way that the software does. The points, representing the documents in each category, form groupings called clusters.

In the training set clusters example, the file plan has five categories, each containing documents created from five different books (The Age of Reason, Holy Bible, Dracula, Moby Dick, and Zarathustra). There are five clusters - one for each book title - represented by different colors as shown in the legend. A cluster exists for each category (as it would for each file plan heading). In the k-nearest neighbor approach, the software generates a sphere that contains the documents (points) in the subject heading (cluster) and calculates the center of the sphere. The simulated spherical k-nearest neighbor boundaries graphic above shows spheres drawn around the clusters representing Moby Dick and The Age of Reason to illustrate this concept. The center of the sphere is called the centroid. The centroid represents the subject heading in the file plan to the computer.

New documents are filed (categorized) during the classification phase. The radius for the centroid sphere defines the maximum distance that any new document's representative point can be from that subject heading's centroid. When the point associated with a new document falls within any cluster's sphere, the document will be filed in that associated subject heading.
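
For the technically curious, the centroid-and-radius scheme just described can be sketched in a few lines of Python. This is an illustration, not any vendor's implementation; it assumes documents have already been reduced to numeric feature vectors (for example, term-frequency vectors), and all helper names below are invented for the sketch.

    import math

    def centroid(vectors):
        """Average the training vectors filed under one subject heading."""
        n = len(vectors)
        return [sum(dim) / n for dim in zip(*vectors)]

    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def build_spheres(training_sets):
        """For each heading, keep the centroid plus the radius that just
        encloses that heading's training documents."""
        spheres = {}
        for heading, vectors in training_sets.items():
            c = centroid(vectors)
            r = max(distance(c, v) for v in vectors)
            spheres[heading] = (c, r)
        return spheres

    def classify(doc_vector, spheres):
        """File the document under the nearest centroid whose sphere
        contains it; return None if it falls outside every sphere."""
        best = None
        for heading, (c, r) in spheres.items():
            d = distance(doc_vector, c)
            if d <= r and (best is None or d < best[1]):
                best = (heading, d)
        return best[0] if best else None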

Bayesian Modeling

The Bayesian modeling approach is based on the concept that knowledge about the distribution of previous outcomes helps determine the probability of current outcomes. In automatic categorization, Bayesian modeling asserts that if the assignment of a set of documents from the corpus (the training set) to categories (subject headings) is known, then this information can aid in predicting the assignment of a new document to the appropriate category. In other words, information obtained from the training set can be used to fine-tune a statistical model so that it can assign documents to categories.

One of the Bayesian model's drawbacks is that it assumes that words or phrases are independent. For example, the words records and manager, when considered independently, have a different meaning than the phrase records manager; the two words are therefore not independent. Bayesian modeling generally provides reasonable categorization results. However, there is no way to guarantee independence between all terms that the Bayesian model will use to assign documents to subject headings.
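
As a concrete illustration of the approach - and of the independence assumption discussed above - the following sketch uses the open-source scikit-learn library, which the article does not mention; the sample documents and headings are invented for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny invented training set: two subject headings, two documents each.
    train_docs = ["retention schedule for records", "records manager filing plan",
                  "reactor safety report", "nuclear power plant output"]
    train_headings = ["records management", "records management",
                      "nuclear power", "nuclear power"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)        # word-count vectors
    model = MultinomialNB().fit(X, train_headings)  # estimates per-word probabilities,
                                                    # treating each word as independent
    print(model.predict(vectorizer.transform(["filing plan for the records room"])))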

Neural Networks

The basis for neural networks, more correctly called artificial neural networks, lies in the computer science field of artificial intelligence. Neural networks are the result of attempts to model the human brain. In the general case, a neural network can accept a document described by its differentiating words and phrases and classify it into a predefined set of categories.

The neural network must be trained to assign the document to a category. Other pattern recognition systems use mathematical formulas to extract the parameters from the training set during the training phase.

Neural network training is unique in that it uses a trial and error method for determining the parameters it will use to assign documents to various categories.
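
A toy sketch of that trial-and-error idea, reduced to a single artificial neuron: the weights are nudged whenever a training document is misfiled, and the passes repeat until the training set is classified correctly or a pass limit is reached. This illustrates the principle only; it is not how any commercial product trains its network.

    def train_neuron(samples, n_features, passes=100, rate=0.1):
        """samples: list of (feature_vector, label) pairs, label 0 or 1."""
        weights = [0.0] * n_features
        bias = 0.0
        for _ in range(passes):
            errors = 0
            for x, label in samples:
                activation = sum(w * xi for w, xi in zip(weights, x)) + bias
                predicted = 1 if activation > 0 else 0
                if predicted != label:  # wrong guess: nudge the weights and retry
                    errors += 1
                    step = rate * (label - predicted)
                    weights = [w + step * xi for w, xi in zip(weights, x)]
                    bias += step
            if errors == 0:             # whole training set filed correctly
                break
        return weights, bias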

One of the drawbacks with the neural network approach is its complexity. If large numbers of words and phrases are required to differentiate documents into the various categories (i.e., the file topics are closely related), the network becomes large, increasing the processing power required to both train and use the system. The question of scalability should be investigated before settling on neural network-based categorization software.

Support Vector Machine (SVM)

A later entry in the stable of automatic categorization systems is a mathematical technique called support vector machines (SVM), which for purposes of explanation may be considered an enhancement to the k-nearest neighbor approach.

The k-nearest neighbor approach assumes that a sphere will appropriately represent the boundaries between various subject headings for electronic documents, but this may not be the case. An irregular shape might better represent the subject heading boundaries that the software will use to determine the best assignment for a new document.

The simulated SVM boundary graphic below shows a conceptual closed surface that might be generated by the SVM approach as a boundary for the Moby Dick cluster. Compare this boundary to the Moby Dick spherical boundary in the simulated spherical k-nearest neighbor boundaries and notice that there are several Dracula and Zarathustra documents erroneously classified in the Moby Dick spherical boundary; this number is greatly reduced by the SVM boundary.

Spheres may overlap, but irregular shapes may better partition the file headings. Overlap is not necessarily bad because the document in many cases should be considered in multiple categories. The objective of categorization techniques, however, is to associate the object being categorized with the single "best" category. SVMs attempt to balance these two concerns. While SVMs are showing some promise in automatic categorization, leading researchers in the field indicate that SVMs are no "silver bullet."
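
For illustration only, the following sketch builds an SVM classifier with the scikit-learn library (an assumption; the article names no toolkit). A radial basis function (RBF) kernel lets the learned boundary curve around a cluster rather than forcing a fixed spherical shape; the book-flavored sample documents echo the clusters discussed above.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC

    train_docs = ["the white whale sounded", "harpoons and whaling ships",
                  "the count slept in his coffin", "a castle in the Carpathians"]
    train_headings = ["Moby Dick", "Moby Dick", "Dracula", "Dracula"]

    vectorizer = TfidfVectorizer()
    # The RBF kernel allows an irregular, non-spherical decision boundary.
    model = SVC(kernel="rbf").fit(vectorizer.fit_transform(train_docs), train_headings)
    print(model.predict(vectorizer.transform(["a whale breached near the ship"])))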

* "Discover More: A Technical White Paper on the Stratify Discovery System" notes that "SVMs pay more attention to outlying training documents. Given high-quality training sets, SVMs will focus on the crucial documents that help define the borders of the group. With poor quality training sets, however, they tend to focus on erroneous outliers, and their performance suffers markedly."

* In an IEEE Intelligent Systems Magazine article, "SVMs - A Practical Consequence of Learning Theory," B. Schölkopf wrote: "We are still missing an application where SVM methods significantly outperform any other available algorithm or solve a problem that has so far been impossible to tackle."

The Rule-Based Systems Approach

Rule-based automatic categorization systems represent a different approach in that they do not require training. According to Fabrizio Sebastiani's A Tutorial on Automated Text Categorization, "Rule-based systems are popular because users of these systems can precisely define the criteria by which a document is classified. Rule-based systems can support complex operations and decision trees and produce very accurate results."

The rule-based approach's first phase is the definition of a set of rules that will be used during the classification phase to assign documents to subject headings. The rules are defined in the form of "IF conditional statement, THEN action."

Rule-based systems, unlike most pattern-recognition approaches, can also take advantage of document metadata to improve categorization accuracy (e.g., rules could be created that would assign all of the documents written by a particular author or written during a given date range to specific categories).

The rule-based system organizes the user-provided rules into a decision tree. A decision tree will only assign a document to the appropriate category if the rules are consistent; if they conflict with each other, the appropriate decision may not be reached. For example, rules 1 and 3 in the example below are not consistent. If the "nuclear" condition was the first tested by the decision tree, then rule 1 would classify the document in the "nuclear power" category even if the phrase "records manager" appeared later in the document. Rule 3, which would have properly classified the document, would not be tested because rule 1 asserted that only the condition "nuclear" was required to classify the document.

As the number of categories increases, the number and complexity of the rules must also increase to differentiate between categories. The number and complexity of the rules is also a function of how closely categories (subject headings) in the corpus are related to each other. For example, if documents in the corpus only contained two categories, records management and nuclear power, it might be possible to differentiate between them using rules 1 and 2. More complex rules, such as rules 3 and 4, would be needed to properly categorize documents discussing the management of nuclear records versus documents about managing nuclear power projects.
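
The rules table the article refers to is not reproduced in this copy, so the following Python sketch is a reconstruction of the conflict it describes, with invented rule conditions. Ordering the most specific rules first avoids the rule 1/rule 3 problem: a scan that tested rule 1 first would misfile nuclear-records documents under "nuclear power."

    RULES = [
        # (rule number, IF condition, THEN category) -- most specific rules first
        (3, lambda t: "nuclear" in t and "records manager" in t, "records management"),
        (4, lambda t: "nuclear" in t and "project manager" in t, "nuclear power"),
        (1, lambda t: "nuclear" in t, "nuclear power"),
        (2, lambda t: "records management" in t, "records management"),
    ]

    def categorize(text):
        """Walk the rule list in order; the first matching IF fires its THEN."""
        t = text.lower()
        for number, condition, category in RULES:
            if condition(t):
                return category
        return "uncategorized"

    # Rule 3 fires before rule 1, so this nuclear-records document is filed
    # correctly instead of falling into "nuclear power":
    print(categorize("The records manager archived the nuclear licensing files"))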

A major issue with rule-based systems is that the rules required to differentiate subject headings for large, complex collections of documents become difficult to manage and maintain consistently. In the article "Using SVMs for Text Categorization," S.T. Dumais wrote:

[Another] drawback of this "manual" approach to the construction of automatic classifiers is the existence of a knowledge acquisition bottleneck, similar to what happens in expert systems [a type of rule-based, artificial intelligence system]. That is, rules must be manually defined by a knowledge engineer with the aid of a domain expert (in this case, an expert in document relevance to the chosen set of categories). If the set of categories is updated, then these two professional figures must intervene again, and if the classifier is ported to a completely different domain (i.e., set of categories), the work has to be repeated anew.

The knowledge engineer is someone familiar with capturing rules and the internal workings of the vendor's rule-management system. The domain expert is the records manager. More than in any other automatic categorization approach, the records manager's understanding of records content and the file plan is critical to the success of a rule-based approach.

Automatic categorization technology, while the result of significant research and development, must prove cost effective in an operational environment. It can be successful only if the proper records management tasks are performed. (See sidebar below.)

Importance of the Training Set Definition

The accuracy of all automatic categorization systems is highly dependent upon the effort and care taken during the training or rule definition phase. In systems using pattern matching, the selection of a training set that accurately represents the content of each subject heading in the file plan is critical. Vendors suggest that training sets should contain from 10 to 50 documents for each subject heading. However, training set quality is more dependent on the content of the documents than on the number of documents per subject heading.

The content of documents selected for training sets should be highly representative of the subject headings for which they are chosen. It may even be necessary to create documents that meet this requirement. For instance, an existing electronic document might contain a paragraph that is not focused on the document's primary subject, but otherwise might be highly representative of the subject heading desired. Rather than using the document as is for the training set, the records manager could electronically copy the document, assign it a different but easily recognized name, delete the paragraph or paragraphs not consistent with the general subject, and use the newly created document for training purposes. After completion of the software package's training phase and testing phase, the created document could be deleted from the corpus.

For products that use Bayesian modeling, care should also be taken to create a training set where the ratio of documents associated with each subject heading is consistent with the number of documents under each subject heading in the original corpus. In other words, if a specific subject heading in the original corpus contains 7 percent of all the documents, then the training set for that subject heading should ideally contain 7 percent of the documents in the training set.
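
A minimal sketch of that proportional (stratified) sampling, with invented helper names; corpus_by_heading is assumed to map each subject heading to its documents in the original corpus.

    import random

    def stratified_training_set(corpus_by_heading, total_size):
        """Sample so each heading's share of the training set mirrors its
        share of the corpus (7% of the corpus -> roughly 7% of the set)."""
        corpus_size = sum(len(docs) for docs in corpus_by_heading.values())
        training = {}
        for heading, docs in corpus_by_heading.items():
            share = round(total_size * len(docs) / corpus_size)
            training[heading] = random.sample(docs, min(len(docs), max(1, share)))
        return training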

Guidelines for creating training sets include:

* The larger the number of subject headings in the corpus, the larger the training set should be.

* The more difficult it is to discriminate between subject headings in the corpus, the larger the training set should be.

* It is better to have fewer highly representative documents per subject heading than many poorly representative documents for the subject heading.

Training set creation is normally an iterative process. After the initial training phase is completed, a collection of test documents, whose proper assignment to subject headings is known, is input to the classification phase. The records manager then evaluates the classification phase's accuracy. If the results are not acceptable, the training set is modified, the system is retrained, and the test is repeated. The training set modification involves placing misclassified documents into the training set, associating them with the correct subject heading, retraining the system, and rerunning the test. Only after an acceptable misclassification error rate is achieved can the training phase be considered complete.
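
The iterative loop just described might be sketched as follows. The model object and its train and classify methods are placeholders for illustration; real products expose their own training interfaces.

    def refine_training_set(model, training_set, test_docs, true_headings,
                            max_error=0.05, max_rounds=10):
        """Train, test, fold misfiled documents back in, and repeat."""
        for _ in range(max_rounds):
            model.train(training_set)               # placeholder training call
            misfiled = [(doc, truth)
                        for doc, truth in zip(test_docs, true_headings)
                        if model.classify(doc) != truth]
            error_rate = len(misfiled) / len(test_docs)
            if error_rate <= max_error:             # acceptable: training complete
                return error_rate
            for doc, truth in misfiled:             # add each miss back under
                training_set.setdefault(truth, []).append(doc)  # its correct heading
        return error_rate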

Importance of Rule Definition

In rule-based systems, the iterative process of defining rules, testing, tuning, and re-testing rule performance against a test corpus is critical. The records manager's understanding of the document collection is the key to success in selecting the training set or defining the rules.

Rule-based automatic categorization system vendors generally supply software to aid the records manager in defining and managing rules. The records manager may not even be aware that he or she is generating rules; rather the records manager specifies words and phrases that, when encountered in a document, either cause the document to be assigned to a given category or prevent it from being assigned to the category. Some rule-based package vendors bundle pre-established taxonomies that have key words and phrases related to each category already defined. The records manager can fine-tune these lists by adding or deleting words and phrases. The software generates the rules and the associated decision tree for use in the classification phase.

Once the records manager has defined a set of rules, system performance, or rule accuracy, must be tested. The records manager runs the system's classification phase using an input test corpus where the proper categorization of the documents is known. The system's categorization results are then compared with the known, or correct, categorization of the test documents. If the level of error is not acceptable, the user must adjust the rules. Much like the development of the training set, this is an iterative process and must be repeated until an acceptable level of categorization accuracy is reached.
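
That test can be expressed in a few lines: run the classifier over a corpus whose correct headings are known and tally the errors per heading. The categorize function could be the rule-evaluation sketch shown earlier; the structure of test_corpus is an assumption for illustration.

    from collections import Counter

    def test_rules(categorize, test_corpus):
        """test_corpus: list of (document_text, correct_heading) pairs."""
        errors, totals = Counter(), Counter()
        for text, truth in test_corpus:
            totals[truth] += 1
            if categorize(text) != truth:
                errors[truth] += 1
        # Per-heading accuracy shows which rules need adjustment.
        return {heading: 1 - errors[heading] / totals[heading]
                for heading in totals}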

Is Automatic Categorization Ready for Prime Time?

Automatic categorization systems have limitations, as evidenced by an experiment performed by Microsoft Research. Microsoft used automatic categorization software to categorize 12,902 Reuters news stories into 118 categories. The stories had previously been manually filed into the 118 categories, providing a baseline for comparison with the automatic classification software results. The experiment was unusual in that it used 75 percent (9,603) of the stories to train the system to classify the remaining 25 percent (3,299), far in excess of the number traditionally recommended by vendors to train their systems, according to Dumais.

But there is a limit to the accuracy of automatic classification, even when trained with large numbers of documents, according to "Verity Intelligent Classification: Turn Information Assets into Competitive Advantage," a report by Prabhakar Raghavan.

According to a study by Microsoft Research, over 9,000 documents were required to teach Bayesian and neural network-based systems to classify new data with a maximum first category-level accuracy of 80 percent. When data is broken down into subcategories, the accuracy drops even further. For example, with only 80 percent of information correctly categorized in the first level of a hierarchy, at the second level only 64 percent will be in the appropriate subcategories (0.8 at the first level multiplied by 0.8 at the second, multiplied by 100 to convert to a percent). Accuracy drops to 51 percent at the third level, and 41 percent at the fourth. At $25 to $100 per document to reclassify manually, this is an expensive problem to fix. To limit their accuracy problems, some automatic classification systems restrict taxonomies to two levels. This solution attempts to limit the proportion of misclassified documents to 36 percent - over one-third of the entire corpus.
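
The compounding arithmetic in the passage is easy to verify for any per-level accuracy and hierarchy depth:

    per_level = 0.80          # first-level accuracy reported in the study
    for depth in range(1, 5):
        print(f"level {depth}: {per_level ** depth:.0%} correctly filed")
    # level 1: 80%, level 2: 64%, level 3: 51%, level 4: 41%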

These results are somewhat disappointing because many enterprise-wide file plans have far more than 118 subject headings, as well as multiple subject heading levels. Yet some researchers have reported success in the use of automatic classification systems. D. Schewe's article "Classifying Electronic Documents: A New Paradigm" points out that, despite its limitations in some application areas, automatic categorization can be a viable tool for supporting records management. According to Schewe:

Of the more than 90,000 e-mails and word processing documents [analyzed at the U.S. Department of Education], the records manager was unable to find even one whose categorization would have resulted in an incorrect retention period. The software did not successfully categorize all documents. Each cluster map provided for incorrectly categorized documents, which were examined as part of the process. Most were short documents that defy categorization even by the most experienced records manager.

The difference in the results of the Microsoft and Schewe examples may be explained by the scope and evaluation criteria of the two applications. Schewe limited the number of categories to a relatively small number: the focus was not on all possible data within the agency but instead on individual work groups, where the number of subjects addressed was limited by the work group's scope. Focusing on a particular office or work group keeps the number of clusters and subclusters smaller and, therefore, easier for the records manager to work with.

Also, Schewe's measure of success was with respect to retention period, not content.

For the demonstration project, the records manager examined all documents and e-mails to see that the software properly categorized them. Proper categorization was defined as ensuring that all documents that should be saved for a certain period of time according to the records retention schedule were placed in categories that were scheduled for that period, or longer.

Schewe's article illustrates that there are potentially useful records management applications of automatic categorization despite the limitations in current software systems. Schewe ameliorated many of these limitations by focusing at the work-group level rather than at the enterprise level, thus restricting the number of categories and concentrating on retention period as the key classification category rather than on subject heading.

Re-thinking Old Paradigms

Filing and retrieval of vast amounts of electronic records will require using automated tools such as automatic categorization. Records managers may have to re-think the current records management paradigm to facilitate the practical use of new automated tools in order to meet the legal and operational requirements necessitated by electronic records.

There may be a large class of problems for which automatic categorization is a powerful tool even with its limitations on accuracy. After all, the purpose of filing is to be able to retrieve information and support record retention requirements. In practice, combining automatic categorization and full-text indexing may make it possible to meet the information filing and retrieval needs of a large organization with fewer, well-differentiated categories. Automatic categorization can also be a powerful tool in applications where the cost of miscategorization is not significant.

Deciding whether to adopt automatic categorization will require records managers to

* identify the specific application, e.g., retrieval, browsing, discovery, information organization, routing, filtering, retention management

* determine an acceptable level of categorization error for the given application

* evaluate any policy changes that might improve categorization performance (e.g., using fewer file subject headings for electronic files or managing at work-group levels)

* experiment with and evaluate multiple products, spending maximum effort on defining the training set or rules

* estimate the life-cycle costs of maintaining the system (e.g., the cost of defining new rules and/or refiling misclassified documents)

* select the best tool based on empirical results and cost/benefit analysis

The number of products that provide automatic categorization continues to increase. As with most new technologies, the market is still shaping up; new entrants, mergers, and acquisitions make it difficult to keep track of product names and owners, but it is important to research various products. Many vendor Web sites also contain white papers that provide additional helpful background information on automatic categorization.

Products have evolved from different applications of automatic categorization: portal management, information retrieval, information management, and records management. A specific vendor product may be stronger than others for a given application; it is worthwhile to understand which market the specific product is attempting to address. Some records management software packages include automatic categorization functionality. Records managers should work with their information technology departments when considering purchasing a given product to determine its ease of integration with existing products.

According to Geoffrey Bock, "There is no commercially oriented benchmark for determining the effectiveness of one particular text-analysis solution or another. Thus, a company choosing between [a product] and its competitors has to do extensive comparisons on its own to determine the costs and benefits of alternative approaches."

In the final analysis, one should not expect miracles from automatic categorization. Look for applications where it can assist in improving efficiencies while taking into account its limitations. Determine the requirements and measures of success for the given application. Test and evaluate before making a commitment. Remember that humans make a significant number of errors when filing. Human accuracy, not perfection, should be the benchmark for measuring automatic categorization.

References

Bock, Geoffrey. "Meta Tagging and Text Analysis from ClearForest: Identifying and Organizing Unstructured Content for Dynamic Delivery through Digital Networks." Patricia Seybold Group. 21 February 2002.

Dumais, S.T. "Using SVMs for Text Categorization." IEEE Intelligent Systems Magazine. July/August 1998.

Lubbes, R. Kirk. "Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management." The Information Management Journal. September/October 2001.

Meyers, J. "Automatic Categorization, Taxonomies, and the World of Information: Can't Live With Them, Can't Live Without Them." E-Doc. November/December 2002.

Raghavan, Prabhakar. "Verity Intelligent Classification: Turn Information Assets into Competitive Advantage." Verity Inc. November 2000.

Schewe, D. "Classifying Electronic Documents: A New Paradigm." The Information Management Journal. March/April 2002.

Schölkopf, B. "SVMs - A Practical Consequence of Learning Theory." IEEE Intelligent Systems Magazine. July/August 1998.

Sebastiani, Fabrizio. A Tutorial on Automated Text Categorization. Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria 46, 56126 Pisa, Italy.

Stratify Inc. "Discover More: A Technical White Paper on the Stratify Discovery System." August 2001.

READ

Lubbes, R. Kirk. "Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management." The Information Management Journal. September/October 2001.

R. Kirk Lubbes, CRM

R. Kirk Lubbes, CRM, is President of Records Engineering LLC, in Reston, Virginia. He may be contacted at klubbes@recordsengineering.com.

Copyright Association of Records Managers and Administrators Inc. Mar/Apr 2003
Provided by ProQuest Information and Learning Company. All Rights Reserved.
