Article information

Title: Automatic categorization: How it works, related issues, and impacts on records management
Author: Lubbes, R Kirk
Journal: Information Management Journal
Print ISSN: 0265-5306
Publication year: 2001
Volume: Oct 2001
Publisher: Institute for the Management of Information Systems

Automatic categorization: How it works, related issues, and impacts on records management

Lubbes, R Kirk

AT THE CORE

THIS ARTICLE EXAMINES:

*how automatic categorization and other document tools are impacting records management

*the strengths and potential limitations of automatic categorization

*the importance of categorization accuracy

A records manager's primary responsibility has always been to process unstructured data. The increase in unstructured documents and the rising share of that material that is electronic have created an environment in which records managers can no longer manage records without new, automated tools at their command.

Automatic categorization is currently being applied to electronic records management. Anyone hoping to effectively apply categorization needs to understand how automatic categorization works, its benefits, its limitations, and the potential impact it has on recordkeeping operations. Ultimately, automatic categorization and other text analytical tools will provide potential new career opportunities for records managers.

In order to better understand the process of automatic categorization, these key terms should be defined:

Categorization: The assigning of an object to a pre-existing subject heading in a file plan or assigning it to a given class within the taxonomy (also called classification)

Cluster: A group of objects with members that are more similar to each other than to members of any other group

Data visualization: A visual representation of corpus contents, often a topographical map or network of linked nodes

Structured data: Fielded data, or data that is generally contained in a relational database

Summarization: An abstract or a synopsis of a document

Unstructured data: Data not contained in fields (e.g., free text, audio, video, and images)

Over the last two decades, the computer's ability to process data has evolved from the domain of structured data to unstructured data. Structured data can include a series of tables with rows and columns. A formal mathematical model, or a relational model, defines the table structures and the complete set of operations that can be performed on the data.

Structured data represents less than 20 percent of the information available. More than 80 percent of all information resides in unstructured documents. Initially, this data could not be processed in its native form. Data elements contained in documents had to be extracted and entered into structured databases before they could be processed. The primary raison d'etre for forms is to be able to easily enter data into database management systems (DBMS). Products exist today to "read" data from forms, including intelligent character recognition (ICR), but this technology is actually processing structured data. For example, ICR depends upon the position of information in the form to determine the DBMS field into which the data is to be entered.

A records manager's primary responsibility has always been to process unstructured data, generally hardcopy documents. To create a file plan, the records manager analyzes a collection of documents and creates a taxonomy that adds structure to the collection. Assigning documents to a file requires an indexing clerk to extract keywords from the document. The creation of a records control schedule requires the records manager to extract the business and legal relevance of a file series. According to a report from Autonomy Corp., the volume of unstructured information is estimated to double every three months. This growth, together with the rising share of the material that is electronic, has created an environment in which records managers can no longer manage records without new tools at their command. These tools, as well as their advantages and limitations, are discussed later in the article. The focus will be primarily on text-based electronic records, including e-mail, Web URL documents, Word documents, PDF files, and text documents.

The Electronic Records Environment

An organization that has implemented a standard electronic file structure that is universally followed is extremely fortunate. In most organizations, each person has their own directory structure and e-mail folder structure. Some companies have implemented electronic records management systems (RMS), but most have used a day-forward approach, in which all newly generated and received records are placed under the automated RMS on a certain date. Electronic files on existing servers and electronic records in off-line storage (back-office files) are rarely addressed. Generally, metadata does not exist to place the back-office files under RMS control, and surveying the corpus is cost prohibitive. However, these documents are just as vulnerable to discovery as the newer documents.

Automatic categorization attempts to associate electronic records with either a predefined taxonomy or self-defining categories. An understanding of the strengths and potential limitations of automatic categorization in managing records is important if it is to be used successfully. A number of text analysis tools act as a suite to assist in this process. These include feature extraction, clustering, visualization, and summarization tools. Commercial off-the-shelf (COTS) products often combine these tools into a single categorization product, but to understand how these products work, it is crucial to understand each of the related technologies. Feature extraction and clustering are integral to the categorization process. Text visualization and text summarization are adjuncts to categorization, but they are useful for gaining insight into the collection before developing categories and for checking the quality of the categorization results.

Feature Extraction

The feature extraction process can be viewed as a series of filters through which the document passes. Each filter attempts to further reduce the document to its key conceptual elements and assign numeric values to these elements. The first filter segments the document into individual linguistic components. The next series of filters identify and eliminate words, phrases, and sentences that have low content value. Individual features that describe the remaining content are then identified by the feature extractor. Finally, the feature vector is generated for the feature set.
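The series of filters described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the stop-word list, the function names, and the three chosen features are hypothetical, and a real product would use far richer linguistic analysis.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def segment(text):
    """First filter: split raw characters into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def drop_low_content(tokens):
    """Next filter: discard tokens that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def feature_vector(tokens, features):
    """Final step: one frequency count per chosen feature, in a fixed order."""
    counts = Counter(tokens)
    return [counts[f] for f in features]

text = "The records manager files the records in the file plan."
tokens = drop_low_content(segment(text))
vector = feature_vector(tokens, ["records", "plan", "schedule"])
print(tokens)
print(vector)
```

Each function plays the role of one filter, so the document is progressively reduced from raw characters to a short list of numbers.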

To the computer, text is a collection of characters and nothing more. Words, phrases, sentences, and paragraphs must be identified through a parsing algorithm. The parser partitions the document into individual paragraphs and sentences and then into parts of speech. Computer programs perform these analyses to approximate the meaning of the text, which in turn helps determine the document's content.

Once text of low information content has been removed from the document, feature identification can begin. One feature type often used is the frequency of occurrence of words or phrases that have high discriminating value. Words that have a high discriminating value relate strongly to the subject of the given document but occur infrequently in documents that have a different meaning. A word may appear in several forms: singular, plural, prefixed, suffixed, and hyphenated. Stemming is the process used to normalize all of these word forms to a common root, which improves the accuracy of the frequency counts.
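The effect of stemming on frequency counts can be shown with a toy example. The suffix rules below are an assumption made for illustration and are far cruder than the stemmers used in practice; the name `naive_stem` is hypothetical.

```python
from collections import Counter

# Toy suffix rules, checked in order; not a production stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "record records recorded recording schedule schedules".split()
raw_counts = Counter(words)                              # every form counted separately
stemmed_counts = Counter(naive_stem(w) for w in words)   # forms merged by root
print(raw_counts["record"], stemmed_counts["record"])
print(stemmed_counts["schedule"])
```

Without stemming, each form is counted once; after stemming, the four forms of "record" contribute to a single count.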

Because authors use different words to refer to the same concept to reduce redundancy, there is a negative impact on the frequency count associated with the concept. Therefore, it is often desirable to count concepts rather than individual words. The frequency of occurrence of discriminating words or phrases, along with other metrics, can provide a structured representation, or signature, for the document. These signatures can then be manipulated by the computer and used to assign the associated documents to appropriate categories.
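One simple way to count concepts rather than individual words is to map synonymous phrases to a shared label before counting. The sketch below assumes a small, hand-built thesaurus; the `CONCEPTS` table and the `concept_counts` function are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical thesaurus: each surface phrase maps to one concept label.
CONCEPTS = {
    "retention schedule": "records control schedule",
    "disposition schedule": "records control schedule",
    "records control schedule": "records control schedule",
}

def concept_counts(text, concepts=CONCEPTS):
    """Count occurrences per concept by summing over all of its phrases."""
    lowered = text.lower()
    counts = Counter()
    for phrase, concept in concepts.items():
        counts[concept] += lowered.count(phrase)
    return counts

text = ("The retention schedule was revised. The new disposition schedule "
        "replaces the old records control schedule.")
counts = concept_counts(text)
print(counts["records control schedule"])
```

Each phrase appears only once in the sample text, yet the concept itself is counted three times, which is the correction the paragraph above describes.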

A vector is a physical quantity that has a magnitude and direction. Vectors are represented as a series of numbers, in which each number represents a magnitude or distance in a given direction. A feature vector is used to represent the document's feature set or the document's signature. Each number in the vector is a magnitude associated with a feature. If a categorization system determined that "file plan," "records inventory," and "records control schedule" were highly discriminating features for the entire corpus that the system was to address, then every document in the corpus would have an element in its associated feature vector for each of these three records-management oriented features. Tables 1 and 2 illustrate this concept.

The five articles are a small corpus. Column 1 lists the titles of the five documents. Columns 2 through 5 are the features that the categorization system has decided to use to determine the feature vectors to represent each article in the corpus. The feature values are the frequencies with which each of the phrases occurs in each article. The feature vectors are listed in Table 2.
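As a rough illustration of how phrase frequencies become feature vectors, the sketch below counts the three phrases named earlier in two invented documents. The document titles and texts are made up for this example and are not the entries of Tables 1 and 2.

```python
# Features taken from the article's example; order fixes the vector layout.
FEATURES = ["file plan", "records inventory", "records control schedule"]

def phrase_vector(text, features=FEATURES):
    """Return the frequency of each feature phrase in the text."""
    lowered = text.lower()
    return [lowered.count(phrase) for phrase in features]

# Invented two-document corpus, for illustration only.
corpus = {
    "Doc A": "A file plan starts with a records inventory. The file plan evolves.",
    "Doc B": "The records control schedule governs retention.",
}
vectors = {title: phrase_vector(text) for title, text in corpus.items()}
for title, vec in vectors.items():
    print(title, vec)
```

Every document receives a vector of the same length, with a zero wherever a feature does not occur.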

Remember that vectors have a magnitude and direction. One of the challenges is to visualize how the categorizer works. Most people can only visualize three dimensions: length, width, and depth. Each feature in the feature vector represents a dimension. Even in this simple example, each feature vector lies in a six-dimensional space. In a real classification system, feature vectors may have several hundred elements, and therefore the vectors exist in a space with several hundred dimensions. Fortunately, the same mathematical principles that work in three-dimensional space also work in n-space (see Records Inventory Axis figure, next page).
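The claim that three-dimensional mathematics carries over to n-space can be seen in the ordinary Euclidean distance formula, which is written the same way for any number of dimensions. The sketch below uses a hypothetical helper named `distance`.

```python
import math

def distance(u, v):
    """Euclidean distance between two endpoints with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d3 = distance([0, 0, 0], [1, 2, 2])                     # three dimensions
d6 = distance([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0])   # six dimensions
print(d3, d6)
```

The function body never mentions the number of dimensions, so it would work unchanged on vectors with several hundred elements.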

Clustering

Feature extraction results in a collection of feature vectors, one for each document in the collection. If the endpoints of a group of feature vectors are closely grouped together, the documents they represent form a cluster; that is, they are about the same topic. It is possible to determine which subjects are contained in a corpus by calculating the feature vectors for each document and then partitioning the feature vector endpoints into clusters. The user of the clustering software specifies the maximum distance allowed between feature vector endpoints for them to be considered part of the same cluster. Each cluster can be envisioned as a sphere in n-space. The center of the sphere is called the centroid, and the radius of the sphere is the user-specified, cluster-defining distance. The centroid provides a mathematical signature for the topic of the documents forming the cluster.
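A minimal sketch of distance-based clustering follows. It assumes one simple scheme: each vector joins the first cluster whose centroid lies within the user-specified radius, and otherwise starts a new cluster. Commercial products may use different algorithms; the function names and sample vectors are hypothetical.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of the member vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cluster(vectors, radius):
    """Group vectors whose endpoints fall within `radius` of a cluster centroid."""
    clusters = []  # each cluster is a list of member vectors
    for v in vectors:
        for members in clusters:
            if distance(v, centroid(members)) <= radius:
                members.append(v)
                break
        else:
            clusters.append([v])  # no cluster is close enough; start a new one
    return clusters

docs = [[5, 0, 1], [4, 1, 0], [0, 6, 5], [1, 5, 6]]
groups = cluster(docs, radius=2.0)
for members in groups:
    print(centroid(members), members)
```

With these sample vectors, the first two documents and the last two documents form separate clusters, each summarized by its centroid.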

Visualization

Text visualization displays the relationships between documents in a corpus. It is difficult to visualize more than three dimensions, and feature vectors may have dimensionality in the hundreds of features. The Pacific Northwest National Laboratory of the Department of Energy is an example of one organization working in the area of text visualization. Two of their products are outlined below: Galaxies and Themeview.

Galaxies provides an overview of the corpus. Each point in the display is a feature vector end-point representing a document in the corpus. The bright areas are the result of documents being closely associated with each other. These are document clusters and represent naturally occurring topics within the corpus. Clusters can be selected for further analysis through the use of Galaxies' analytical tools, which can test whether an organization's file plan is appropriate for a given corpus by comparing the naturally occurring clusters with the documents and terms normally associated with each of the file plan categories. File plan categories can be used to define Galaxies' search criteria, plotting only compliant feature vectors. The resulting graphic can be used to determine whether the documents in the displayed clusters match the categories in the file plan.

Clusters in the display can be selected graphically for further analysis through Themeview, which provides another visualization perspective to view the associations of feature vectors. Themeview also provides a powerful set of analytical tools to support further analysis of the visual display.

There are a number of other products providing text visualization tools and many different approaches that provide the user with a means to analyze a corpus' contents. All of these tools use document features to determine the associations between various documents and the concept of clustering to determine topics or themes.

Summarization

Summarization is another text analysis tool that supports the user in reviewing a large number of documents. The simplest version of a summarization tool provides a title listing of the documents associated with selected areas from the display using available metadata. Key phrases extracted from the selected documents can also be used to generate a list of titles without the existence of metadata. More sophisticated summarization tools use gisting techniques to generate a narrative summary of the document. Summarization techniques can assist the records manager in determining and refining the required training sets for the categorization system.

Categorization

Categorization systems can file documents into multiple categories. The system does this by using a user-defined distance parameter as the radius of a sphere surrounding each subject heading's centroid. If the endpoint of a candidate document's feature vector lies within any subject heading's sphere, the document is filed within that subject heading.
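The sphere test described above can be sketched as follows. The subject headings, centroid values, and radius are invented for illustration, and `categorize` is a hypothetical function name.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def categorize(vector, centroids, radius):
    """Return every subject heading whose sphere contains the vector's endpoint."""
    return [heading for heading, center in centroids.items()
            if distance(vector, center) <= radius]

# Invented centroids for three subject headings.
centroids = {
    "File plans": [4.0, 1.0, 0.0],
    "Inventories": [1.0, 4.0, 0.0],
    "Schedules": [0.0, 0.0, 5.0],
}

both = categorize([3.0, 3.0, 0.0], centroids, radius=2.5)
one = categorize([0.0, 0.0, 4.5], centroids, radius=2.5)
none = categorize([10.0, 10.0, 10.0], centroids, radius=2.5)
print(both, one, none)
```

Because the spheres can overlap, a single document may be filed under more than one heading, and a document that falls outside every sphere is left unfiled.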

While automatic categorization seems straightforward in theory, in practice, it is not. Its accuracy is highly dependent upon the selection of the proper training set for each of the subject headings in the file plan. A training set is the collection of documents selected to represent a subject heading and is then used to determine its associated centroid. This is an empirical problem and one in which the records manager must play the key role.

A set of representative documents is selected for each of the file plan's subject headings. The training function of the categorization system then calculates the centroid associated with each set. The accuracy of the training set selection can be evaluated by automatically categorizing the corpus from which the training sets were taken. If the training sets are perfect, all of the electronic documents previously filed in each subject heading will be assigned to that heading by the automatic classification system. This is very unlikely to happen for two reasons: 1) humans file documents incorrectly and 2) the training sets are not perfect. The training sets can be tuned by deleting documents and adding others until acceptable results are achieved when tested against the existing electronic file plan. This process should be performed by the records manager. Once acceptable results have been achieved, new documents can be categorized using the centroids created by the training set.
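The train-then-test loop described above can be sketched as follows. As a simplification, this example assigns each document to its nearest centroid rather than applying the sphere test; the headings, vectors, and function names are all hypothetical.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def train(training_sets):
    """Compute one centroid per subject heading from its training documents."""
    return {heading: [sum(c) / len(vectors) for c in zip(*vectors)]
            for heading, vectors in training_sets.items()}

def nearest_heading(vector, centroids):
    """Simplification: pick the single closest centroid."""
    return min(centroids, key=lambda h: distance(vector, centroids[h]))

def agreement(filed, centroids):
    """Fraction of documents whose automatic heading matches the human filing."""
    hits = sum(1 for heading, vec in filed if nearest_heading(vec, centroids) == heading)
    return hits / len(filed)

training_sets = {"File plans": [[5, 0], [4, 1]], "Schedules": [[0, 5], [1, 4]]}
centroids = train(training_sets)

# Existing human filing; the second document looks misfiled.
filed = [("File plans", [5, 1]), ("File plans", [1, 5]),
         ("Schedules", [0, 4]), ("Schedules", [2, 3])]
score = agreement(filed, centroids)
print(centroids, score)
```

A score below 1.0 does not say which side is wrong: the disagreement may come from a human misfiling or from an unrepresentative training set, which is why the review belongs with the records manager.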

Automatic categorization systems can generate their own sets of categories through the use of clustering. These self-defined categories can then be fine-tuned by stripping out documents that are not relevant. The refined set of documents can be used as a training set, recalculating the centroids for each category. The fine-tuning of the categories should be performed by the records manager, who should determine if the self-defined categories meet business needs.

The level of investment in building a training set should not be underestimated. Generally, the accuracy of the automatic categorization increases with the size of the training set. Given that cost estimates of reclassifying documents range from $25 to $100 per document, the cost of building a training set can be a significant one-time cost. Not developing a representative training set, however, will result in a significantly higher recurring cost.

Categorization accuracy is an important issue with automatic categorization systems. A categorization accuracy of 80 percent is considered fairly good for an automatic categorization system. However, this metric relates to a single level of the taxonomy. Statistically, the accuracy at any given level of the taxonomy would be the accuracy of the categorization system raised to the power of the hierarchy level in which the subject heading resides. For example, if the subject heading was in the third level and the accuracy of the categorization system was 80 percent, the expected accuracy for the proper assignment of a document would be about 51 percent.
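The arithmetic behind the 51 percent figure can be checked directly. The sketch assumes, as the paragraph above implicitly does, that the decision at each level of the hierarchy is independent and equally accurate; `expected_accuracy` is a hypothetical helper name.

```python
def expected_accuracy(per_level_accuracy, level):
    """Accuracy compounded over `level` independent levels of the taxonomy."""
    return per_level_accuracy ** level

for level in (1, 2, 3):
    print(level, round(expected_accuracy(0.8, level), 3))
```

At 80 percent per level, the expected accuracy falls to 64 percent at the second level and to about 51 percent at the third, which is why deep file plans place heavy demands on per-level accuracy.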

Records Management Implications

Understanding the strengths and potential limitations of automatic categorization is important if it is to be used successfully. The records manager must play a key role in establishing an automatic categorization system. Only the records manager understands the filing system and its applicability to the business enterprise, and this knowledge must be embedded in the categorization system's knowledge base. The records manager must learn new skills in order to use this important tool and to contribute to its integration into the records management process.

Editor's Note: The products mentioned in this article serve as examples and do not constitute endorsement by ARMA International.

REFERENCES

Autonomy Corp. "Autonomy Technology White Paper." www.autonomy.com.

Hobbs, Jerry R. "Generic Information Extraction System." Artificial Intelligence Center SRI. www.itl.nist.gov/iaui/894.02/related-projects/tipster/gen_ie.htm.

"Feature Extraction." www.case.ogi.edu/class/cse580ir/handouts/6-20/text_pcessing/sId014.htm.

"SPIRE: A new visual text analysis tool." www.pnl.gov/infoviz.

Turban, Efraim. Decision Support and Expert Systems, Management Support Systems, 2nd Edition. Prentice Hall. Upper Saddle River, New Jersey, 1990.

ABOUT THE AUTHOR: R. Kirk Lubbes, CRM, is President of Records Engineering LLC. He has held positions at Pattern Analysis and Recognition Technology, The Analytical Sciences Corp., and Disclosure Inc. and has managed programs for the National Security Agency, Central Intelligence Agency, and the Air Force in the areas of sensor exploitation, information retrieval, text processing, data visualization, and records declassification. Lubbes has been working as an information technology contractor and consultant for more than 35 years. He can be reached at klubbes@recordsengineering.com.