Article information

Title: Working with data: Discovering knowledge through mining and analysis
Author: Qin, Jian
Journal: Bulletin of the American Society for Information Science
Year of publication: 2000
Volume/issue: Oct/Nov 2000
Publisher: American Society for Information Science and Technology

Working with data: Discovering knowledge through mining and analysis

Qin, Jian

The definition of knowledge discovery in databases (KDD), given by Fayyad, Piatetsky-Shapiro and Smyth in their article "Data mining and knowledge discovery in databases" in the Communications of the ACM [39(11), 1996, 24-26], is probably the most frequently cited in the KDD literature. They defined KDD as a "nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." In their view, data and patterns are the starting and ending points, respectively, of a KDD process, with large volumes of data processing, iterative testing and analysis in between. The process is considered nontrivial in that the data analysis goes beyond mere quantitative computing; the goal is to search for structures, models, patterns, associations or parameters. The end result, "patterns," should be valid for new data with some degree of certainty. The patterns should also be novel and potentially useful for users.

Information professionals work with databases every day. These databases number in the hundreds and thousands and store millions of records and documents. The patterns that can be uncovered from such databases are a form of knowledge, which Thomas H. Davenport and Laurence Prusak defined in their 1998 book Working Knowledge as a

... fluid mix of framed experience, values, contextual information, and expert insight that provides a framework for evaluating and incorporating new experiences and information. It originates and is applied in the minds of knowers. In organizations, it often becomes embedded not only in documents or repositories but also in organizational routines, processes, practices, and norms [p.5].

Databases contain current and/or legacy data on business transactions, directories, personnel, products, publications, scientific field observations and experimental records; the list goes on. But converting these data into "a fluid mix of framed experience, values, contextual information, and expert insight" involves some "nontrivial" processes of computation and analysis.

KDD processes are usually heavy in mathematics and statistics, but most of these computational tasks can be achieved by using some user-friendly yet powerful computer software. Though the computer software will take care of the math and statistics for us, certain things will have to be worked out through the human brain. For example, we will need to figure out for ourselves what data we want to feed into the software and what kinds of patterns we expect to find, as well as decide whether or not the result is valid, novel, potentially useful and understandable.

This special section of the Bulletin of the American Society for Information Science, which focuses primarily on KDD applications in knowledge management and information retrieval, provides insight into the importance of human judgment in the KDD process.

As an introduction Igor Jurisica's paper presents a concentrated overview of KDD concepts, processes and applications in the domain of knowledge management. He starts with a discussion of how KDD relates to knowledge management systems and how it can support knowledge management. He provides an overview of basic KDD concepts and indicates that knowledge management in complex and dynamic domains benefits from extending traditional approaches with automated methods. Finally, he gives examples of applications of knowledge discovery that are relevant to knowledge management operations.

We next turn to an example. Text mining, as Elizabeth Liddy explains in her paper, is a sub-specialty of the broader domain of KDD. The nature of text data, whether structured or unstructured, written in an academic style or in the manner of spoken language, makes it a natural candidate for natural language processing (NLP) applications. Liddy describes the relationships between NLP and text mining and how NLP can be applied in mining useful knowledge from text data, where the "useful knowledge" is used to provide a broader range of information access and analytical capabilities in information retrieval systems. The role of NLP in the text-mining process involves analyzing and representing naturally occurring texts at all levels of linguistic analysis for the purpose of achieving human-like language processing. The three steps of a text-mining process bear some resemblance to the general KDD process described in Jurisica's paper, but the objects and methods for processing are unique to NLP.

Continuing our examination of IR-related applications, Jim Jansen and Amanda Spink discuss how to uncover useful information about user search patterns on the Web. The data sources for their KDD task are search logs from the Excite information retrieval system, a major Web-based search engine. They describe the challenge of obtaining sets of analyzable data from this huge database and some of the results of their project.

Uta Priss strikes a more technical note in her contribution. Determining the level of support for discovered patterns is an important procedure in a KDD process. That is, if you discover an association rule in your data-mining project, how can you decide if this association rule is nontrivial? And if it is not trivial, how can you measure the significance of your finding? Priss demonstrates how a method - Formal Concept Analysis (FCA) - can be used to create computational algorithms that will help users find out how strongly an association rule is supported.
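To make the question of "how strongly an association rule is supported" concrete, the following is a minimal sketch of the two conventional measures for an association rule, support and confidence. It is not the Formal Concept Analysis method that Priss presents; it only illustrates the kind of quantity such a method helps to compute. The function names and the toy records are assumptions made for illustration and do not come from any paper in this section.

```python
from typing import FrozenSet, Sequence


def support(records: Sequence[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of records that contain every item in `itemset`."""
    if not records:
        return 0.0
    hits = sum(1 for record in records if itemset <= record)
    return hits / len(records)


def confidence(
    records: Sequence[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Among records containing `antecedent`, the share that also contain `consequent`."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0
    return support(records, antecedent | consequent) / base


# Hypothetical toy data: subject terms assigned to four documents.
records = [
    frozenset({"data mining", "statistics"}),
    frozenset({"data mining", "statistics", "databases"}),
    frozenset({"data mining", "databases"}),
    frozenset({"statistics"}),
]

# Rule under test: "data mining" => "statistics"
rule_support = support(records, frozenset({"data mining", "statistics"}))
rule_confidence = confidence(
    records, frozenset({"data mining"}), frozenset({"statistics"})
)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy set the rule holds in two of the four records, and in two of the three records that contain "data mining." Whether figures like these make a rule nontrivial or significant remains a matter of judgment for the analyst.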

The last paper in this section is a perspective paper by Jay Norton. Databases are data mines and the source for KDD processes. Many of these databases were not designed specifically for KDD, which creates some obstacles for KDD analysis. Norton looks at all the aspects of a database - design, structures, implementation, data entry and collections, and users - in relation to how these aspects might affect KDD. The obstacles that result from database design and data collection defects, in turn, create difficulties for automatic KDD in generating usable data sets and reliable results. In response to these challenges, Norton's paper emphasizes the need for human intelligence in planning, implementing and evaluating KDD methods and processes.

Suggested Reading about Knowledge Discovery in Databases supplied by Igor Jurisica

Bramer, M. A. (Ed.). (1999). Knowledge discovery and data mining: Theory and practice. London: Institution of Electrical Engineers (IEE).

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Artificial Intelligence Magazine, 17(3), 37-54.

Freitas, A. A. & Lavington, S. H. (1998). Mining very large databases with parallel processing. Boston: Kluwer.

Mena, J. (1999). Data mining your Website. Boston: Digital Press.

Michalski, R. S., Bratko, I., & Kubat, M. (1998). Machine learning and data mining: Methods and applications. New York: Wiley.

Murty, M. N. & Jain, A. K. (1995). Knowledge-based clustering scheme for collection management and retrieval of library books. Pattern Recognition, 28(7), 949-963.

Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann.

Wang, J. T. L., Shapiro, B. A., & Shasha, D. (Eds.). (1999). Pattern discovery in biomolecular data: Tools, techniques, and applications. New York: Oxford University Press.

Witten, I. H. & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.

Web resource

KDNuggets Web repository. www.kdnuggets.com

Jian Qin is in the School of Information Studies, Syracuse University, 4-232 Center for Science & Technology, Syracuse, NY 13244; telephone: 315/443-5642; fax: 315/443-5806; e-mail: jqin@yin Ca).qyn.edt