Metadata - Information about electronic records
Phillips, John T Jr
One of the most challenging issues facing information professionals is an explosion in the varieties of information to be managed. Gone are the days when a large complex business could survive by processing paper-based checks, purchase orders, or correspondence. Technologies such as electronic funds transfer, electronic data interchange, and electronic mail are making some paper-based transactions and communications both quaint and costly. As an example, many financial institutions increasingly prefer that their customers use automated teller machines (ATMs) for financial transactions, because ATMs save labor costs for the institutions and reduce the time that customers must spend waiting in line. However, information that is processed entirely electronically must be documented to provide evidence that the transaction actually occurred and to assure protection of the information assets of the enterprise.
How does one describe an electronic record, a computer data file, or even a piece of paper that contains information of interest for use or preservation? Does one employ the same terminology to discuss information about a person's name in a computer's mailing label database as one might use to describe the same data residing in a formatted "box" on a paper form? Is the text "John Q. Public" the same information regardless of the media (paper or electronic) or the data format (first name followed by last name)?
Answers to these questions will have a major impact on all information management professions due to a shocking realization--information will increasingly be managed without allowing its ephemeral physical format to dominate information management procedures and techniques. Content transcends form. Information value is derived from the usefulness of the data, not the intrinsic value of the media on which the data is stored. Information can be more useful, informative, and inexpensive when managed in a manner that avoids a reliance on any one physical storage medium. Information management principles and procedures should be applicable to all media, thus allowing the user to choose the best medium for creation, use, and storage. If paper-based information works the best--use paper. If optical disk storage is best--use that technology. However, the principles of information management should be the same.
DESCRIBING DIGITAL DATA
To manage information seamlessly, without regard to physical format, requires developing a generic means of describing information. This "information about information" serves as a tool for managers and users of data to get a handle on the raw materials with which they work. Such an information description standard can be used to develop integrated methodologies for creating, using, and storing paper documents, complex electronic records, or unformatted data in databases. Without a generic means of describing information, it will be very difficult to inventory digitally stored information or to compare electronic documents to their hard copy counterparts. To create a records retention schedule that applies to both computer-based records and paper files requires an ability to specify a report, purchase requisition, memo, or letter in a manner that makes the physical recording medium secondary to the information content.
For example, some common "data elements" used to describe a file of correspondence could be the series name, series description, records date, records volume, expected growth rate, frequency of use, retention status, media description, and requirements for use. These "data elements are the building blocks for all data processing systems."(1) A data element is a "unit of data that, in a certain context, is considered indivisible."(2) For both paper documents and computer disk files, the series name, description, date, growth rate, frequency of use, and retention status should be identical. The series name could be "internal correspondence," the description could be "memos and letters," the date might be "1/1/94-12/31/95," the growth rate might be "10 files per week," the frequency of use might be "weekly," and the retention status could be "retain for five years." These data elements and their descriptions can be applied to both computer files and paper files, thus creating a common description of the information contained in a document or file that is not media dependent. The raw data itself is considered to be a "value" or "instance" of the data within its expected "domain" or "range" of values. "For example, green, 4'6", 5 seconds, 404-33-4059, and tall are all examples of data. The set of all possible legal values that a data element can assume is referred to as its domain."(3)
The data elements that are used to describe the physical and operational aspects of the files begin to give insight into how the same information characteristics can be used to describe different media. The volume, media description, and requirements for use will contain different descriptions for paper documents as opposed to electronic files. For instance, paper file volumes might be described as "four file boxes." Electronic files might be described as "4 megabytes of data in 175 files." A media description data element could help differentiate between "floppy disks" for digitally recorded information and "standard letter-sized paper" for documents in file folders. A requirements-for-use data element will clearly distinguish between records that "require MS-DOS, Windows, and Word for Windows software" and those paper records that have "no special requirements" for viewing or use. These description fields show that some files are hard copy, whereas others are data that were electronically recorded. They also provide documentation of the computer system or software required to view the electronically recorded files. The important factor is that the same information fields (data elements) are captured for all information or records, regardless of format.
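The media-independent description outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name SeriesDescription, its field names, and the helper media_dependent_differences are hypothetical and do not come from any standard. The values are the examples given in the text.

```python
from dataclasses import dataclass, asdict


@dataclass
class SeriesDescription:
    """Hypothetical record series description; one field per data element."""
    series_name: str
    series_description: str
    records_date: str
    records_volume: str
    growth_rate: str
    frequency_of_use: str
    retention_status: str
    media_description: str
    requirements_for_use: str


# Elements the text says should be identical for paper and electronic files.
MEDIA_INDEPENDENT = (
    "series_name", "series_description", "records_date",
    "growth_rate", "frequency_of_use", "retention_status",
)

paper = SeriesDescription(
    series_name="internal correspondence",
    series_description="memos and letters",
    records_date="1/1/94-12/31/95",
    records_volume="four file boxes",
    growth_rate="10 files per week",
    frequency_of_use="weekly",
    retention_status="retain for five years",
    media_description="standard letter-sized paper",
    requirements_for_use="no special requirements",
)

electronic = SeriesDescription(
    series_name="internal correspondence",
    series_description="memos and letters",
    records_date="1/1/94-12/31/95",
    records_volume="4 megabytes of data in 175 files",
    growth_rate="10 files per week",
    frequency_of_use="weekly",
    retention_status="retain for five years",
    media_description="floppy disks",
    requirements_for_use="require MS-DOS, Windows, and Word for Windows software",
)


def media_dependent_differences(a, b):
    """Return the data elements whose values differ between two descriptions."""
    a_values, b_values = asdict(a), asdict(b)
    return {
        name: (a_values[name], b_values[name])
        for name in a_values
        if a_values[name] != b_values[name]
    }


differences = media_dependent_differences(paper, electronic)
print(sorted(differences))
```

Comparing the two descriptions reports only the volume, media description, and requirements for use as different, which is the point of the passage: the same set of data elements covers both formats, and only the physical and operational elements vary.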
The term "metadata" has been used often since the early 1980s by the computer software and systems development community to describe the information required to document the characteristics of information contained within databases.
"Data about data are referred to as metadata."(4) The definition has not changed much over the years--"In databases, data that describe data objects."(5) Field names, lengths, types, and other characteristics are all "data about data" and the term "metadata" implies "higher level information." Information systems developers must know very precisely how long a database field (or data element) is to determine how wide a report page must be to display the information contained in that field. They must also know if the information is to be used for calculations (numeric data) or just to be stored as a character string of searchable text. In some cases, the information contained within the data element may be a "binary large object" or BLOB. In this case, for the software to display that information, it may have to call on a special program to display the BLOB for view by the system user. All of these data characteristics must be documented so that teams of programmers can have a common knowledge of the data with which software must interface. However, most metadata used by software developers is focused on helping them get a handle on the interaction between data elements in a database and the software programs that will access the data. As an example, they are interested in such questions as "Can a single purchase order contain more than one item to be purchased?" (Such a relationship would be considered a one-to-many relationship between the vendor data and the item data--one vendor and many items.) All of this information is generally stored in a computer database called a "data dictionary." "In the broadest sense, a data dictionary is any organized collection of information about data."(6) Systems developers are rarely interested in how long the data within the purchase order must be kept to meet the requirements of a business procedure or a government regulation.
In this mode, they are similar to a person with a shovel who is digging a trench according to specifications--two feet wide and four feet deep. Where the trench leads or what it will be used for, they may not know or care.
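As a rough illustration of the developer's view of metadata described above, the following Python sketch models a tiny data dictionary and uses it to answer two of the questions mentioned: how wide a report must be, and which fields can be used in calculations. The field names, lengths, and the two-character column gap are assumptions made up for this example, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """One data dictionary entry: metadata about a single database field."""
    name: str
    data_type: str  # "numeric", "character", or "blob"
    length: int     # maximum display width in characters


# Hypothetical entries for a purchase order table.
DATA_DICTIONARY = [
    FieldDefinition("po_number", "character", 10),
    FieldDefinition("vendor_name", "character", 30),
    FieldDefinition("item_count", "numeric", 4),
    FieldDefinition("total_amount", "numeric", 12),
]


def report_width(fields, gap=2):
    """Width of a one-line report: each column is as wide as the longer of
    its heading and its data, with `gap` spaces between columns."""
    widths = [max(f.length, len(f.name)) for f in fields]
    return sum(widths) + gap * (len(widths) - 1)


def numeric_fields(fields):
    """Names of the fields that may be used in calculations."""
    return [f.name for f in fields if f.data_type == "numeric"]


print(report_width(DATA_DICTIONARY))    # 68
print(numeric_fields(DATA_DICTIONARY))  # ['item_count', 'total_amount']
```

Note what this dictionary leaves out: nothing in it says who created a purchase order or how long it must be retained.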
For this reason, many archivists and records managers are now adopting these concepts to describe electronically recorded information, and then elaborating on these issues to incorporate a more robust set of descriptive data elements that suits their needs. This "superset" of data elements will allow both information systems developers and information managers to interact better during the design and creation of new software systems. Many of the concerns of information managers arise from the organizational, regulatory, and environmental factors governing the use and disposition of electronic information residing in computer databases. These factors are often not of immediate interest to software designers who are mostly concerned with computer system internals or a user interface--maximizing data in and data out is their focus. Users and managers of information have additional information description requirements for meeting their professional responsibilities. These additional requirements call for being able to document who created the data, when it was created, if the data has ever been changed, and whether or not the information is considered vital to the originating organization. These information documentation requirements are rarely of interest to software programmers or systems designers who assume it is natural for data to be constantly changing.
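The additional requirements listed above (who created the data, when, whether it has changed, and whether it is vital) could be captured in a structure like the following Python sketch. The class RecordProvenance, its fields, and the sample names and dates are hypothetical, chosen only to show how little is needed to record this context alongside the data.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RecordProvenance:
    """Hypothetical provenance metadata kept alongside a record."""
    creator: str
    created_at: datetime
    vital: bool = False
    changes: list = field(default_factory=list)  # (timestamp, editor) pairs

    def log_change(self, editor, when):
        """Note that `editor` modified the record at time `when`."""
        self.changes.append((when, editor))

    @property
    def has_been_changed(self):
        return bool(self.changes)


# Sample values are invented for illustration.
memo = RecordProvenance(
    creator="records management department",
    created_at=datetime(1995, 3, 1, 9, 30),
    vital=True,
)
print(memo.has_been_changed)  # False

memo.log_change("department head", datetime(1995, 3, 2, 14, 0))
print(memo.has_been_changed)  # True
```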
CREATING CUSTOM TOOLS
To design a generic method of describing all information that applies to both electronic environments and hard copy paper documents requires establishing parameters for information descriptions that apply across all relevant computer system platforms and operating business environments. Such parameters will be similar across business activities, with some unique parameters designed to capture particular information aspects that are of significant interest in special settings. Information is valued based not only on the data content, but also on the perspective or view from which the information was generated. An electronic mail message that says "Please discard all old procedures manuals" can have a very different importance and impact depending on who issued the memo. Did it originate from executive management, the records management department, or a new (and uninformed) department head? This contextual information must be captured to fully appreciate the value of the message. Who issued the message, when was it issued, why was it distributed, and who received the message? All of these factors affect the authenticity and value of that communication.
Managing information to meet all organizational and governmental requirements will require new methodologies for documenting records creation and use throughout their life cycle. Unfortunately, many of these tools are only in a rudimentary state of development in today's software environment. There is no common set of accepted data elements that can be used to describe a paper letter, a corporate report, data in a computer database, or an electronic mail message. Common tools that are used to identify information on a computer's disk drive are user-assigned filenames or keywords assigned by a particular software package. For instance, a microcomputer using the MS-DOS operating system would allow assigning a filename like BDGT95RM.XLS to a spreadsheet (XLS Excel filename extension) that was the budget information (BDGT) for 1995 (95) for the records management (RM) department. However, unless one was already familiar with the abbreviated nomenclature used to name this file, it is unlikely that one could decipher the content of the file without loading the file onto a computer for an actual visual inspection. Some modern software will also allow adding keywords, titles, or abstracts that are searchable by a computer user. For the most part, these do not provide enough descriptive fields to designate a records series, document recipient, media format, or many of the other important aspects of records description. In addition, these fields can be filled out by any user or not filled out, as suits their purposes. The metadata is not required by the software before the record is saved to computer disk or tape. Most individuals using computers today simply ignore the available fields of data that could be used to start an organized approach to filing electronic records. And if they do develop an approach to organizing their computer-based records, they are probably the only individuals who understand their system of description.
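The filename example can be made concrete with a short Python sketch. It assumes a hypothetical naming convention of four subject characters, two year digits, and two department characters, plus small lookup tables for the abbreviations. None of this is part of MS-DOS itself, and the mapping of "95" to 1995 is also an assumption.

```python
# Hypothetical lookup tables; a real organization would maintain its own.
SUBJECT_CODES = {"BDGT": "budget"}
DEPARTMENT_CODES = {"RM": "records management"}
FORMAT_CODES = {"XLS": "Excel spreadsheet"}


def decode_filename(filename):
    """Decode an 8.3 filename that follows the assumed SSSSYYDD.EXT pattern."""
    stem, _, extension = filename.upper().partition(".")
    if len(stem) != 8 or not stem[4:6].isdigit():
        raise ValueError(f"{filename!r} does not follow the naming convention")
    subject, year, department = stem[:4], stem[4:6], stem[6:8]
    return {
        "subject": SUBJECT_CODES.get(subject, "unknown"),
        "year": 1900 + int(year),  # assumes a twentieth-century date
        "department": DEPARTMENT_CODES.get(department, "unknown"),
        "format": FORMAT_CODES.get(extension, "unknown"),
    }


print(decode_filename("BDGT95RM.XLS"))
# {'subject': 'budget', 'year': 1995, 'department': 'records management',
#  'format': 'Excel spreadsheet'}
```

The decoding works only because the lookup tables exist. A reader without them sees "unknown" for every code, which is the position of anyone who inherits files named under a private convention.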
Some software presently offered on the market as "groupware" runs on local area networks and is beginning to offer tools that can be used to document the flow of information between computer users. This class of software promotes group work schedules, electronic mail, forms transmission, and rudimentary document image transmission. It is possible in this kind of networked computing environment to set up some systems controls that will identify aging records by types, designate records transmitted to certain recipients as vital, or offer standard document description tools to users. If these document life cycle controls are implemented, they can foster compliance with some established document creation and retention procedures. An attempt to justify the use of these tools to users and managers will encounter some challenges. However, it is possible through these communications to infuse some document management principles back into the overall life cycle of records creation, transmission, and destruction. Solutions that have been proposed in the past, such as computer filing guidelines for single-user computers, have seen few success stories, as there is no easy way to monitor compliance or maintain contact with the isolated users.
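A systems control that identifies aging records by type might look something like the following Python sketch. The retention table, the function name, and the sample dates are assumptions for illustration; the five-year period echoes the "retain for five years" example used earlier, and no particular groupware product is implied.

```python
from datetime import date

# Hypothetical retention schedule: record type -> years to retain.
RETENTION_YEARS = {"internal correspondence": 5}


def retention_expired(record_type, closed_on, today):
    """True once `today` reaches the end of the retention period that
    started when the record was closed."""
    years = RETENTION_YEARS[record_type]
    try:
        cutoff = closed_on.replace(year=closed_on.year + years)
    except ValueError:
        # closed_on is February 29 and the target year is not a leap year.
        cutoff = closed_on.replace(year=closed_on.year + years, day=28)
    return today >= cutoff


print(retention_expired("internal correspondence",
                        date(1995, 12, 31), date(2000, 12, 30)))  # False
print(retention_expired("internal correspondence",
                        date(1995, 12, 31), date(2000, 12, 31)))  # True
```

A check like this is only as reliable as the record type and closing date supplied to it, which is why the descriptive fields need to be captured when the record is created.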
METHODOLOGY DEVELOPMENTS
Over the last few years, the issues related to protecting and managing electronically stored data have begun to leak into the public press.(7,8) Although these issues have been presented and debated for years by records managers and archivists, there is a new interest in this area that may finally allow the development of some integrated methodologies for information management within automated systems. As general management and computer systems users begin to realize that they are risking the loss of years of productive labor every time a computer system's information exists without proper management and systems controls, they will become increasingly interested in supporting any efforts that can protect their information assets.
There are several initiatives that should be followed to benefit from ongoing research and development efforts. These activities will provide answers to today's questions about how to manage electronically recorded information. Such efforts include research at the University of Pittsburgh conducted by David Bearman and Richard Cox and funded by the National Historical Publications and Records Commission. This work is directed at defining the "content and importance of recordkeeping functional requirements for archivists, records managers, and other information management professionals working with electronic recordkeeping systems."(9) Records must be tracked in a manner that enables compliance with organizational directives, assurance of record integrity, and adequate capturing of information to support the identification, preservation, accessibility, and usefulness of the records. Another effort to define which data elements can be used to describe information objects is being led by Bruce K. Rosen of the National Institute of Standards and Technology (NIST). This interest centers around a need for possible "development of a Federal Information Processing Standard for the data elements--their identification, representation, arrangement, and object binding."(10) Their focus is to create a set of data elements that will comprise a Record Description Record (RDR). NIST hopes that by "applying the standard to document management or object repository software products, it will become possible to use these products to manage non-electronic records stored externally in addition to the electronic information objects stored in and under the control of the document management or repository products." What this means is that if the RDR is sufficiently generic, it could be used as metadata to describe all information without concern about the media upon which the data are stored.
Another investigation that parallels these activities is the effort by the Department of Defense Records Management Task Force to write a Request for Proposal that addresses standards for procurement of Electronic Records Management software. Forty-seven functional requirements have already been identified that will determine the operating functionality of this software. These include what activities will be required to destroy records, transfer records, and search for records, as well as some record attributes (as in data elements) to identify appropriate filing categories. This software is intended to automate many records management functions.
A MOVING TARGET
What do these developments mean for records managers? The form and content of business records and how they are to be managed are changing daily. The development of sophisticated means of assuring that both electronic and paper-based documents can be tracked and managed more accurately will be a tremendous aid in enabling records managers to perform their activities more professionally and more thoroughly. Records managers who understand how these concepts are important and can apply them within their work settings will be highly valued as a part of an increasingly electronic workplace. Records managers who expect to continue functioning as document "custodians" by physically storing and transporting those records could find their jobs automated out of existence. There are simply too many business incentives to stop the race to an eventual paperless office.
Understanding concepts such as these is vital to the professional futures of records managers. As information systems are modeled, designed, and built, records managers must understand these terms and concepts to be able to participate in creating the systems that they will be using to perform their work. "Models are an abstraction device. They help us differentiate the forest from the trees. As such, they must be sufficiently powerful to give us some understanding of the objects in our world and how they relate. Models are not used just to enhance our understanding and abstract detail. When automated, they define our reality."(11)
In addition, the principles that are used by all information managers are beginning to merge. Records are becoming "virtual documents," business procedures are being programmed into automated systems, and there is an expectation that all information must be managed according to the same standards in an integrated manner. This is all well and good for the enterprise that produces goods and services to pay salaries. It will also be good for flexible information professionals who want to take advantage of new opportunities to learn how they can best support the new technology driven environments in which they work and live.
REFERENCES
1. David Allen Pollock, "Data Elements: A Comprehensive Approach--Part 1," Data Base Newsletter, 14(3), May/June 1986, p. 7.
2. George McDaniel, IBM Dictionary of Computing, McGraw-Hill, Inc., New York, August 1993, p. 169.
3. Pollock, p. 8.
4. James Martin, Strategic Data Planning Methodologies, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1982, p. 127.
5. McDaniel, p. 431.
6. Charles J. Wertz, The Data Dictionary: Concepts and Uses, QED Information Sciences, Inc., Wellesley, Massachusetts, 1989, p. 69.
7. Jeff Rothenberg, "Ensuring the Longevity of Digital Documents," Scientific American, 272(1), January 1995, pp. 42-47.
8. Terry Cook, "It's 10 O'Clock: Do You Know Where Your Data Are?" Technology Review, 98(1), January 1995, pp. 48-53.
9. Richard J. Cox, University of Pittsburgh Recordkeeping Functional Requirements Project: Reports and Working Papers, University of Pittsburgh, Pittsburgh, PA, September 1994.
10. Notices, National Institute of Standards and Technology, Federal Register, 60(39), February 28, 1995, p. 10833.
11. James Martin and James J. Odell, Object-Oriented Analysis and Design, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1992, p. 489.
Copyright Association of Records Managers and Administrators Inc. Oct 1995