
Article Information

  • Title: Adobe Acrobat Capture 2.0 - includes related article on product details
  • Author: Robert J. Boeri
  • Journal: Event DV
  • Print ISSN: 1554-2009
  • Year of publication: 1998
  • Volume: April 1998
  • Publisher: Online, Inc.

Adobe Acrobat Capture 2.0 - includes related article on product details

Robert J. Boeri

Whether you're doing cutting-edge "knowledge management" or simply populating an intranet site or document management system, at some point an unruly mass of paper stands between you and the fluid information infrastructure you seek. Scanning paper images into a document collection alone will not solve the problem. Although straightforward imaging preserves the look and feel of originals, the image files the process creates are too large and unwieldy to be easily searched. Optical Character Recognition (OCR), achieved via OCR tools like Adobe Acrobat Capture, is the solution to your problems.

OCR software processes scanned bitmapped images of documents and converts the text images to actual text. In real-life scanning applications, OCR is much more complex than converting images to text. OCR software often attempts to preserve rudimentary document structures like paragraphs and bullet lists. Preserving fonts or even text size is frequently beyond the ken of OCR software, and OCR systems usually ignore graphics or halftones.

When legacy text is in a common monospace font like Selectric Courier and pages are not skewed, OCR can often deliver accuracy as high as 99.99 percent, or roughly one error in every 10,000 characters. Smudges, fading text, rubber-stamped markings, signatures, and handwritten notes all conspire to lower the accuracy of OCR systems. With a typical real-world mix of page blemishes, OCR systems often deliver less than 90 percent accuracy. OCR systems also usually ignore graphics, and tend to disregard text outside the 8-to-24 point range. Throw in some complex page layouts and unusual fonts, and accuracy drops even lower.
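To put these percentages in perspective, here is a small back-of-the-envelope sketch in Python. The figure of 2,000 characters per page is an assumption chosen for illustration, not a number from the tests described in this review, and accuracy is treated as a per-character rate.

```python
def expected_errors(chars_per_page: int, accuracy_percent: float) -> float:
    """Expected number of misrecognized characters on one page."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return chars_per_page * error_rate


# Assumption: a typical typed page holds about 2,000 characters.
CHARS_PER_PAGE = 2000

for accuracy in (99.99, 99.0, 90.0):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy}% accurate: about {errors:.1f} errors per page")
```

Under that assumption, the gap between a clean page and a blemished one is large: 99.99 percent accuracy means an error on roughly one page in five, while 90 percent accuracy means about 200 errors on every page.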

With some caveats, Adobe Acrobat Capture raises the bar for OCR. Capture makes impressive strides in delivering documents in Adobe's popular Portable Document Format (PDF) that are essentially like the originals and can be searched, deployed on intranets, and managed in ways heretofore impossible. By producing files that look like the paper originals, regardless of their layout complexity, Acrobat Capture offers far more than the ordinary OCR software package.

A Capture plug-in bundled with Acrobat Exchange provides low-volume OCR to PDF. If you already own Adobe Acrobat software, you have a surprisingly robust preview of Adobe Capture. According to Adobe engineers, the Capture plug-in is nearly as accurate as the original release of Adobe's full Capture system. If a commercial-grade OCR operation producing PDF renditions of pages is what you need, you'll quickly see the value of moving up to the full Capture system.

WHAT'S NEW IN CAPTURE 2.0

Adobe Acrobat Capture 2.0 offers improved page recognition over version 1.0, recognizes documents in eight languages, and automatically reduces the resolution of images in documents. Improved page recognition means that documents with highly formatted content such as forms, tables, or irregular layouts can be processed.

For the first time, Capture includes an application programming interface (API) and software development kit (SDK); these are available on the Capture CD-ROM media shipping with the product. Using the Capture API, customers, integrators, and third-party developers can now build integrated, customized solutions for OCR applications such as 24-hour FAX Server to PDF, network scanning, integration with document management systems, and high-volume conversion. Capture's API has allowed Cornerstone Imaging and Documentum, among others, to integrate Acrobat Capture into their products.

HOW CAPTURE WORKS, HOW IT TESTED

When you start Acrobat Capture, you are presented with the comprehensive main screen, which offers several options. Users can define three main process flows: scanning images to folders, scanning images and processing them, or processing images that have already been scanned. You can also define several folders to contain images that will be OCR'd; a watched folder, for example, will receive scanned images at any time from one or more scanners. If you set up Capture to run as a background task, Capture will OCR the image when a watched folder receives an image. The right panel of the main screen allows you to review OCR'd results (for interactive editing to correct errors), or output OCR'd results to PDF, HTML, or to one of many word processor formats. Trash-can icons also provide a quick way to delete image files and act as a convenient reminder to do so; even though RGB images presented to Capture cannot exceed 400 dots per inch (dpi), those images can get very large, in the tens of megabytes.
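The watched-folder idea can be sketched in a few lines of Python. This is not Capture's implementation or its API; it is a generic polling loop, and the names `find_new_images`, `watch`, and the `process` callback are hypothetical stand-ins, with `process` playing the role of the OCR step. The sketch also assumes scanners drop TIFF files into the folder.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set

# Assumption: scanned images arrive as TIFF files.
IMAGE_SUFFIXES = {".tif", ".tiff"}


def find_new_images(folder: Path, seen: Set[Path]) -> List[Path]:
    """Return image files in `folder` that have not been returned before."""
    new_files = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES and p not in seen
    )
    seen.update(new_files)  # remember them so each image is processed once
    return new_files


def watch(folder: Path, process: Callable[[Path], None],
          poll_seconds: float = 5.0, max_polls: Optional[int] = None) -> None:
    """Poll `folder` and pass each newly arrived image to `process`."""
    seen: Set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for image in find_new_images(folder, seen):
            process(image)  # hypothetical stand-in for the OCR step
        polls += 1
        time.sleep(poll_seconds)
```

Calling `watch(Path("scans"), print, max_polls=3)` would, for example, print the path of each TIFF that appears in a `scans` folder over three polling passes. A production setup would also need to confirm that a file has finished being written before processing it.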

If you want to correct OCR errors, simply select the Preview output option. Select the resulting ".ACD" preview file and Acrobat Previewer becomes active, highlighting words the program suspects it may have OCR'd incorrectly. Previewer will then show you the original bitmapped image it had to work with, and allow you to correct the suspected text if you wish. Since Acrobat 3.0 PDF files can be edited with Acrobat Exchange's text touch-up tool, you can always correct these files later in Exchange. The PDF file retains the flagged suspected errors and bitmapped images.

Given the complex nature of OCR, gauging its accuracy is quite difficult. The Capture package was tested with samples including simple Courier text (but with the page skewed), clipped newspaper articles, and poor-quality originals with complex page layouts. Capture used its unique page structure recognition ability to straighten out skewed pages and preserve complex layouts. Resulting OCR'd PDF files ranged from 1/10 to 1/100 the size of the original TIF images. For example, a scanned newspaper clipping was 4.7MB in size (as an uncompressed TIF), but once OCR'd and rendered into PDF, the file shrank to only 59KB.
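The newspaper-clipping figures can be checked with a quick calculation. The sketch below assumes decimal units (1MB = 1,000,000 bytes, 1KB = 1,000 bytes); the function name `reduction_ratio` is purely illustrative.

```python
def reduction_ratio(original_bytes: float, converted_bytes: float) -> float:
    """How many times smaller the converted file is than the original."""
    return original_bytes / converted_bytes


# Assumption: decimal units; binary units shift the result only slightly.
MB = 1_000_000
KB = 1_000

ratio = reduction_ratio(4.7 * MB, 59 * KB)
print(f"The PDF is about 1/{ratio:.0f} the size of the TIF")
```

The result is a reduction of roughly 80 to 1, which sits toward the high end of the 1/10 to 1/100 range reported for the test samples.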

IS ANYTHING WRONG WITH THIS PICTURE?

Dazzling OCR value that it is, even Capture 2.0 has areas for improvement. For example, Capture's conversion to word processing and HTML formats is disappointingly inaccurate. Tests bore out this conclusion: Acrobat Capture is more effective at generating Acrobat PDF files than it is at converting documents to HTML or word processing formats. For example, WordPerfect conversions converted nearly every block of text (including paragraphs) into text boxes instead of simple paragraphs. HTML conversions, however, were easily read and displayed by Internet Explorer, including correctly placed graphics. And of course, any OCR error the Capture server made would appear consistently in Acrobat or word processor renditions. Complex structures (like bullet lists or tables) were not converted correctly. However--in fairness to Capture--it is simply impossible to determine some underlying word processor structures. For example, text inside a box could have been entered as a text box or as a one-celled table.

Adobe has recently indicated that getting PDF renditions correct--and keeping them that way as the PDF format evolves--is its primary design goal. However, Adobe engineers say they know formats other than PDF need improvement and will become much better as the product evolves.

RELATED ARTICLE: Adobe Acrobat Capture 2.0

Synopsis: Adobe Acrobat Capture 2.0 raises the bar for Optical Character Recognition (OCR) software. Capture makes impressive strides in delivering Acrobat PDF results which are essentially like the originals and can be searched, deployed on intranets, and managed in ways heretofore impossible. By producing files that look like the paper originals, regardless of their layout complexity, Acrobat Capture offers far more than the ordinary OCR software package.

Prices: $895 MSRP; includes one software license and the ability to OCR an initial 20,000 pages (counted by a mechanical dongle attached to the parallel port)

System Requirements: 486 or Pentium PC running Windows 95, NT 3.51, or NT 4.0, with 32MB RAM, a CD-ROM drive, and parallel port

Supported scanners: Most popular scanners through TWAIN and ISIS drivers; automatic document feeders and double-sided pages up to 27" x 27"

For more information, contact: Adobe Systems, Inc. 345 Park Avenue, MSW 18, San Jose, CA 95110-2704; 408/536-6000; Fax 408/357-4004; http://www.adobe.com; INFOLINK #402

Robert J. Boeri (bboeri@world.std.com) is co-columnist for INFORMATION INSIDER and Information Systems Publishing Consultant at Factory Mutual Engineering of Norwood, Massachusetts.

Comments? Email the editor at letters@onlineinc.com or check the masthead for other ways to contact us.

COPYRIGHT 1998 Online, Inc.
COPYRIGHT 2000 Gale Group
