Document Handling White Paper

Table of Contents:

Purpose

Introduction

Human-Like Segmentation

Threshold Segmentation

Edge Detection Segmentation

Electronic Documents

Current Implementations

Emerging Technology

Test Results
Black and White Text
Document v Images
Original Image
After Segmentation
TIFF G4 Distortion
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument® (a registered trademark of LuraTech)
djvu
Test Notes

Colored Text
Document e Images
Original Image
After Segmentation
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument
djvu
Test Notes

Document r Images
Original Image
After Segmentation
TIFF G4 Distortion
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument
djvu
Test Notes

Document p Images
Original Image
After Segmentation
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument
djvu
Test Notes

Document m Images
Original Image
After Segmentation
TIFF G4 Distortion
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument
djvu
Test Notes

Photographs
Photo ros Image
Original Image
After Segmentation
High Resolution JPEG
Medium Resolution JPEG
Low Resolution JPEG
TIFF G4
LuraDocument
djvu
Test Notes



Purpose

    In 2005, document handling was a $19 billion industry (per IDC), yet (per ImageTag) "Current paper-to-digital solutions capture less than 1% of the paper headed for the file cabinet." Digital documents have compelling cost, access, speed, organization, durability, efficiency, environmental, competitive, discovery, and other advantages over paper documents, but conventional technology offers either a good image or good compression, not both at once, which leaves no feasible solution. In this paper, we analyze and demonstrate the methods needed to get both quality images and unprecedented compression at the same time.


Introduction

    Paper-based filing systems are several thousand years old and found in every office across the world, yet digital documents would offer many clear advantages over paper-based filing systems if they simultaneously exhibited good compression and quality images.

1. Cost: Digital documents are nearly free, and they are getting cheaper with each passing day. Hard drives and other storage media keep increasing in capacity while their prices stay the same. With state-of-the-art compression, such as Pac-n-Zoom, a 10 page document only requires about 10 KBytes, or 1 KByte per page. A 250 GByte hard drive currently costs about $80.00, so one page can be stored for $80.00 / 250 GBytes × 1 KByte = 32 millionths of a cent, and it can be backed up for much less than that. By comparison, it costs about 0.8 cents for the paper and another 2.4 cents for the ink to print a sheet of paper. This means it costs about (0.8 + 2.4) / 0.000032, or 100,000 times, as much to print a page as it does to write it to a disk. Copying all of these papers consumes an average of 3% of all revenue generated by US businesses; if we assume 20% margins, then office copies cost us 15% of our income. Besides having numerous other competitive advantages, digital documents are much cheaper.
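The cost arithmetic above can be checked directly. The prices and sizes below are the figures quoted in this section (an $80 drive, roughly 1 KByte per compressed page, 0.8 cents of paper plus 2.4 cents of ink per printed sheet), not independent data:

```python
# Storage-cost arithmetic from the Cost paragraph above.
DRIVE_PRICE_DOLLARS = 80.00
DRIVE_CAPACITY_KB = 250e9 / 1e3      # 250 GBytes expressed in KBytes
PAGE_SIZE_KB = 1                     # ~1 KByte per compressed page

# Cost to store one page, in cents
digital_cents = DRIVE_PRICE_DOLLARS * 100 / DRIVE_CAPACITY_KB * PAGE_SIZE_KB
paper_cents = 0.8 + 2.4              # paper + ink per printed sheet

print(f"digital: {digital_cents:.6f} cents/page")                   # 0.000032
print(f"paper/digital ratio: {paper_cents / digital_cents:,.0f}x")  # 100,000x
```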

2. Access: The value of a document is realized when someone views it. The cost of copying a digital document is less than a billionth of a cent, and the copies can be shipped simultaneously, at the speed of light, to most places on all seven continents. With today's mobile and far-flung work force, it is increasingly unlikely that a needed document will be in the same office as the employee who needs it. There are also many times when files need to be viewed by people outside the company. Digital storage can provide cheap and manageable access for the IRS, the SEC, or civil subpoenas, and by providing access to vendors and other outsourcing agents, many processes can be optimized. Digital documents can be served to almost anyone, anyplace, for next to nothing.

3. Speed: Any accountant can tell you that time is money. In a well-designed system, it takes less than two seconds to click on a link and retrieve a digital document, but on average it takes about 20 minutes to search for, retrieve, deliver, and re-file a paper document. There is no need to put the digital document back. Taken together, these facts make digital files about 600 times faster than paper files, even when the paper files are in an ideal setting, which is an increasingly bad assumption. Faster access means a quicker response to the customer's needs, which translates to greater profits.

4. Organization: Even if a filing clerk is able to label the files in a way that lets them be found, there is almost no chance that the appropriate and necessary cross-references are included. Very few documents have the luxury of starting at the beginning and telling the whole story through to the finish. Most documents are a thread in a tapestry of thought, and without the appropriate cross-references, most paper documents are out of context. The author of a digital document can link in cross-references, so the organization is not left to a filing clerk who probably doesn't know what they should be. Without these vital links, 60% of employees waste more than 60 minutes each day duplicating work that has already been done (from www.openarchive.com). Paper files are so much trouble that the average disorganized office manager has 3,000 documents just "lying around" (from a study by U.S. News & World Report). The costs from late fees, premium prices, and other chaos expenses can eat up to 20% of the entire budget, and 80% of a company's information exists as unstructured data scattered throughout the enterprise (from KMWorld). Ironically, the paper files are the most disorganized part of many organizations. Digital files are easier to find, integrate, and organize.

5. Durability: (From World-scan.com) "More than 70% of today's businesses would fail within 3 weeks if they suffered a catastrophic loss of paper-based records due to fire or flood." For their part, digital documents tend to survive such unforeseen events; for example, nearly all the paper documents were lost in the World Trade Center tragedy, but almost all the digital documents survived. It doesn't take a catastrophe to lose paper: 22% of all documents get lost and 7.5% are never found, which wipes out 15% of all profits. Digital files, by contrast, are not normally removed from the filing cabinet, which dramatically reduces the chance of losing a document.

6. Efficiency: In many cases, it is easier to outsource a corporate function than it is to "reinvent the wheel" in-house. The primary problem of small companies (according to the Boulder County Business survey of companies in Boulder County) is the handling of government paperwork, and their second largest problem (according to the same survey) is the handling of personnel. A company could specialize in handling government paperwork or personnel issues and be far more efficient than every company handling the same paperwork itself, but the specialized companies would need access to the necessary files, which digital files allow. Digital files also allow managers to easily verify the existence and accuracy of all the paper trails, which is nearly impossible with paper files. In these and many other ways, digital files make a company much more efficient than paper files.

7. Environmental: In 1993, US businesses used 2 million tons of copy paper, and by 2000 this waste had grown to 4.6 million tons, or more than 92 billion sheets of paper (Document Magazine). Since it takes 17 trees to make a ton of paper, the US used 78 million trees' worth of paper in 2000. To make matters worse, the use of paper is constantly and quickly increasing: as shown above, it more than doubled within 7 years, and paper already accounts for 40% of the municipal solid waste stream. Digital copies leave almost no environmental scars.

8. Competitive: A winning team plays together. While businesses increasingly organize and automate around the computer, paper documents resist efforts to increase productivity. When a company automates its paper processes, it gains a clear advantage over its competitors. Competitive companies are trying to move faster. For example, law firms manage millions of pages of documents, and it is imperative to a court case that the right documents and case files are available to the right person at the right time. To quote Ali Shahidi of Alschuler, Grossman, Stein, and Kahan LLP, a Santa Monica law firm, "We're doing things we couldn't have imagined a few years ago. We're smarter, better, and more nimble." At present, 99% of paper documents cannot be analyzed, automated, or organized by a computer. The paper part of the office is a remnant of the last century, and it doesn't allow a company to move forward with the modern techniques of the information age.

9. Discovery: As the Internet has proven, digital information is the easiest information to find. In a typical office, when an important document is viewed by an employee, there is a significant chance that the document will be lost forever; the average white-collar worker spends 1 hour each day looking for lost documents (from Esselte). Since digital documents are not removed from the filing cabinet, they are unlikely to be lost. With paper, the searcher may not even know what label a file was filed under, but digital documents can contain links, so they are easier to cross-reference and keep in context. Moreover, the text of most digital documents can be recognized by the computer, which allows the searcher to search for phrases inside the document. Digital documents are easier to find than paper files because they are unlikely to be lost and their text is searchable by a computer.

    As we have shown, digital files have many advantages over paper, but only about 1% of the files in the filing cabinet have been digitized. With current technology, it is not practical to digitize a large percentage of these documents: users can get a good image (e.g., JPEG) or good compression (e.g., TIFF G4), but they cannot get both a high quality image and a small file size at the same time.

    If the text is big and black and the paper is clean and white (with no writing), a threshold segmenter followed by a statistical compressor, such as TIFF G4, yields file sizes that are useable (if a little annoying) on a LAN. But threshold segmentation is not the "silver bullet" people need, and the resulting images have fax-like quality. If we were looking at the yellow carbon copy of a receipt, much of the information would be lost. With threshold segmentation, TIFF G4 has adequate compression but poor image quality.

    JPEG has moderately good image quality because it skips segmentation altogether and simply performs a discrete cosine transform (with Huffman encoding) on the image. The quality comes at a price: since all the noise is left in the image, the compression is relatively small. The user will have to wait a long time for the file to be served, transmitted, and loaded; if the file were being transmitted over the Internet, the user could easily be waiting tens of seconds to view a single page. Without segmentation, JPEG has a good image but poor compression.
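As a sketch of what JPEG's transform stage does (not the full standard: no zigzag ordering, no Huffman coding, and a crude uniform quantizer in place of JPEG's quantization tables), a naive 8x8 DCT on a flat gray block shows why smooth regions compress well while noisy ones do not:

```python
import math

def dct2(block):
    """Naive 8x8 2-D DCT-II, the transform JPEG applies to each block."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            au = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
            av = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
            out[u][v] = au * av * s
    return out

# A flat gray block: all of its energy lands in the top-left (DC)
# coefficient, so every other coefficient quantizes to zero.
flat = [[128] * 8 for _ in range(8)]
coeffs = dct2(flat)
quantized = [[round(c / 16) for c in row] for row in coeffs]  # crude quantizer
zeros = sum(row.count(0) for row in quantized)
print(zeros)  # 63 of the 64 coefficients are zero -> highly compressible
```

Scanner noise breaks this pattern: it scatters energy into the higher-frequency coefficients, so fewer of them quantize to zero and the file stays large.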

    To move the files from the filing cabinet to the computer, we need a solution that has both a good image and high compression.


Human-Like Segmentation

    Threshold segmentation is usually not good enough to handle the spectrum of document needs a business has.

    For example, a common receipt might have blue printing on yellow paper with a red number in the upper right-hand corner. It is not uncommon for some of the blue printing to be very fine while the issuing company's name is in big bold print. People often write and stamp on the receipt. The colors they use don't have to conform to those on the receipt. The receipt might be a carbon copy with the attendant image degradation. The handling of the receipt might be an issue. It might have gotten dirty or crumpled, and these are only a small sample of image degradation possibilities.

    A simple receipt can challenge the computer's ability to store the image without loss while achieving useable compression, and there are many applications (such as X-rays, drawings, and others) that are even more difficult.

    Threshold segmentation is fast and easy, but the quality of segmentation often falls short of what is needed to do the job. In fact, the standard that everyone is held to is human-like segmentation. If people can't see what is on the receipt, then the receipt carries the blame.

    Few things shed blame as well as humans. A compression project that segments worse than a human will dam the river of blame and divert it through the IT department.

    Maybe the accounts payable person did write "paid" on the invoice that was paid twice, or maybe the necessary scrawl was omitted. Either way, the computer will be blamed if it has a reputation for missing such things. To avoid these accusations, the document imaging system needs to segment like a human (or better).

    This means that threshold segmentation is not good enough except in certain conditions where image quality can be guaranteed. In fact, the only acceptable segmentation for most of the paper in the office is human-like segmentation. Humans set the standard.

    When the computer industry started moving paper documents onto the computer, segmentation was used to create a higher contrast (or sharper) image. A segmented image is more easily compressed because there are fewer artifacts to compress; segmentation removes many small defects (which we call noise) from the picture.

    These are all convenient reasons to use segmentation, but, unlike our conventional document imaging methodology, human perception requires segmentation in order to achieve extraction.

    When we use our senses to perceive something, we are actually performing several steps. Since we use our senses so much, these steps have become second nature to us, and may even be performed in the subconscious.

    The first step, which we call segmentation, is that of grouping like shades together. Let us use a black letter 'e' on a white background as an example. The black would probably be many different shades up and down the 'e' (here is a typical example of a black 'e' scanned from a white sheet of paper), but we would need to group all of them together before we could recognize the letter 'e'.

    While segmentation might seem as if it adds artificiality to the picture, recognition requires segmentation. In other words, if we want the picture to mean anything to us, we will have to segment it in our head.

    The process of finding the shape (or any feature) of an object is called extraction. The extracted shape is compared to memory (probably a database, though not necessarily limited to template matching). If we can match the shape to a shape stored in memory, we recognize it. In our example, the segmentation must bring the 'e' in as a single region, or we will extract a shape not recognized by the database. In other words, when parts of a letter are missing, it is difficult to identify the partial letter.
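The grouping step that must precede extraction can be sketched as a connected-component pass. The grayscale patch and the mid-gray "like shade" test below are hypothetical illustrations, not the method of any particular product:

```python
from collections import deque

# Hypothetical 6x5 grayscale patch: a dark glyph in several shades
# (45-65) on a light background (200+), mimicking a scanned letter.
img = [
    [210,  60,  55,  62, 205],
    [ 58, 215, 220,  50, 210],
    [ 52,  45,  60,  55, 208],
    [ 57, 212, 218, 222, 215],
    [ 61, 210, 225,  48, 212],
    [ 55,  52,  58,  63, 207],
]

def segment_dark(img, threshold=128):
    """Group all connected 'dark' pixels into regions (4-connectivity).
    Grouping like shades together is the segmentation step; the shape of
    each resulting region is what gets extracted and matched against memory."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] < threshold and not seen[sy][sx]:
                region, queue = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] < threshold
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions

regions = segment_dark(img)
print(len(regions))  # the glyph comes out as one connected region
```

If segmentation split this glyph into two regions, the extracted shapes would match nothing in memory, which is exactly the failure mode described above.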

    The document imaging industry's use of segmentation to achieve OCR extraction is similar, but threshold segmentation is often a fatal shortcut that prevents robust extraction. For example, a computer cannot perform OCR extraction with the accuracy a human can in a multiple-color environment.

    As computers get stronger, they can afford to forsake threshold segmentation for the more robust edge detection segmentation that is more intelligent and human-like.

    A computer with intelligent human-like segmentation isn't limited to more accurate extraction. As the computer begins to aggregate the image through edge detection segmentation, it can also achieve much better compression than it could with threshold segmentation.

    To understand how these things occur, let's backtrack to explain the details of each type of segmentation.



Threshold Segmentation

    Threshold segmentation is the simplest type of segmentation. In a simple example, if we had black text on white paper, we could set the threshold to gray. All of the text that is darker than the gray threshold would be considered black, and any background that was lighter than the gray threshold would be considered white. When the segmentation is complete, there are two colors in the picture.

    Threshold segmentation can be much more complicated than this. For example, a histogram of colors could be taken across a region or an entire picture; the most predominant colors can be considered the foreground and background, and the threshold can be set at the middle color between them. Of course, as threshold segmentation becomes more complicated, it runs slower.
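A minimal sketch of the histogram variant just described, with hypothetical pixel values (the two most common gray levels are taken as foreground and background, and the threshold sits midway between them):

```python
from collections import Counter

# Hypothetical grayscale pixels: dark text (~12) on a light page (~240).
pixels = [12, 15, 12, 14, 12, 240, 238, 240, 240, 13, 241, 240, 12, 239]

histogram = Counter(pixels)
(fg, _), (bg, _) = sorted(histogram.most_common(2))  # darker shade first
threshold = (fg + bg) // 2

segmented = ["black" if p <= threshold else "white" for p in pixels]
print(threshold)       # midpoint between the two dominant shades: 126
print(set(segmented))  # only two colors remain after segmentation
```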

    In a standard 8.5 inch by 11 inch sheet of paper scanned in at 300 dots per inch with all three primary colors, we have 8.5 * 11 * 300 * 300 * 3 or 25,245 KBytes of data, if we assume 8 bits per primary color. Even today's computers take a noticeable amount of time to chew through 25 MBytes of data. Therefore, threshold segmentation (typically a simple variety) is usually used.
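The raw-scan arithmetic above, spelled out:

```python
# A standard letter-size page at 300 dpi, 3 color channels,
# 8 bits (1 byte) per channel.
width_in, height_in, dpi, channels = 8.5, 11, 300, 3
bytes_total = width_in * height_in * dpi * dpi * channels
print(bytes_total / 1e3)  # 25245.0 KBytes, i.e. ~25 MBytes per page
```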

    Threshold segmentation should be considered a fast but coarse segmentation. Threshold segmentation does not address a number of segmentation problems required by human-like segmentation.

    For example, transition distortions are usually the largest distortion introduced by a scanner, and threshold segmentation does not find or rebuild the border. In many cases, a pattern of colors is mixed together to provide some information (for example a logo), and threshold segmentation will not be able to sort through the color complexities with human-like intelligence.



Edge Detection Segmentation

    To segment like a human, the segmenter needs to mimic the human process of edge detection segmentation. For about 30 years, people have been trying to achieve high quality with edge detection segmentation, but it has only recently been accomplished.

    Edge detection segmentation does a better job of supporting image restoration. The edge of a blob is distorted in a variety of ways, and the edge needs to be discovered before it can be restored.

    In the past, most edge detection segmentation was too coarse to be used in document imaging, but computers can theoretically handle edge detection better than humans. Humans can only see about 200,000 colors, but computers typically work with 16 million colors. The optoelectronic sensing technology is capable of many more.

    When the computer does a better job of segmenting, it can do a better job compressing. If the picture is over segmented, the baby is thrown out with the bath water. In other words, the data washes out along with the noise. With better segmentation, more noise can be corrected while leaving the data.

    The input distortion from the optoelectronic capture appliance (usually a scanner) prevents high levels of compression in color documents; so a big size and quality difference can currently be found between color documents created on a computer and those scanned from a paper.

    For example, a scanned color document can be compressed about 3 times with a statistical compressor, but the same file created on the computer could be compressed about 100 times. Furthermore, the computer generated document would be much cleaner and clearer.

    Edge detection segmentation is a much more complicated and compute-intensive segmentation technique. At first, edge detection might seem simple, but it is complicated by several factors.

1. Continuous Tone: We may not be segmenting along a clear color transition. In fact, the image could be continuous tone with nearly unlimited color variations.

2. Image Distortion: Clear color transitions are usually smeared 4 or 5 pixels in two directions. Finer details become blurry, yet humans are able to segment through some of this blur, and they expect the computer to segment at the same level.

3. Fine Artifacts: Text (of a specific font) comes with a finite set of artifacts, and they have a minimum size. Artifacts below the minimum can be ignored in text, but they must be segmented in continuous tone.

    As we would expect, edge detection segmentation takes much longer than threshold segmentation.
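As a flavor of the extra computation involved, here is a classic Sobel gradient pass, which is only the first ingredient of full edge detection segmentation, on a hypothetical patch with one vertical light-to-dark transition:

```python
# Minimal Sobel edge-detection sketch on a tiny grayscale patch.
img = [
    [200, 200, 50, 50],
    [200, 200, 50, 50],
    [200, 200, 50, 50],
    [200, 200, 50, 50],
]

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal-gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical-gradient kernel

def sobel(img):
    """Gradient magnitude (|gx| + |gy|) at each interior pixel."""
    h, w = len(img), len(img[0])
    mags = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            mags[y][x] = abs(gx) + abs(gy)
    return mags

mags = sobel(img)
print(mags[1])  # strong response at columns 1-2, where the edge sits
```

Unlike a single threshold comparison per pixel, every output pixel here touches a 3x3 neighborhood twice, and a real edge detection segmenter must still link these responses into closed borders, which is where most of the additional cost lies.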



© 2004 - 2008 Accelerated I/O, Inc. All rights reserved.