|
|
|
Purpose |
|
In 2005, document handling was a
$19 billion industry (according to
IDC), but, according to ImageTag,
"Current paper-to-digital solutions
capture less than 1% of the paper
headed for the file cabinet."
Digital documents have cost, access,
speed, organization, durability,
efficiency, environmental,
competitive, discovery, and other
compelling advantages over paper
documents, but conventional
technology offers either a good
image or good compression, not both
at once, which leaves no feasible
solution. In this paper, we analyze
and demonstrate the methods needed
to get both quality images and
unprecedented compression at the
same time.
|
|
|
Introduction |
|
Paper-based filing systems are
several thousand years old and
found in every office across the
world, yet digital documents could
offer many clear advantages over
paper-based filing systems if they
simultaneously exhibited good
compression and quality images.
|
|
|
| 1. |
Cost: Digital documents are
nearly free, and they are getting
cheaper with each passing day.
Hard drives and other disks keep
increasing in capacity, but their
prices remain the same. With
state-of-the-art compression, such
as Pac-n-Zoom, a 10-page document
requires only about 10 KBytes,
which is 1 KByte per page. A 250
GByte hard drive currently costs
about $80.00, so 1 page can be
stored for $80.00 / 250 GBytes * 1
KByte = 32 millionths of a cent,
and it can be backed up for much
less than that. By comparison, it
costs about 0.8 cents for the paper
and another 2.4 cents for the ink
to print a sheet of paper. This
means that it costs about (0.8 +
2.4) / 0.000032, or 100,000, times
as much to print a page as it does
to write it to a disk. Copying all of
these papers takes an average of
3% of all revenue generated by US
businesses. If we assume 20%
margins, then office copies cost
us 15% of our income. Besides
having numerous other competitive
advantages, digital documents are
much cheaper.
|
| 2. |
Access: The value of a
document is realized when someone
views it. The cost of copying a
digital document is less than a
billionth of a cent, and the
copies can be simultaneously
shipped at the speed of light to
most places in all 7 continents.
With the mobile and far-flung work
force of today, it is increasingly
unlikely that the needed document
will be in the same office as the
employee who needs it. There are
many times when files need to be
viewed by people outside the
company. Digital storage can
provide cheap and manageable
access to the IRS, SEC, or civil
subpoenas. By providing access to
vendors and other outsourcing
agents, many processes can be
optimized. Digital documents can
be served to almost anyone,
anywhere, for next to nothing.
|
| 3. |
Speed: Any accountant can
tell you that time is money. In a
well-designed system it takes less
than two seconds to click on a link
and retrieve a digital document,
but on average, it takes about
20 minutes to search for, retrieve,
deliver, and re-file a paper
document. There is no need to put
the digital document back. Taken
together, these facts make digital
files about 600 times faster
than paper files, even when the
paper files are in an ideal
setting, which is an increasingly
bad assumption. Faster access
means a quicker response to the
customer's needs, which translates
to greater profits.
|
| 4. |
Organization: Even if a
filing clerk is able to label the
files in a way that the files can
be found, there is almost no
chance that the appropriate and
necessary cross-references are
included. Very few documents have
the luxury of starting at the
beginning and telling the whole
story through to the finish. Most
documents are a thread in a
tapestry of thought, and without
the appropriate cross-references,
most paper documents are out of
context. The author of a digital
document can link in
cross-references, so the
organization is not left to a
filing clerk who probably doesn't
know what the cross-references
should be. Without these vital
links, 60% of employees waste more
than 60 minutes each day by
duplicating work that has already
been done (from
www.openarchive.com). Paper files
are so much trouble that the
average disorganized office
manager has 3,000 documents just
"lying around" (from a study by US
News and Report). The costs from
late fees, premium prices, and
other chaos expenses can eat up to
20% of the entire budget. 80% of a
company's information exists as
unstructured data that is
scattered throughout the
enterprise (from KMWorld). The
paper files are ironically the
most disorganized part of many
organizations. Digital files are
easier to find, integrate, and
organize.
|
| 5. |
Durability: (From
World-scan.com) "More than 70% of
today's business would fail within
3 weeks if they suffered a
catastrophic loss of paper-based
records due to fire or flood."
For their part, digital documents
tend to survive these unforeseen
events. For example, nearly all
the paper documents were lost but
almost all the digital documents
survived the World Trade Center
tragedy. It doesn't take a
catastrophe to lose paper. 22% of
all documents get lost and 7.5%
are never found, which wipes out
15% of all profits. Digital files,
by contrast, are not normally
removed from the filing cabinet,
which dramatically reduces the
chance of losing the document.
|
| 6. |
Efficiency: In many cases,
it is easier to outsource some
corporate function than it is to
"reinvent the wheel" in-house. The
primary problem of small companies
(according to the Boulder County
Business survey of companies in
Boulder County) is the handling of
government paperwork. The second
largest problem these companies
have (according to the same
survey) is the handling of
personnel. A company could
specialize in handling
government paperwork or personnel
issues and be far more efficient
than every company doing the same
paperwork itself, but the
specialized companies would
need access to the necessary files,
which digital files allow. Digital
files allow managers to easily
verify the existence and accuracy
of all the paper trails, which is
nearly impossible with paper
files. In these and many other
ways, digital files make a company
much more efficient than paper
files.
|
| 7. |
Environmental: In 1993 US
businesses used 2 million tons of
copy paper, and in 2000 this waste
grew to 4.6 million tons or more
than 92 billion sheets of paper
(Document Magazine). Since it
takes 17 trees to make a ton of
paper, the US used 78 million
trees worth of paper in 2000. To
make matters worse, the use of
paper is constantly and quickly
increasing. As shown above, the
use of paper more than doubled
within 7 years, and paper already
accounts for 40% of the municipal
solid waste stream. Digital copies
leave almost no environmental
scars.
|
| 8. |
Competitive: A winning team
plays together. While businesses
increasingly organize and automate
around the computer, paper
documents resist efforts to
increase productivity. When a
company automates paper processes,
it gains a clear advantage over
its competitors. Competitive
companies are trying to move
faster. For example, law firms
manage millions of pages of
documents, and it is imperative to
a court case that the right
documents and case files are
available to the right person at
the right time. To quote Ali
Shahidi of Alschuler, Grossman,
Stein, and Kahan LLP, a Santa
Monica law firm, "We're doing
things we couldn't have imagined a
few years ago. We're smarter,
better, and more nimble." At
present, 99% of paper
documents cannot be analyzed,
automated, or organized by a
computer. The paper part of the
office is a remnant from the last
century, and it doesn't allow a
company to move forward with the
modern techniques of the
information age.
|
| 9. |
Discovery: As the Internet
has proven, digital information is
the easiest information to find.
In a typical office, if an
important document is viewed by an
employee, there is a significant
chance that the document will be
lost forever. The average white
collar worker spends 1 hour each
day looking for lost documents
(from Esselte). Since digital
documents are not removed from the
filing cabinet, they are unlikely
to be lost. Even when a paper file
is not lost, the searcher may not
know what label it was filed
under. With the ability to
contain links, digital documents
are easier to cross-reference and
maintain context. The text of most
digital documents can be
recognized by the computer which
allows the searcher to search for
phrases inside the document.
Digital documents are easier to
find than paper files because
digital documents are unlikely to
be lost and digital text is
searchable by a computer.
|
|
|
|
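The arithmetic behind the cost,
speed, and environmental figures in
the list above can be checked with
a short sketch. Every number is
taken from the text; the variable
names are illustrative only.

```python
# Figures quoted in items 1, 3, and 7 above; names are illustrative.
DRIVE_PRICE_CENTS = 80.00 * 100   # $80.00 for the drive, in cents
DRIVE_BYTES = 250e9               # 250 GBytes
PAGE_BYTES = 1e3                  # about 1 KByte per compressed page

# Item 1: cost to store one page versus cost to print one page.
store_cents = DRIVE_PRICE_CENTS / DRIVE_BYTES * PAGE_BYTES
print_cents = 0.8 + 2.4           # paper plus ink, per sheet
cost_ratio = print_cents / store_cents

# Item 3: 20 minutes for a paper file versus 2 seconds for a link.
speed_ratio = (20 * 60) / 2

# Item 7: tons of copy paper in 2000 times trees per ton.
trees = 4.6e6 * 17

print(f"storage: {store_cents * 1e6:.0f} millionths of a cent per page")
print(f"printing costs about {cost_ratio:,.0f} times as much as storing")
print(f"digital retrieval is about {speed_ratio:.0f} times faster")
print(f"about {trees / 1e6:.1f} million trees")
```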
As we have shown, digital files
have many advantages over paper,
but only about 1% of the files in
the filing cabinet have been
digitized. With current
technology, it is not practical to
digitize a large percentage of the
documents. Users can get a good
image (e.g., JPEG) or good
compression (e.g., TIFF G4), but
they cannot get both a high-quality
image and a small file size
at the same time.
If the text is big and black and
the paper is clean and white (with
no writing), a threshold segmenter
followed by a statistical
compressor, such as TIFF G4, yields
file sizes that are usable (if a
little annoying) on a LAN.
Threshold segmentation is not the
"silver bullet" people need, and
the resulting images have a
FAX-like quality. If we were
looking at the yellow carbon copy
of a receipt, much of the
information would be lost. With
threshold segmentation, TIFF G4 has
adequate compression but poor image
quality.
JPEG has moderately good image
quality because it skips
segmentation altogether and
simply performs a discrete cosine
transform (with Huffman encoding)
on the image. The quality comes
at a price. Since all the noise is
left in the image, the compression
is relatively small. The user will
have to wait a long time for the
file to be served, transmitted, and
loaded. If the file were being
transmitted over the Internet, the
user could easily wait tens of
seconds to view a single page.
Without segmentation, JPEG has a
good image but poor compression.
To move the files from the filing
cabinet to the computer, we need a
solution that has both a good
image and high compression.
|
|
|
Human-Like Segmentation |
|
Threshold segmentation is
usually not good enough to handle
the range of documents a
business has.
For example, a common receipt
might have blue printing on yellow
paper with a red number in the
upper right-hand corner. It is not
uncommon for some of the blue
printing to be very fine while the
issuing company's name is in big
bold print. People often write and
stamp on the receipt. The colors
they use don't have to conform to
those on the receipt. The receipt
might be a carbon copy with the
attendant image degradation. The
handling of the receipt might be
an issue. It might have gotten
dirty or crumpled, and these are
only a small sample of image
degradation possibilities.
A simple receipt can challenge
the computer's ability to store
the image without loss while
achieving usable compression, and
there are many applications (such
as X-rays, drawings, and others)
that are even more difficult.
Threshold segmentation is fast
and easy, but the quality of
segmentation often
falls short of what is needed to
do the job. In fact, the standard
that everyone is held to is
human-like segmentation. If people
can't see what is on the receipt,
then the receipt carries the
blame.
Few things shed blame as well as
humans. A compression project with
less segmentation than a human
will dam the river of blame and
divert it through the IT
department.
Maybe the accounts payable
person did write "paid" on the
invoice that was paid twice, or
maybe the necessary scrawl was
omitted. Either way the computer
will be blamed if it has a
reputation for missing such
things. To avoid all of these
accusations, the
document imaging system
needs to segment like a human (or
better).
This means that threshold
segmentation is not good enough
except in certain conditions where
image quality can be guaranteed.
In fact, the only acceptable
segmentation for most of the paper
in the office is human-like
segmentation. Humans set the
standard.
When the computer industry
started moving paper documents
onto the computer, segmentation
was used to create a higher
contrast (or sharper) image. A
segmented image is more easily
compressed, because there are
fewer artifacts to compress.
Segmentation removes many small
defects (which we call noise) from
the picture.
These are all convenient
reasons to use segmentation, but
unlike our conventional document
imaging methodology, humans
require segmentation to achieve
extraction.
When we use our senses to
perceive something, we are actually
performing several steps. Since we
use our senses so much, these
steps have become second nature to
us, and may even be performed in
the subconscious.
The first step, which we call
segmentation, is that of grouping
like shades together. Let us use a
black letter 'e' on a white
background as an example. The
black would probably be many
different shades up and down the
'e' (as is typical of a black 'e'
scanned from a white sheet of
paper), but we would need
to group all of them together
before we could recognize the
letter 'e'.
While segmentation might seem
as if it adds artificiality to
the picture, recognition requires
segmentation. In other words, if
we want the picture to mean
anything to us, we will have to
segment it
in our head.
The process of finding a shape
(or any feature) of an object is
called extraction. The extracted
shape is compared to memory
(probably a database, although the
comparison isn't necessarily
limited to template matching). If
we can match the shape to a shape
stored in memory, we recognize the
shape. In our example, the
segmentation must bring the 'e' in
as a single region, or we will
extract a shape not recognized by
the database. In other words, when
parts of a letter are missing, it
is difficult to identify the
partial letter.
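The requirement that a letter come
in as a single region can be
illustrated with a small sketch.
It assumes the image has already
been segmented into foreground ('#')
and background ('.'), and it counts
4-connected regions with a flood
fill. The bitmaps and the function
name count_regions are hypothetical
and are not taken from any
particular product.

```python
from collections import deque

def count_regions(bitmap):
    """Count 4-connected regions of '#' pixels in a list of strings."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or seen[r][c]:
                continue
            regions += 1                      # found an unvisited region
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:                      # flood-fill the whole region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bitmap[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions

# A blocky 'e' segmented correctly: every stroke is joined.
WHOLE_E = [
    "#####",
    "#...#",
    "#####",
    "#....",
    "#####",
]

# The same 'e' after a faint stroke (row 3) was lost in segmentation.
BROKEN_E = [
    "#####",
    "#...#",
    "#####",
    ".....",
    "#####",
]

print(count_regions(WHOLE_E), count_regions(BROKEN_E))
```

The whole letter arrives as one
region, while the damaged letter
arrives as two pieces, neither of
which matches a stored 'e'.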
The document imaging industry's
use of segmentation to achieve
OCR extraction is similar, but
threshold segmentation is often a
fatal shortcut that prevents robust
extraction. For example, a
computer cannot perform OCR
extraction with the accuracy a
human can in a multicolor
environment.
As computers become more powerful,
they can afford to forsake threshold
segmentation for the more robust
edge detection segmentation that
is more intelligent and
human-like.
A computer with intelligent
human-like segmentation isn't
limited to more accurate
extraction. As the computer begins
to aggregate the image through
edge detection segmentation, it
can also achieve much better
compression than it could with
threshold segmentation.
To understand how these things
occur, let's backtrack to explain
the details of each type of
segmentation.
|
|
|
Threshold Segmentation |
|
Threshold segmentation is
the simplest type of segmentation.
In a simple example, if we had
black text on white paper, we
could set the threshold to gray.
All of the text that is darker
than the gray threshold would be
considered black, and any
background that was lighter than
the gray threshold would be
considered white. When the
segmentation is complete, there
are two colors in the picture.
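The simple case just described can
be sketched in a few lines. The
sketch assumes an 8-bit grayscale
scan held as a list of rows and a
gray threshold of 128; both the
sample values and the function name
threshold_segment are illustrative.

```python
def threshold_segment(gray_rows, threshold=128):
    """Map each 8-bit gray value to 0 (black) or 255 (white)."""
    return [[0 if value < threshold else 255 for value in row]
            for row in gray_rows]

# A tiny scan: dark text pixels on a light, slightly uneven background.
scan = [
    [250, 245,  40,  35, 248],
    [252,  60,  30, 240, 251],
    [247, 243,  90,  55, 249],
]

segmented = threshold_segment(scan)
for row in segmented:
    print(row)
```

After segmentation only two values
remain, which is what makes the
result so easy to compress.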
Threshold segmentation can be much
more complicated than this. For
example, a histogram of colors
could be taken across a region or
an entire picture. The more
predominant colors can be
considered the foreground and
background. The threshold could be
set at the middle color between
the foreground and background
colors. Of course, as threshold
segmentation becomes more
complicated, it runs slower.
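A minimal sketch of the histogram
approach follows. It assumes 8-bit
gray values, groups them into bins
16 levels wide, treats the two
fullest bins as background and
foreground, and places the threshold
midway between them. The bin width,
the sample data, and the name
histogram_threshold are assumptions
made for illustration.

```python
from collections import Counter

def histogram_threshold(gray_rows, bin_width=16):
    """Pick a threshold midway between the two most common gray bins."""
    counts = Counter(value // bin_width
                     for row in gray_rows for value in row)
    (bin_a, _), (bin_b, _) = counts.most_common(2)   # needs >= 2 bins
    center_a = bin_a * bin_width + bin_width // 2
    center_b = bin_b * bin_width + bin_width // 2
    return (center_a + center_b) // 2

# A faint copy: light ink (about 150) on a tinted sheet (about 200).
faint_scan = [
    [200, 205, 150, 155, 201],
    [203, 152, 149, 204, 202],
    [206, 200, 158, 151, 207],
]

threshold = histogram_threshold(faint_scan)
ink_pixels = sum(value < threshold for row in faint_scan for value in row)
print(threshold, ink_pixels)
```

A fixed gray threshold of 128 would
turn this whole sample white, while
the histogram places the threshold
between the ink and the paper. The
sketch is still coarse: if one peak
spreads across two neighboring bins,
both of the fullest bins can belong
to the same color.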
A standard 8.5 inch by 11 inch
sheet of paper scanned at 300
dots per inch in all three
primary colors yields
8.5 * 11 * 300 * 300 * 3 bytes, or
25,245 KBytes, of data, if we
assume 8 bits per primary color.
Even today's computers take a
noticeable amount of time to chew
through 25 MBytes of data.
Therefore, threshold segmentation
(typically a simple variety) is
usually used.
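The size calculation can be written
out as a short sketch. The page
geometry and color depth come from
the text; the comparison against 1
KByte per page borrows the figure
quoted in the Introduction and is
only a rough indication, since that
figure was not stated for a full
color scan.

```python
# Page geometry and color depth as stated in the text.
WIDTH_IN, HEIGHT_IN = 8.5, 11
DPI = 300
PRIMARY_COLORS = 3
BYTES_PER_COLOR = 1   # 8 bits per primary color

pixels = round(WIDTH_IN * DPI) * round(HEIGHT_IN * DPI)
raw_bytes = pixels * PRIMARY_COLORS * BYTES_PER_COLOR

# Rough target borrowed from the Introduction: about 1 KByte per page.
TARGET_BYTES = 1000
required_ratio = raw_bytes / TARGET_BYTES

print(f"{raw_bytes // 1000:,} KBytes raw")
print(f"compression needed: about {required_ratio:,.0f} to 1")
```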
Threshold segmentation should
be considered a fast but coarse
segmentation. Threshold
segmentation does not address a
number of segmentation problems
required by human-like
segmentation.
For example, transition distortions
are usually the largest distortion
introduced by a scanner, and
threshold segmentation does not
find or rebuild the border. In
many cases, a pattern of colors is
mixed together to provide some
information (for example, a logo),
and threshold segmentation will
not be able to sort through the
color complexities with human-like
intelligence.
|
|
|
Edge Detection Segmentation |
|
To segment like
a human, the segmenter needs to
mimic the human process of edge
detection segmentation. For about
30 years, people have been trying
to achieve high quality with
edge detection segmentation, but
it has only recently been
accomplished.
Edge detection segmentation
does a better job of supporting
image restoration. The edge of a
blob is distorted in a variety of
ways, and the edge needs to be
discovered before it can be
restored.
In the past, most edge detection
segmentation was too coarse to be
used in document imaging, but
computers can theoretically handle
edge detection better than humans.
Humans can only see about 200,000
colors, but computers typically
work with 16 million colors.
Optoelectronic sensing technology
is capable of many more.
When the computer does a better
job of segmenting, it can do a
better job compressing. If the
picture is over-segmented, the
baby is thrown out with the bath
water. In other words, the data
washes out along with the noise.
With better segmentation, more
noise can be corrected while
leaving the data.
The input distortion from the
optoelectronic capture appliance
(usually a scanner) prevents high
levels of compression in color
documents, so a big size and
quality difference can currently
be found between color documents
created on a computer and those
scanned from paper.
For example, a scanned color
document can be compressed by a
factor of about 3 with a
statistical compressor, but
the same file created on the
computer could be compressed by a
factor of about 100. Furthermore,
the computer-generated document
would be much cleaner and clearer.
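The effect of scanner noise on a
statistical compressor can be seen
in a small experiment. The sketch
below uses zlib (DEFLATE) from the
Python standard library as a
stand-in for a statistical
compressor and a synthetic gray
page in place of a real scan, so
the ratios it prints are not meant
to reproduce the factors of 3 and
100 quoted above. The page size,
noise level, and function names are
assumptions.

```python
import random
import zlib

WIDTH, HEIGHT = 200, 200

def make_page(noise, seed=0):
    """Build an 8-bit gray page: a dark block on a light background,
    with each pixel perturbed by up to +/- noise levels."""
    rng = random.Random(seed)
    pixels = bytearray()
    for y in range(HEIGHT):
        for x in range(WIDTH):
            inside = 50 <= x < 150 and 80 <= y < 120
            base = 30 if inside else 230
            pixels.append(base + rng.randint(-noise, noise))
    return bytes(pixels)

def compression_ratio(data):
    """Original size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, 9))

clean_ratio = compression_ratio(make_page(noise=0))    # computer-made
noisy_ratio = compression_ratio(make_page(noise=12))   # scanner-like

print(f"clean page: {clean_ratio:.0f} to 1")
print(f"noisy page: {noisy_ratio:.1f} to 1")
```

The two pages carry the same
information, but the noisy one
leaves the compressor almost
nothing to exploit.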
Edge detection segmentation is a
much more complicated and
computationally intensive
segmentation technique.
At first, edge detection might
seem simple, but it is complicated
by several factors.
|
|
|
| 1. |
Continuous Tone: We
may not be segmenting along a
clear color transition. In
fact, the image could be
continuous tone with nearly
unlimited color variations.
|
| 2. |
Image Distortion: Clear
color transitions are usually
smeared across 4 or 5 pixels in
two directions. Finer details
become blurry, yet humans are
able to segment some of this
and then expect the computer to
segment at a human level.
|
| 3. |
Fine Artifacts: Text (of
a specific font) comes with a
finite set of artifacts, and
they have a minimum size.
Artifacts below the minimum
can be ignored in text, but
they must be segmented in
continuous tone.
|
|
|
|
As we would expect, edge detection
segmentation takes much longer than
threshold segmentation.
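The smeared transition described
under Image Distortion can be
sketched on a single scanline. This
is a deliberately simple
one-dimensional gradient detector
and is not the method this paper
goes on to analyze. It assumes one
edge per scanline and takes the two
end samples as the true colors of
the neighboring regions. The sample
values and function names are
hypothetical.

```python
def find_edge(scanline):
    """Return index i where the step from scanline[i] to
    scanline[i + 1] is largest."""
    steps = [abs(scanline[i + 1] - scanline[i])
             for i in range(len(scanline) - 1)]
    return max(range(len(steps)), key=steps.__getitem__)

def rebuild_border(scanline, edge):
    """Replace the smeared transition with a sharp step at the edge.
    Assumes the end samples hold the true region colors."""
    left, right = scanline[0], scanline[-1]
    return [left] * (edge + 1) + [right] * (len(scanline) - edge - 1)

# Light ink (160) on tinted paper (220), smeared over several pixels.
smeared = [220, 220, 219, 212, 198, 178, 166, 161, 160, 160]

edge = find_edge(smeared)
restored = rebuild_border(smeared, edge)
print(edge, restored)
```

Every sample here is lighter than a
gray threshold of 128, so threshold
segmentation would lose the ink
entirely, while the gradient still
locates the border. Doing this in
two dimensions, for every
transition on a 25 MByte page, is
part of why edge detection is so
much slower.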
|
|
|