The Creation of Golden Files

Setting the Configuration File:
Although every part of Pac-n-Zoom can be driven through any of the three interfaces, to keep things simple we will use samples from the configuration file (which are given with a blue background) to show which values need to be set. The configuration file uses the standard Pac-n-Zoom data format. Five data segments in the configuration file relate directly to the golden files.
# BLOB COMPRESSOR PARAMETERS
0; Acceptable Tolerance
1; Write Golden File Flag
*
There are two settings in the "BLOB COMPRESSOR PARAMETERS" segment that need to be correctly set to write the golden file.

1. Acceptable Tolerance: The first setting in the "BLOB COMPRESSOR PARAMETERS" segment is the acceptable tolerance. For normal results, this value should be set to 0. A higher value is more efficient, but nonidentical parts of different letters will be substituted for each other. While most people won't notice a value that is less than 3, it might bother some people.

2. Golden File Flag: The second setting in the "BLOB COMPRESSOR PARAMETERS" segment is a flag that determines whether the golden file should be written. If the flag is set (1), the golden file is written. If the GUI were being used, the golden file would be saved as a *.pzg file.
# CURRENT BC GOLDEN FILE NAMES
;PACNZOOM
*
This is a list of golden files that will be used to build the current golden files. In most cases, no golden files should be used to build additional golden files, so there are no file names in this list. In the example above, the file name is commented out.

If golden file names are given and the acceptable tolerance is 0 (as it should be), the only effect of these extra golden file names is to make the program take longer to create the additional golden file.

If golden file names are given and the acceptable tolerance is not zero, the additional golden file will contain morphological genetics that are within the acceptable tolerance of the given golden files.
# NEXT BC GOLDEN FILE NAME
PACNZOOM
*
This segment contains the name of the next golden file. In the example above, the next golden file will be named "PACNZOOM.pzg".
# NEXT BC GOLDEN FILE FLAGS
Painted File
*
Frame flags can be written, but the reading of frame flags is not currently supported. This segment contains the flags of the next golden file. In the example above, the next data frame will be flagged:
~ Painted File

Frame flags are a tool for organizing golden files. For example, if the flag were used to select golden file frames, only those frames flagged with "Painted File" would be read.

Extra golden frames will slow down Pac-n-Zoom encoding, and the data could be snapped to an unintended golden figure.

If only one frame is used per file, the same result can be achieved with the segment header: # CURRENT BC GOLDEN FILE NAMES.
# BC GOLDEN TEXT
Times Roman 41 ver. 3
English
Bold
Italics
Underlined
*
The first line of this segment contains the name of the font. All the rest of the lines in this segment are attributes of the font. Instead of using the old pitch system, the number of vertical pixels in the tallest letter should be used to name the font. This convention, when combined with the size of the image, will produce more consistent results across different media and hold truer to the original document.

The information in this segment allows the figure to be recognized (i.e., OCR), and, as far as we know for the first time, the font and attributes can be recognized as well. While this might seem handy, the same formatting can look different in different programs. In other words, porting formatting from one program to another is tricky.

It is still a good idea to fill this information in. Without it, there will probably be 50,000 undocumented golden files, and the time will likely come when some of these will be preferred over others. There won't be any good clear way to do that without this data segment.
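
The five segments described in this section could be generated by a script rather than typed by hand. The following Python sketch only illustrates the segment syntax visible in the samples above (a '#' header line, the body lines, and a closing '*'). The function names render_segment and golden_config are hypothetical, and a real configuration file contains other segments that are not shown here.

```python
def render_segment(header, lines):
    """Return one data segment: '# HEADER', the body lines, then '*'."""
    return "\n".join([f"# {header}", *lines, "*"]) + "\n"


def golden_config(name, flags, font, language, attributes, tolerance=0):
    """Build the five golden-file segments as one block of text.

    Assumption: the segments may simply be written one after another.
    The current golden file list is left empty, as the text recommends.
    """
    segments = [
        ("BLOB COMPRESSOR PARAMETERS",
         [f"{tolerance}; Acceptable Tolerance", "1; Write Golden File Flag"]),
        ("CURRENT BC GOLDEN FILE NAMES", []),
        ("NEXT BC GOLDEN FILE NAME", [name]),
        ("NEXT BC GOLDEN FILE FLAGS", list(flags)),
        ("BC GOLDEN TEXT", [font, language, *attributes]),
    ]
    return "".join(render_segment(header, body) for header, body in segments)


if __name__ == "__main__":
    print(golden_config("PACNZOOM", ["Painted File"],
                        "Times Roman 41 ver. 3", "English",
                        ["Bold", "Italics", "Underlined"]))
```

Keeping the tolerance as a parameter that defaults to 0 matches the recommendation above while still allowing other values to be tried.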

Making the Graphic File:
Since so many golden files are needed, it is easier to create them automatically. During this process, care should be taken to make perfect characters. There should be no stretching or scanning. If OCR is desired, the characters must be clustered to be identified, and the characters must be replicated before they are clustered. After the golden file is written, the characters need to be identified. To make the identification process automatic, the file should be created with the characters in lexicographic (or, better, ASCII) order. The following example is a good form to follow.
! !
" "
# #
% %
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< <
= =
> >
? ?
@ @
A A
B B
C C
D D
E E
F F
G G
H H
I I
J J
K K
L L
M M
N N
O O
P P
Q Q
R R
S S
T T
U U
V V
W W
X X
Y Y
Z Z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
~ ~
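
The list above can be produced by a few lines of script instead of being typed in. This is a minimal Python sketch; the function name golden_text_rows and its skip parameter are our own. The sample above omits '$' and '&' (the text does not say why), so the skip parameter lets the caller reproduce that list or include every printable ASCII character.

```python
def golden_text_rows(skip=""):
    """Return rows such as 'A A' for printable ASCII '!' through '~'.

    Each character appears twice per row so that the two identical
    blobs can form a cluster. Characters listed in `skip` are left out.
    """
    rows = []
    for code in range(ord("!"), ord("~") + 1):
        ch = chr(code)
        if ch not in skip:
            rows.append(f"{ch} {ch}")
    return rows


if __name__ == "__main__":
    # Reproduce the sample list above, which has no '$' or '&' rows.
    print("\n".join(golden_text_rows(skip="$&")))
```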

Running Pac-n-Zoom:
If only a few golden files need to be made (for example, for a special font the company might use), it is probably easiest to use the manual method. When many files need to be made, some sort of automation should be used.

I. Manual Method: The manual method contains four steps.

A. Create: The graphic file should be created with perfect fonts. A word processor, text editor, paint program, or some other program with textual abilities should be used. The file should not be scanned in or captured with any sort of digital sensor, because there should be no noise in the image.

B. Print: The file should be printed or saved to a bitmap (*.bmp) file.

C. Configure: The configuration file should be set as shown above.

D. Run: Run Pac-n-Zoom and save the file as a golden (*.pzg) file.

II. Semiautomatic Method: When more than a few golden frames need to be created, it is probably easier to automate part of the process. If the program creating the graphic file has macros, the entire process could be automated, but getting it to work might take longer than the time it saves. The following provides a method for a semiautomatic solution.

A. Software: The following software programs are needed to use the suggested method.

1. Macro Recorder: If the program that creates the graphic file does not have macros, the process can still be automated by using a separate program to record the mouse and keyboard activity. The following list of programs is not exhaustive. It represents a fraction of a short Internet search.
a. Workspace Macro Pro - Automation Edition 5.5 from Tethys Solutions
b. Journal Macro from Chosen Software
c. EZ Macros 5.0a by American Systems
d. Eventcorder by CMS


2. Pac-n-Zoom:

3. Textual Program: This can be almost any program that can produce text. Some common utilities are WordPad, Notepad, and Paint. If the program has macros, it will be easier to automate. It is very useful if the program lets the desired font pitch be typed in rather than picked from a fixed list. Better results can be obtained with more pitch sizes.

4. Printer Driver: A virtual printer driver that can print to bitmap (*.bmp) will probably make the most accurate golden file. The more resolutions the virtual printer driver lets you select, the better the results will be. A short Internet search turned up 3 virtual printer drivers that are listed here.
a. Zan Image Printer 4.0 from zan1011.com
b. Soft Copy 2.x from Dobysoft
c. KazStamp 9.0 from Kaczynski Software


5. Script Language: The script language will need to be able to read and modify a file. It will launch and terminate Pac-n-Zoom, and it will feed Pac-n-Zoom's process controller (not to be confused with the OS process controller) the remote commands that Pac-n-Zoom will execute.

There are many scripting languages that can do these things. We use Python, but Perl and others would work well.

B. Bitmaps: There are many different ways to make the graphic files. The following method is one of many.

1. List: In this method, we will make all the graphic files we need and keep a list of the file names and text properties that were used to build each file. This list should be kept in order and delimited, because some script will use the list to modify the Pac-n-Zoom configuration file.

2. Text: Put in the text as shown above.

3. Printer: Select the virtual printer driver that will print to a bitmap file, and set the printer to the desired resolution. The printing and the scanning should be at the same resolution. For example, if the scanning is going to be at 300 dots per inch (DPI), the printing should be set to 300 DPI. This will help the attributes to be closer to the desired results.

4. Font: Set the font of the text and store to the list.

5. Pitch: Set the pitch of the text and store to the list.

6. Attribute: Set the attribute of the text and store to the list.

7. Name: Set the name of the printed file and store it to the list.

8. Print: Save the image to a bitmap (*.bmp) file by printing the image.

9. Next: Start building the next graphic file by going back to step 4 (Font).
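
The ordered, delimited list built in these steps could be kept as a CSV file. The following Python sketch is one possible layout and not a format that Pac-n-Zoom defines: the BitmapRecord field names, the ';' separator inside the attributes field, and the file names in the example are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class BitmapRecord:
    bitmap: str        # name of the printed bitmap file
    golden_name: str   # name to give the golden file
    font: str          # font name
    pitch: int         # font size used when printing
    attributes: str    # assumed format: attributes joined by ';'
    flags: str         # frame flags for this text frame


def write_records(records, stream):
    """Write the records, in order, as CSV with a header row."""
    names = [f.name for f in fields(BitmapRecord)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def read_records(stream):
    """Read the records back in the same order they were written."""
    return [BitmapRecord(**{**row, "pitch": int(row["pitch"])})
            for row in csv.DictReader(stream)]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_records([BitmapRecord("TimRom41.bmp", "PACNZOOM", "Times Roman",
                                41, "Bold;Italics", "Painted File")], buffer)
    print(buffer.getvalue())
```

A header row keeps the list readable by hand, and the csv module takes care of quoting if a font name ever contains a comma.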

C. Pac-n-Zoom: We are now ready to use a master script that will create the golden files by running Pac-n-Zoom.

Pac-n-Zoom obtains the attributes of the golden file frame from the Pac-n-Zoom configuration file, which is read only when Pac-n-Zoom is launched. For each bitmap file, we will need to take these steps.

1. Read: Read the following from the list that was created when building the bitmap files.

a. Bitmap: Read the name of the bitmap file.

b. Name: Read the name assigned to the golden file. The golden file can contain more than one text frame.

c. Font: Load the name of the font.

d. Pitch: Get the size of the font.

e. Attributes: Get the list of attributes.

f. Flags: Read the flags that will be set in this text frame.

2. Configure: The "# BLOB COMPRESSOR PARAMETERS", "# CURRENT BC GOLDEN FILE NAMES", and most other data segments don't change during the execution of the master script, but the following need to be set differently for each bitmap file.

a. Name: The "# NEXT BC GOLDEN FILE NAME" data segment should contain the name of the golden file with the extension ("*.pzg").

b. Flag: The "# NEXT BC GOLDEN FILE FLAGS" data segment should contain the flags desired in the golden frame.

c. Font: The first line of the "# BC GOLDEN TEXT" data segment should contain the font and pitch.

d. Language: The second line of the "# BC GOLDEN TEXT" data segment should contain the language being used.

e. Attributes: The third line and on of the "# BC GOLDEN TEXT" data segment should contain the remaining attributes of the font.

3. Launch: The master script launches Pac-n-Zoom.

4. Load: A remote command is used to load the bitmap.

5. Write: A remote command orders the golden file to be written. Since the configuration file is set, the program defaults to writing a golden file and to the golden file name.

6. Terminate: The master script terminates Pac-n-Zoom.

7. Return: As long as there are unprocessed bitmap files, go back to step 1 (Read).
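
The control flow of the master script can be sketched as follows. This page does not give the Pac-n-Zoom command line or the remote command syntax, so the Python sketch below takes each action as a caller-supplied function. All of the names (build_golden_files, configure, launch, load_bitmap, write_golden, terminate) and the record keys are placeholders. The point is only the ordering: the configuration file is rewritten before each launch, because it is read only at launch.

```python
def build_golden_files(records, configure, launch, load_bitmap,
                       write_golden, terminate):
    """Run steps 2 through 6 once per record and return the names written.

    `records` is the list read in step 1; each record is assumed to be a
    mapping with at least the keys "bitmap" and "golden_name".
    """
    written = []
    for record in records:
        configure(record)                  # step 2: rewrite the config file
        session = launch()                 # step 3: config is read here
        try:
            load_bitmap(session, record["bitmap"])   # step 4
            write_golden(session)                    # step 5
        finally:
            terminate(session)             # step 6: always shut down
        written.append(record["golden_name"])
    return written


if __name__ == "__main__":
    # Dry run with stand-in actions that only print what they would do.
    build_golden_files(
        [{"bitmap": "TimRom41.bmp", "golden_name": "PACNZOOM"}],
        configure=lambda r: print("configure", r["golden_name"]),
        launch=lambda: print("launch") or "session",
        load_bitmap=lambda s, b: print("load", b),
        write_golden=lambda s: print("write"),
        terminate=lambda s: print("terminate"),
    )
```

Passing the actions in as functions also makes it possible to rehearse the whole batch without starting Pac-n-Zoom.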


Setting the Cluster IDs:
While Pac-n-Zoom is able to group things into clusters within an acceptable tolerance, without using other golden files (which would be pointless) it does not have the ability to identify the clusters in a meaningful human-like way. Since a typical application might use several thousand golden files, the task of manually identifying each cluster would likely be long and tedious. If the golden file was made in the order shown above, a program can automatically identify the clusters by using the following format.

I. Data Segment: There are several data segments inside a font frame, and data segments can have any order. The cluster data segment needs to be found.

A. Opening: Data segments are opened when no other data segment is opened and when a '#' character is the first character of the line.

B. Closing: A data segment is closed (assuming it is opened) when a '*' character is the first character of the line.

C. Identification: The cluster data segment is identified with "# Clusters". To identify the cluster segment, the following checks should be made.

1. Open: A check should be made that no other segments are opened.

2. '#': The program should check that '#' is the first character of the line.

3. " Clusters": " Clusters" should follow the '#' found in step 2.
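
The opening, closing, and identification rules above amount to a small state machine. The following Python sketch is one way to apply them; the function name find_segment is hypothetical. It assumes the frame has already been split into lines and that the header line matches "# Clusters" exactly once trailing whitespace is removed, so that "# Cluster Patterns" is not mistaken for it.

```python
def find_segment(lines, header="# Clusters"):
    """Return the body lines of the first segment with the given header.

    A segment opens on a line starting with '#' while no segment is open
    and closes on a line starting with '*'. Returns None if the segment
    is missing or never closed.
    """
    open_header = None   # header of the currently open segment, if any
    body = []
    for line in lines:
        if open_header is None:
            if line.startswith("#"):
                open_header = line.rstrip()
                body = []
        elif line.startswith("*"):
            if open_header == header:
                return body
            open_header = None
        elif open_header == header:
            body.append(line)
    return None


if __name__ == "__main__":
    frame = ["{", "~ Painted File", "", "# Cluster Patterns", "9", "*", "",
             "# Clusters", "U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18", "*", "}"]
    print(find_segment(frame))
```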

II. Cluster Row: A cluster row consists of two parts and has the following characteristics.

A. Preamble: When the golden file is initially written, all of the preambles are unidentified. The identifying program changes the cluster row preamble to one of the identified formats. The cluster row preamble has the following formats.

1. Unidentified: An unidentified preamble has the following format.

U| Height | Width | Column | Row |

a. U: Stands for unidentified

b. Height: The maximum height of the cluster in pixels

c. Width: The maximum width of the cluster in pixels

d. Column: The column of the initial pixel

e. Row: The row of the initial pixel

2. Text: A text preamble has 1 field that contains the text character.

3. Graphic: A graphic preamble has the following format. The graphic can be any shape.

| Name | Height | Width |

a. Name: The name of the cluster that was manually inserted.

b. Height: The maximum height of the cluster in pixels.

c. Width: The maximum width of the cluster in pixels.

B. Frame: The cluster frame (not to be confused with the font or data frame) follows the last vertical bar, '|'.

C. Line: With no exceptions, there is one cluster row on each line of data in the cluster data segment.

D. Samples:

1. Unidentified: U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18

2. Unidentified: U|107|410|155|962|5A B4 01 22 04 B7 A0 1

3. Text: |a|5A 8C E1 0A 23 32 33 18

4. Graphic: |School|107|410|5A B4 01 22 04 B7 A0 1
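
The three preamble formats can be told apart mechanically. The Python sketch below is an illustration rather than the parser Pac-n-Zoom itself uses; the ClusterRow type and parse_cluster_row function are our own names. It relies on the rule that the cluster frame follows the last vertical bar, and it assumes that a preamble with three fields whose last two are numeric is a graphic preamble.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterRow:
    kind: str                      # "unidentified", "text", or "graphic"
    frame: str                     # everything after the last '|'
    name: Optional[str] = None     # text character or graphic name
    height: Optional[int] = None
    width: Optional[int] = None
    column: Optional[int] = None
    row: Optional[int] = None


def parse_cluster_row(line):
    """Split one cluster row into its preamble fields and cluster frame."""
    preamble, bar, frame = line.rstrip("\n").rpartition("|")
    if not bar:
        raise ValueError("cluster row has no '|'")
    if preamble.startswith("U|"):
        parts = preamble[2:].split("|")
        if len(parts) != 4:
            raise ValueError("unidentified preamble needs 4 numeric fields")
        height, width, column, row = (int(p) for p in parts)
        return ClusterRow("unidentified", frame, height=height,
                          width=width, column=column, row=row)
    if not preamble.startswith("|"):
        raise ValueError("preamble must start with 'U|' or '|'")
    parts = preamble[1:].split("|")
    if (len(parts) == 3 and parts[1].strip().isdigit()
            and parts[2].strip().isdigit()):
        return ClusterRow("graphic", frame, name=parts[0].strip(),
                          height=int(parts[1]), width=int(parts[2]))
    # Text preamble: blanks are dropped, so "| |" names the '|' character.
    return ClusterRow("text", frame, name=preamble[1:].replace(" ", ""))


if __name__ == "__main__":
    print(parse_cluster_row("U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18"))
    print(parse_cluster_row("|School|107|410|5A B4 01 22 04 B7 A0 1"))
```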

III. Identification Strategy: The objective of the software is to accurately convert the unidentified clusters into text clusters. This process is not as simple as treating each cluster row as another character.

A. Blob: A blob is a group of pixels that meet the following conditions.

1. Color: All the pixels in a blob are the same color.

2. Adjacent: Each pixel in the blob touches at least one other pixel in the blob.

B. Cluster: When two blobs are within the acceptable tolerance (in the case of golden files, the acceptable tolerance should be zero) of each other, they form a cluster. This is why we put two identical characters on each row to build a golden file image.

1. Cluster Row: Each cluster row contains a cluster, but a cluster is not necessarily a character. For example, the letter 'i' has both a dot cluster and a body cluster.

2. Super Cluster: While retaining their cluster status, the dot and body of the 'i' also fold together to form a new cluster which is the character 'i'. A super cluster is the folding of two or more clusters.

C. Spacing: By definition, a font of a certain pitch will have an exact number of text rows per a given length. This rule can be relied upon to determine which cluster row contains the cluster that holds the text character.

D. Beginning: In an unidentified cluster, the third and fourth fields hold the column and row, respectively, of the initial pixel of the cluster (see the format in II.A.1). The initial pixel is the top (most northern) pixel. If there is more than one top pixel, the leftmost (most western) one is the initial pixel.

E. Order: The super clusters are ordered after their cluster components. For example, while the clusters are ordered in the scan (most north then most west is first) order, in the letter 'i', the body of the 'i' is ordered before the super cluster that is the character 'i'.

Then, the last cluster whose beginning ( D. from above) falls within the spacing ( C. from above) is the super cluster that needs to be changed from an unidentified cluster ( II.A.1. from above) to a text cluster ( II.A.2. from above).
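
One possible reading of this strategy is sketched below in Python. It rests on several assumptions that the text does not spell out: that each text row of the image holds one character (printed twice), that the row spacing and the top of the first text row are known in pixels, and that a cluster belongs to the text row whose band contains the row of its initial pixel. The function name and parameters are hypothetical.

```python
def pick_character_clusters(initial_rows, first_top, spacing, characters):
    """Map cluster indexes to characters.

    initial_rows -- row of the initial pixel of each unidentified cluster,
                    in the order the clusters appear in the segment
    first_top    -- pixel row where the first text row begins (assumed known)
    spacing      -- height of one text row in pixels (assumed known)
    characters   -- characters in the order they were printed

    Within each band of `spacing` pixels, the last cluster is taken to be
    the super cluster, because super clusters are ordered after their
    component clusters.
    """
    last_in_band = {}
    for index, row in enumerate(initial_rows):
        band = (row - first_top) // spacing
        last_in_band[band] = index   # later clusters overwrite earlier ones
    labels = {}
    for band, index in last_in_band.items():
        if 0 <= band < len(characters):
            labels[index] = characters[band]
    return labels


if __name__ == "__main__":
    # Three clusters fall in the first band (for example dot, body, and
    # super cluster of 'i'); only the last of them is labelled.
    print(pick_character_clusters([3, 10, 3, 55, 104], 0, 50, "ijk"))
```

Clusters that are not labelled stay unidentified, which fits the dot and body of the 'i' that remain clusters in their own right.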

IV. Format Exceptions: The following text symbols could create problems or confusion.

A. 'U': The first character of an unidentified cluster row is the letter 'U', but the first character of a text or graphic cluster row that has a class of 'U' is a vertical bar.

B. '|': The program reads "|||" as an alternative for a line return. Therefore, the class of a graphic or text cluster that is named '|' should use a space, which is deleted anyway. The following example illustrates the method.

# Clusters
| ||5A 8C E1 0A 23 32 33 18
*

C. " |": A blank is ignored, because a blank, ' ', cannot be a cluster. In other words, the following cluster is the same as the '|' cluster given above; that is, '|' is the same cluster as " |".

# Clusters
| ||5A 8C E1 0A 23 32 33 18
*
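
When the identifying program writes a text cluster row, these exceptions can be handled in one place. The following Python sketch uses a hypothetical function name, text_cluster_row. It writes the '|' character with a leading blank, as in the samples above, so that the row never contains "|||".

```python
def text_cluster_row(character, frame):
    """Return a text cluster row such as '|a|5A 8C' for one character.

    A blank cannot be a cluster, so it is rejected. The '|' character is
    written as ' |' so that the row does not contain '|||', which the
    program reads as a line return.
    """
    if len(character) != 1 or character == " ":
        raise ValueError("a text cluster needs exactly one non-blank character")
    label = " |" if character == "|" else character
    return f"|{label}|{frame}"


if __name__ == "__main__":
    print(text_cluster_row("|", "5A 8C E1 0A 23 32 33 18"))
    print(text_cluster_row("U", "5A 8C E1 0A 23 32 33 18"))
```

The 'U' case needs no special handling when writing, because a text row always begins with a vertical bar.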

V. Sample Frame: The following sample provides some clarity into the cluster identification process. The clusters will have no meaning without the rest of the frame.

A. Unidentified:
{
~ Painted File

# Media File Name
TimRom41.pnh
*

# Media File Size
01 C1; Width
02 9E; Height
*

# TEXT
Times Roman 41
English
*

# Shapes
EE DF 40 00 00 00 00 00
FE EF F0 F0 0F 00 00 0B 0
*

# Borders
C0 03 6D CC C
F6 54 00 A4 08 81 0
*

# Cluster Patterns
9
8
5
*

# Clusters
U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18
U|107|410|155|962|6A 8C C1 8E 29 82 B3 84 34 5
*
}

B. Identified:
{
~ Painted File

# Media File Name
TimRom41.pnh
*

# Media File Size
01 C1; Width
02 9E; Height
*

# TEXT
Times Roman 41
English
*

# Shapes
EE DF 40 00 00 00 00 00
FE EF F0 F0 0F 00 00 0B 0
*

# Borders
C0 03 6D CC C
F6 54 00 A4 08 81 0
*

# Cluster Patterns
9
8
5
*

# Clusters
|a|5A 8C E1 0A 23 32 33 18
|b|6A 8C C1 8E 29 82 B3 84 34 5
*
}

