STEPS IN THE PROCESS OF DIGITIZATION - Developing Sustainable Digital Libraries

The following four steps are involved in the process of digitization.

1. Scanning
2. Document image processing (DIP)
3. Electronic Filing System (EFS)
4. Document Management Systems (DMS)

provides all or more of these functions:

Scanning

It is the process of converting hardcopy data (as per our scope of this chapter) into digital form.

The scanning produces a raster (picture) image that can be stored on a computer. The scanner should preferably have both a flatbed and an ADF (Automatic Document Feeder). Scanning involves acquiring an electronic image of an original, which may be a photograph, text, manuscript, etc., into the computer using an electronic image scanner.

An image is scanned at a predefined resolution and dynamic range. The resulting file, called a “bitmap page image”, is formatted (image formats are described elsewhere) and tagged for storage and subsequent retrieval by the software package used for scanning. Acquisition of images through a fax card, electronic camera or other imaging devices is also feasible. However, image scanners are the most important and most commonly used component of an imaging system for the transfer of normal paper-based documents.

There are a number of challenges in ensuring that digitized paper records remain accessible and usable. The digitization program should be carefully planned to meet appropriate standards and avoid the need to repeat work. Consideration must also be given to the categorization and storage of the original paper documents that are digitized, i.e. codification and classification. The important digitization issues regarding the accessibility and usability of digitized paper records include file formats, image quality, the way the image files are stored and the process adopted to accomplish the digitization.

It is advisable to have a high level of understanding of the technical aspects of scanning within the organization prior to implementing a digitization program. The quality of a digital image can be monitored at the time of capture through the following factors:

A. Resolution
B. Bit depth
C. Compression
D. Threshold
E. File format

Resolution

Resolution denotes the number of dots spread over an area. It is measured in dots per inch, abbreviated as “DPI”. Technically, images are formed of pixels or dots. When the resolution is increased, the image captures finer detail. Pixels (picture elements) can be considered the building blocks of all digital images. They are square cells of a single colour or shade. Pixels arranged in a regular grid pattern form the digital image.

The resolution of a digital image is the density of pixels (measured per inch) that make up the image, expressed in pixels per inch (PPI). Occasionally, an image will be described by its pixel dimensions rather than its pixel density. By determining the dimensions of the source material in inches and dividing the provided horizontal and vertical pixel totals by them, the pixel density of the image can be worked out.

Scanning is generally done at 150 or 200 dpi; if the image quality is poor, the resolution can be raised up to the standard recommended maximum of 600 dpi. It is important to note that memory size increases with dpi. The dpi for each kind of document should therefore be chosen judiciously in order to optimize memory. Some preservation projects scan at 600 dpi for better quality.

A standard SVGA/VGA monitor has a resolution of 640 x 480 pixels, while ultra-high-resolution monitors have a resolution of about 2048 x 1664 pixels (about 150 dpi). For example, a 1024 x 768 image displayed full screen on a 17” monitor (viewing size 13” x 10”) has a resolution of approximately 80 PPI.
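The pixel-density calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text; the function name `pixels_per_inch` is hypothetical, and the figures are those of the monitor example.

```python
def pixels_per_inch(pixels_wide, pixels_high, width_in, height_in):
    """Return (horizontal, vertical) pixel density in pixels per inch."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("physical dimensions must be positive")
    return pixels_wide / width_in, pixels_high / height_in


# The monitor example: a 1024 x 768 image shown on a 13" x 10" viewing area.
h_ppi, v_ppi = pixels_per_inch(1024, 768, 13, 10)
print(f"{h_ppi:.1f} x {v_ppi:.1f} PPI")  # prints "78.8 x 76.8 PPI"
```

Both values round to the "approximately 80 PPI" quoted in the text.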

Recommended Resolutions

DPI is a measure of printing resolution; in particular the number of individual dots of ink a printer or toner can produce within a linear one-inch space.

Due to the similarity with other measurements of graphical resolution, the DPI measurement is frequently misused, for instance, to specify a scanner’s sampling resolution or the number of pixels per inch in a computer display. Using DPI measurement in these cases is generally considered to be inaccurate and misleading, though the intended meaning is usually clear based on context.

In these cases, a measure given in DPI can be taken as the number of pixels per inch.

Bit Depth

Bit depth is the number of bits used to define each pixel, and it determines how many colour combinations a pixel can represent. The greater the bit depth, the greater the number of grey-scale or colour tones that can be represented.

The term dynamic range is used to express the full range of tonal variation, as measured by a densitometer, between the lightest and darkest areas of a document.

Consideration With Regard To Digital Images:

Digital images can be captured at varied densities or bits per pixel depending upon i) the nature of the source material or document to be scanned; ii) the target audience or users; and iii) the capabilities of the display and print subsystems that are to be used.

Bitonal (black & white, or binary) scanning represents each pixel with one bit, either “0” (black) or “1” (white), and is generally employed in libraries to scan pages containing text or line drawings.

In grey-scale scanning, multiple bits, ranging from 2 to 8, are assigned to each pixel to represent shades of grey. This process is used for reliable reproduction of the intermediate or continuous tones found in black & white photographs. Although each bit is either black or white, as in the case of bitonal images, the bits are combined to produce a level of grey in the pixel that is black, white or somewhere in between.

Lastly, in colour scanning, typically 2 (lowest quality) to 8 (highest quality) bits per primary colour are used to represent colour; this mode can be employed to scan colour photographs. As in the case of grey-scale scanning, multiple bits per pixel are used. Colour images are evidently more complex than grey-scale images because they involve encoding shades of each of the three primary colours, i.e. red, green and blue (RGB). If a coloured image is captured at 2 bits per primary colour, each primary colour can have 2 x 2 or 4 shades, and each pixel can take any of 4 shades for each of the three primary colours (4 x 4 x 4 = 64 combinations).

Evidently, increase in bit depth increases the quality of image captured and the space required to store the resultant image. Generally speaking, 12 bits per pixel (4 bits per primary colour) is considered minimum pixel depth for good quality colour image.
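The relationship between bit depth and the number of tones can be checked with a short Python sketch. The helper names `tones` and `rgb_colours` are hypothetical, introduced only for illustration.

```python
def tones(bits):
    """Number of distinct values a pixel (or one primary colour) can take."""
    return 2 ** bits


def rgb_colours(bits_per_primary):
    """Number of distinct colours when red, green and blue each get the same bits."""
    return tones(bits_per_primary) ** 3


print(tones(1))        # bitonal: 2 values (black or white)
print(tones(8))        # 8-bit grey scale: 256 shades
print(rgb_colours(2))  # 2 bits per primary: 4 shades each, 64 colours
print(rgb_colours(4))  # 12 bits per pixel: 4096 colours
print(rgb_colours(8))  # 24-bit colour: 16777216 colours
```

Each extra bit doubles the number of tones, which is why storage requirements grow quickly with bit depth.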

Recommended Bit Depths

A “bit” is the fundamental unit of computer information having two possible values, either 0 or 1.

Table 1.

Document type                     Page size         Resolution
Standard text documents           Up to A3          200 PPI
Oversized documents, e.g. maps    Larger than A3    200 PPI
Photographs                       6” x 4”           600 PPI
                                  7” x 5”           430 PPI
                                  9” x 6”           300 PPI

Table 2.

Document type                                   Bit depth
Black and white text only                       1-bit bi-tonal
Text with some colour                           8-bit colour
Text with shades of grey                        8-bit grey
Colour drawings / presentations / graphics      8-bit colour
Black and white photographs                     8-bit grey
Colour photographs                              24-bit colour

Bit depth is the number of bits used to describe the colour of each pixel. Greater bit depth allows a greater range of colours or shades of grey to be represented by a pixel. Using multiple bits increases choice and variety, at the expense of increased file size. For example, using only 1-bit pixels gives 2 colours, usually either black or white. Using 4 bits gives 16 colour choices (i.e. 2 x 2 x 2 x 2).

Typical bit depths are described below.

Compression and File Size (Calculating File Size)

Compression:

Image compression is the process of reducing the size of an image by abbreviating the repetitive information such as one or more rows of white bits to a single code.
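The idea of abbreviating a run of identical bits to a single code can be illustrated with run-length encoding. The sketch below is a simplified stand-in, not the actual scheme used by any particular image format; the function names are hypothetical, and it follows the convention used in this chapter of 1 for white and 0 for black.

```python
def rle_encode(bits):
    """Collapse a row of pixel values into [value, run_length] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([bit, 1])     # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original row."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row


# A 40-pixel row: 20 white pixels, 3 black pixels, 17 white pixels.
row = [1] * 20 + [0] * 3 + [1] * 17
encoded = rle_encode(row)
print(encoded)                      # [[1, 20], [0, 3], [1, 17]]
print(rle_decode(encoded) == row)   # True
```

Forty pixel values are reduced to three pairs, and decoding restores the row exactly.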

Image files are evidently larger than textual ASCII files. A black & white image of a page of text scanned at 300 dpi is about 1 MB in size, whereas a text file containing the same information is about 2-3 KB. It is thus necessary to compress image files so as to achieve economical storage, processing and transmission over the network.

The compression algorithms may be grouped into the following two categories:

Lossless Compression: The compression process encodes repeated information by means of a mathematical algorithm, and the result can be decompressed into the original image with absolute fidelity, without losing any detail. No information is “lost” or “sacrificed” in the process of compression. Lossless compression is primarily used for bitonal images.

Lossy Compression: The lossy compression process discards or averages details that are least significant or which may not make an appreciable difference to the quality of the image. This kind of compression is called “lossy” because when an image compressed using lossy techniques is decompressed, it will not be an exact replica of the original image. Lossy compression is used with grey-scale/colour scanning, and in particular with complicated images where merely abbreviating the information will not result in any appreciable file savings.
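The difference between the two categories can be demonstrated with Python's standard `zlib` module. This is a minimal sketch under stated assumptions: the sample grey values are made up, and the `quantise` helper is a hypothetical stand-in for "discarding least significant detail". It is not how JPEG or any real lossy image codec works.

```python
import zlib

# Hypothetical 8-bit grey samples (values 0-255), repeated to form a small "image".
pixels = bytes([12, 13, 12, 200, 201, 199, 12, 13] * 32)

# Lossless: the decompressed data is identical to the original.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
print(restored == pixels)  # True


def quantise(data, step=16):
    """Illustrative lossy step: round every value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


# Lossy: small tonal differences are discarded before storage,
# so the original values can no longer be recovered.
lossy = quantise(pixels)
print(lossy == pixels)     # False
print(list(lossy[:8]))     # [0, 0, 0, 192, 192, 192, 0, 0]
```

After quantisation the neighbouring values 12 and 13 both become 0, so the fine variation between them is gone for good, whereas the lossless round trip returns every byte unchanged.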

Recommendation

Compression is a necessity in digital imaging, but more important is the ability to output an uncompressed, true replica of each image. This is especially important when images are transferred from one platform to another or are handled by software packages under different operating systems.

Uncompressed images often work better than compressed images for different reasons. It is thus suggested that scanned images should be either stored as uncompressed images or at the most as lossless compressed images.

As indicated in the previous sections, the total number of pixels used to make up an image affects file size. Additionally, the colour depth of each of those pixels has a multiplying effect on the file size. In the example used earlier, an A4 page was digitized at 300 PPI, giving a total of 8 700 867 pixels. Table 3 shows the number of bits that make up this image at varying colour depths and resolutions, and shows approximate file sizes.
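The file-size arithmetic can be sketched as follows. The sketch assumes an A4 page of 8.27 x 11.69 inches (the rounding that reproduces the 8 700 867 pixel figure) and takes one megabyte as 1 048 576 bytes; the function names are hypothetical.

```python
A4_INCHES = (8.27, 11.69)  # assumed page size in inches


def pixel_count(ppi, size_in=A4_INCHES):
    """Total pixels for a page of the given size scanned at `ppi`."""
    width_px = round(size_in[0] * ppi)
    height_px = round(size_in[1] * ppi)
    return width_px * height_px


def uncompressed_megabytes(ppi, bits_per_pixel, size_in=A4_INCHES):
    """Uncompressed size in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    total_bits = pixel_count(ppi, size_in) * bits_per_pixel
    return total_bits / 8 / (1024 * 1024)


print(pixel_count(300))  # 8700867

for bits in (1, 8, 24):
    for ppi in (300, 600):
        size = uncompressed_megabytes(ppi, bits)
        print(f"{bits:>2}-bit at {ppi} PPI: {size:.2f} MB")
```

Doubling the resolution quadruples the pixel count, and the bit depth then multiplies the result, so a 24-bit scan at 600 PPI is roughly 96 times the size of a bitonal scan at 300 PPI.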

Threshold

The threshold defined in bitonal scanning is the point on a scale, usually ranging from 0 to 255, at which grey values are interpreted as either black or white pixels. In bitonal scanning, resolution and threshold are the key determinants of image quality. Bitonal scanning is best suited to high-contrast documents, such as text and line drawings. For continuous-tone or low-contrast documents such as photographs, grey-scale or colour scanning is required. In grey-scale/colour scanning, resolution and bit depth combine to play significant roles in image quality.
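Thresholding can be illustrated with a short Python sketch. The default threshold of 128 and the rule that values at or above the threshold become white are assumptions made for this example; scanner software may use a different convention. The function name `to_bitonal` is hypothetical.

```python
def to_bitonal(grey_rows, threshold=128):
    """Map 0-255 grey values to 1 (white) or 0 (black) around `threshold`."""
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be between 0 and 255")
    return [[1 if value >= threshold else 0 for value in row] for row in grey_rows]


# Hypothetical 2 x 4 patch of grey values: light paper on the left, dark ink on the right.
sample = [
    [250, 240, 90, 30],
    [245, 130, 120, 20],
]

print(to_bitonal(sample))        # [[1, 1, 0, 0], [1, 1, 0, 0]]
print(to_bitonal(sample, 100))   # [[1, 1, 0, 0], [1, 1, 1, 0]]
```

Lowering the threshold from 128 to 100 turns the mid-grey value 120 from black to white, which shows how the threshold setting changes the weight of strokes in a bitonal scan.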

Image Enhancement

The image enhancement process is used to improve scanned images at the cost of image authenticity and fidelity. Image enhancement requires special skills and is time consuming, and it invariably increases the cost of conversion. Typical image enhancement features available in image editing software include filters, tonal reproduction, curves and colour management, touch-up, cropping, image sharpening, contrast adjustment, transparent backgrounds, etc. In a page scanned in grey scale, the text/line art and halftone areas can be separated, and each area of the page can be filtered separately to maximize its quality. For example, the text area of a page can be treated with an edge-sharpening filter to define the character edges clearly, a second filter can be used to remove high-frequency noise, and finally another filter can fill in broken characters. The grey-scale area of the page can be processed with different filters to maximize the quality of the halftone.
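As an illustration of an edge-sharpening filter, the sketch below applies a 3 x 3 convolution kernel to a small grey-scale patch in plain Python. The kernel values are a common textbook choice and are an assumption here, not a filter prescribed by the text; the function name and sample data are hypothetical, and real image editing software uses more elaborate methods.

```python
# Assumed sharpening kernel: boosts the centre pixel and subtracts its four neighbours.
SHARPEN = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def convolve3x3(image, kernel):
    """Apply a 3 x 3 kernel to interior pixels; border pixels are copied unchanged."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
            result[y][x] = max(0, min(255, total))  # clamp to the 0-255 range
    return result


# Hypothetical patch: a soft vertical edge between dark (50) and light (200) areas.
patch = [[50, 50, 200, 200, 200] for _ in range(5)]
sharpened = convolve3x3(patch, SHARPEN)
print(sharpened[2])  # [50, 0, 255, 200, 200]
```

The pixels on either side of the edge are pushed towards black and white, which increases edge contrast, while flat areas away from the edge keep their original values.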

File Formats

The digitally scanned images are stored in a file as a bit-mapped page image. The scanned image can be formatted and tagged in different formats to facilitate easy storage and retrieval, depending upon the scanner and its software. National and international standards for image file formats and compression methods exist to ensure that data will be interchangeable amongst systems. An image file format consists of three distinct components: a header, which stores information on the file identifier and image specifications such as size, resolution and compression protocols; the image data, consisting of a look-up table and the image raster; and lastly a footer, which signals file termination.

File formats encode information into a form which is intended for processing and use by specific combinations of hardware and software. Fortunately, the current technology trends of interoperability and compatibility have led to many file formats being supported on a variety of hardware and software platforms. This trend applies to image file formats, with many image processing and viewing programs available for Windows, UNIX, and Apple computer systems.

The five file formats most commonly used for digitization are:

i. Joint Photographic Experts Group (JPEG) File Interchange Format (JFIF)

This image format is commonly used on the World Wide Web (WWW) and in digital photographic equipment. It is best suited to photographs and complex graphics with continuous tones, where it minimizes file sizes.

Table 3. Uncompressed file sizes for an A4 page digitized at different pixel depths and resolutions

Colour depth            Resolution (PPI)    Total bits       Uncompressed file size (MB)
1 bit bi-tonal          300                 8 700 867        1.04
1 bit bi-tonal          600                 34 803 468       4.15
8 bit grey or colour    300                 69 606 936       8.30
8 bit grey or colour    600                 278 427 744      33.19
24 bit colour           300                 208 820 808      24.89
24 bit colour           600                 835 283 232      99.57

ii. Tagged Image File Format (TIFF)

TIFF (Tagged Image File Format) is the most commonly used file format and is considered the de facto standard for bitonal scanning. TIFF is a truly multi-platform format and is a good choice for scanning projects. Some image formats are proprietary, developed and supported by a commercial vendor, and require specific software or hardware for displaying and printing scanned images. The TIFF format was developed in 1986 by Microsoft and Aldus and is currently maintained by Adobe.

TIFF files are used in desktop publishing, faxing, 3-D applications and medical imaging applications. The sub-formats within the TIFF specification include TIFF CCITT Group 3 and Group 4, which are the most widely used formats in document imaging; most fax transmissions are in TIFF Group 3 format. Other sub-formats of TIFF support grayscale and colour depths of up to 64-bit and offer compression choices.

TIFF 6.0 was launched in 1992. The baseline version of TIFF 6.0 is fully compatible with applications designed to read earlier TIFF images, but a number of additional features were added that require software to be specifically tailored to support the newer version. JPEG compression was included in the TIFF 6.0 specification and, despite a technical revision in 1995 to overcome serious design flaws, problems remain with the use of this lossy compression within TIFF files. The draft TIFF version 7.0 specification, which appeared in 1997, is still to be released and is expected to feature a more stable implementation of JPEG compression amongst other new features.

iii. Graphics Interchange Format (GIF)

Graphics Interchange Format (GIF) is a widely used image format introduced in 1987 by CompuServe. In the early years of the WWW, developers adopted GIF for its efficiency and widespread familiarity. A large proportion of the images on the Web are presented in GIF format, and virtually all Web browsers that support graphics can display GIF files.

The GIF format supports a maximum 256 palettised colours or shades of grey so is most suited to discrete images, such as illustrations, black and white images, logos and line drawings rather than photographs.

iv. Portable Network Graphics (PNG)

Portable Network Graphics (PNG) is a lossless, portable, well-compressed storage format for images. The open-source and patent free PNG format was designed to replace the proprietary GIF format and, to some extent, the much more complex TIFF format. The second edition of PNG is an ISO standard - ISO/IEC 15948:2003 (E). The PNG format was designed specifically for use in online viewing applications such as the WWW, and the format offers a range of attractive features that should eventually make PNG the most common graphic format.

v. Portable Document Format (PDF)

The PDF format was created by Adobe to provide a standard for storing and editing documents.

Portable Document Format (PDF) is a widely used proprietary file format. The PDF format was released in 1993 and is based on the Adobe PostScript printing language.

JPEG and PNG are non-proprietary formats while TIFF and PDF are proprietary formats which have freely available specifications.
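Each of the five formats begins with a characteristic file identifier in its header, which allows software to recognize a file regardless of its extension. The sketch below checks the leading bytes of a file against the well-known signatures of these formats; the function name `sniff_format` and the table layout are hypothetical, and the check covers only these five formats.

```python
# Leading bytes ("magic numbers") that identify each format.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),        # little-endian TIFF
    (b"MM\x00*", "TIFF"),        # big-endian TIFF
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
]


def sniff_format(header):
    """Return the format name for the given leading bytes, or None if unknown."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(sniff_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # PNG
print(sniff_format(b"%PDF-1.4\n"))                           # PDF
print(sniff_format(b"plain text"))                           # None
```

In practice the first few bytes would be read from a file opened in binary mode and passed to the function.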

CONSIDERATIONS FOR DIGITIZING

From the document Developing Sustainable Digital Libraries (pages 85-90)