Document imaging and processing go together. Scanning, the first step of document imaging, only records how much light the scanned document reflects. The scanner’s software then processes this raw data and saves it in a standard graphic format such as JPEG.
The resulting image is often imperfect. It might have punch-hole marks, black borders, distorted characters, and so on. Image cleaning software takes this less-than-perfect image and improves its quality. The result can be an image that is cleaner than the original paper document.
Image clean-up software can straighten skewed images, convert white text on a black background into black text on a white background, and so on. Users can control the clean-up process so that the result comes out the way they want.
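As a rough illustration of one such clean-up step, the sketch below converts white-on-black text into black-on-white. It assumes the image is already available as rows of 8-bit grayscale values (0 is black, 255 is white). The function name normalize_polarity and the threshold of 128 are illustrative choices, not part of any particular product.

```python
def normalize_polarity(pixels, threshold=128):
    """Return a copy of a grayscale image with dark text on a light background.

    pixels is a list of rows, each row a list of values from 0 (black)
    to 255 (white). The threshold is an assumed cut-off for "dark".
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        return [list(row) for row in pixels]

    # Count pixels that are darker than the threshold.
    dark = sum(1 for row in pixels for value in row if value < threshold)

    # If more than half of the page is dark, assume white text on a
    # black background and invert every pixel.
    if dark * 2 > total:
        return [[255 - value for value in row] for row in pixels]

    # Otherwise the page already has a light background; copy it unchanged.
    return [list(row) for row in pixels]
```

For a mostly dark page such as [[0, 0, 255], [0, 0, 0]], the function returns the inverted rows; a mostly light page is returned unchanged. Real clean-up tools combine many such steps, including deskewing and border removal, and let the user decide which ones to apply.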
Character Recognition Processing
Text in graphic format, while readable by humans, is not computer-readable. Only when the text characters are made computer-readable can the computer edit or index the document image, and both capabilities matter.
Workflows usually require some editing of the original document image, for example adding comments or removing personal content, before the document is forwarded to the next step.
Indexing is essential for making the document searchable (and retrievable). The indexing process will be explained in more detail in the next section.
Because both editing and indexing depend on it, making the text characters computer-readable is an essential step.
Character recognition technologies such as OCR (Optical Character Recognition) and ICR (Intelligent Character Recognition) do this processing, converting the characters in the image into machine-readable text in ASCII or a similar character encoding.
With character recognition, quality can be a problem. OCR might confuse similar-looking characters, such as the letter O and the digit 0, and misread what it sees. Sophisticated processing algorithms can reduce such errors, although they cannot eliminate them entirely. There are even programs that can recognize handwriting and convert it into machine-readable text characters.
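One simple way to reduce such confusions is to use knowledge about the field being read. The sketch below assumes a field that is known to contain only digits, such as an invoice number, and maps commonly confused letters back to digits. The lookalike table and the function name clean_numeric_field are illustrative assumptions; real recognition engines use far richer context.

```python
# Letters that OCR commonly confuses with digits (an assumed, partial table).
DIGIT_LOOKALIKES = str.maketrans({
    "O": "0",
    "o": "0",
    "l": "1",
    "I": "1",
    "S": "5",
    "B": "8",
})


def clean_numeric_field(text):
    """Replace lookalike letters with digits in a field expected to be numeric."""
    return text.translate(DIGIT_LOOKALIKES)


def is_plausible_number(text):
    """Report whether the cleaned field now consists of digits only."""
    return clean_numeric_field(text).isdigit()
```

Applied to the misread value "2O1S", clean_numeric_field returns "2015". This correction is only safe for fields known to be numeric; applying it to ordinary text would corrupt legitimate letters.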
Making the Documents Retrievable
Indexing programs process the documents to link identifying words to each document. Once this is done, a document can be retrieved from among the millions of documents in the computer repository by searching for its linked words.
Indexing can be based on all the words in the document content or on a few distinctive tags that identify the nature of the content. Full-text indexing takes up more storage space, but makes the entire content of the documents searchable. For tag-based indexing, the document creators or others supply distinctive tags that describe the document content.
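The difference between the two approaches can be sketched with a small inverted index, a structure that maps each word to the set of documents linked to it. This is a minimal illustration, not how any specific content management system works; the document IDs, sample texts, and function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts_by_doc):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in texts_by_doc.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents linked to every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Full-text indexing: every word of the recognized content is indexed.
contents = {
    "inv-1": "Invoice for office chairs",
    "memo-7": "Memo about office relocation",
}
full_text_index = build_index(contents)

# Tag-based indexing: only a few descriptive tags per document are indexed.
tags = {"inv-1": "invoice furniture", "memo-7": "memo facilities"}
tag_index = build_index(tags)
```

Here search(full_text_index, "office") finds both documents, while search(tag_index, "office") finds none because no tag mentions it. The tag index is much smaller, but it can only answer queries that the tags anticipate.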
Alongside manual entry of index tags, zonal OCR technology can help automate this process. Specific zones are defined for each document type, and the text extracted from these zones is added to the document’s index tags automatically. This reduces the chance of human error compared with traditional manual data entry of the tags.
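The zonal approach can be sketched as follows, assuming the OCR step has already produced a list of recognized words with their positions on the page. The Word record, the INVOICE_ZONES coordinates, and the function name extract_zone_tags are all hypothetical; actual products define zones and return OCR results in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A recognized word and its bounding box in pixels (assumed OCR output)."""
    text: str
    x: int
    y: int
    width: int
    height: int


# Hypothetical zones for one document type: (left, top, right, bottom).
INVOICE_ZONES = {
    "invoice_number": (400, 20, 600, 60),
    "customer": (20, 100, 300, 140),
}


def extract_zone_tags(words, zones):
    """Collect the text whose word centers fall inside each named zone."""
    tags = {}
    for name, (left, top, right, bottom) in zones.items():
        inside = [
            w for w in words
            if left <= w.x + w.width / 2 <= right
            and top <= w.y + w.height / 2 <= bottom
        ]
        # Simplification: order by top edge, then left edge. Words on the
        # same line with slightly different y values would need line grouping.
        inside.sort(key=lambda w: (w.y, w.x))
        tags[name] = " ".join(w.text for w in inside)
    return tags


page_words = [
    Word("INV-1042", 420, 30, 100, 20),
    Word("Acme", 30, 110, 60, 20),
    Word("Corp", 100, 110, 60, 20),
    Word("Total", 30, 500, 60, 20),
]
index_tags = extract_zone_tags(page_words, INVOICE_ZONES)
```

With the sample words above, index_tags maps "invoice_number" to "INV-1042" and "customer" to "Acme Corp", while the word "Total" is ignored because it lies outside both zones. The extracted values can then be stored as index tags without anyone retyping them.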
Conclusion
Different kinds of document imaging processes move paper documents into the enterprise content management system by creating digital images, converting image text into machine-readable characters, indexing the documents, and making them retrievable from among the millions of documents in the enterprise’s content repositories.