Documents are created by capturing data through methods such as transactional data entry and the scanning of paper documents. In today’s business environment, most documents are electronic, because electronic documents can substantially improve workflow and processing. There are many data-capture methods, and many of them can be automated.
- Transactional data entry is the most obvious method of data capture. Relevant details of business transactions are entered by human operators using a computer application such as accounting software. Standard data-entry forms ensure uniform capture of information regarding each transaction.
- Scanning paper documents is another common method of data capture. Scanned images are processed to make the text in the images machine-readable, using techniques such as OCR (optical character recognition) and ICR (intelligent character recognition, used for poorer-quality sources such as handwriting).
- Depending on the volume of paper documents to be scanned, individual or batch scanning might be used. Batch scanning feeds multiple documents through the scanner automatically with preconfigured scanner settings, and is typically used for high-volume daily scanning or for back-file scanning.
- When batch scanning is used, the documents may need to be prepared in advance. Preparation typically involves removing staples and paper clips, arranging for duplex scanning where both sides of a document have to be captured, and configuring the scanner for the paper sizes and resolution required.
- Scanned documents need to be indexed so that they can be retrieved quickly later. Indexing can involve using the unique ID or title of the document, keywords that describe the content of the document, and/or the full text of the document. Full-text indexing lets a user find a document by searching for one or more words or phrases that it contains.
- Indexing can be facilitated by using barcodes, zonal OCR, and/or database lookups. These automated indexing methods can be found in document imaging and document management software. If these options are not available, manual data entry will usually be required to uniquely identify each document. Consistency of manual index data can be ensured by using drop-down lists of standard document names or descriptions.
- Where scanned documents are not uniform, such as single- and multi-page invoices, correspondence, attendance sheets, and so on, they should be sorted in advance to enable automatic categorization by the software. Category changes are indicated with separator sheets, which may carry barcodes describing the category.
- OCR is not fully reliable, particularly where the quality of the source document is poor. An OCR quality-control procedure might be required to manually check the results and make any necessary corrections. Keyword indexing alongside full-text indexing can often compensate for words or phrases that the OCR process missed.
- Data capture can also involve data conversion from one format to another. Conversion becomes necessary, for example, when data created with legacy systems has to be moved into modern formats before the legacy applications are replaced.
- Data captured in formats such as XML can remain usable over the long term. An XML file is plain text and can reference the schema it conforms to, so the data can be interpreted correctly by reference to that schema. This can reduce the common IT problem of data becoming unreadable as technology progresses.
- Captured data can be stored on several kinds of media: magnetic tape; hard disks; optical media, such as CD/DVD; solid-state devices, such as USB flash drives; microfilm; and microfiche. Each has its own advantages and disadvantages in terms of cost, durability of the data, ease of use, and portability. Archival data, for example, is best stored on WORM (write once, read many) media, which keep the original data unchanged.
- Data entry and OCR quality control are labor-intensive processes, which makes them common candidates for outsourcing. When outsourcing this work, care should be taken to review the practices used to ensure the validity of the data. For example, in double-keying, two operators enter the data from each document, and the entry is accepted only if both operators entered the same data.
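The double-keying check described in the last item can be sketched in a few lines of Python. This is an illustration only, not a description of any particular vendor's process: the function name `double_key`, the `KeyingResult` type, and the decision to ignore differences in surrounding whitespace are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class KeyingResult:
    accepted: dict     # field -> value on which both operators agreed
    mismatches: dict   # field -> (first operator's entry, second operator's entry)


def _clean(value: str) -> str:
    """Collapse runs of whitespace (assumption: spacing differences are not errors)."""
    return " ".join(value.split())


def double_key(first: dict, second: dict) -> KeyingResult:
    """Compare two independent entries of the same document, field by field.

    A field is accepted only when both operators supplied it and the cleaned
    values are identical; everything else is returned for manual review.
    """
    accepted, mismatches = {}, {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a is not None and b is not None and _clean(a) == _clean(b):
            accepted[name] = _clean(a)
        else:
            mismatches[name] = (a, b)
    return KeyingResult(accepted, mismatches)
```

A document would be released for further processing only when `mismatches` is empty; otherwise the disputed fields go back to a reviewer who consults the source image.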
Data capture initiates the process of document management, including administrative use of the information that the documents provide. Because data capture is labor-intensive, implementations need to adopt systems and practices that help reduce costs and increase accuracy.
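As one example of such a practice, the combination of keyword indexing and full-text indexing mentioned above can be sketched as a small inverted index. This is a minimal illustration under stated assumptions: the `DocumentIndex` class and its methods are hypothetical names, the tokenizer is deliberately naive, and real document management software will do considerably more.

```python
import re
from collections import defaultdict


class DocumentIndex:
    """Toy inverted index that combines OCR full text with manually entered keywords."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document IDs

    @staticmethod
    def _terms(text):
        # Naive tokenizer (assumption): lowercase runs of ASCII letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, ocr_text, keywords=()):
        """Index a document under every OCR term and every keyword term."""
        for term in self._terms(ocr_text):
            self._postings[term].add(doc_id)
        for keyword in keywords:
            for term in self._terms(keyword):
                self._postings[term].add(doc_id)

    def search(self, query):
        """Return the IDs of documents that contain every term in the query."""
        terms = self._terms(query)
        if not terms:
            return set()
        matches = [self._postings.get(term, set()) for term in terms]
        return set.intersection(*matches)
```

If OCR misreads "Invoice" as "lnvoice", a full-text search for "invoice" would miss the document, but a keyword such as "invoice" supplied at indexing time still makes it retrievable.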