Capturing Content in Enterprise Content Management

Content capture is not a simple task when content is generated on an enterprise-wide scale. Enterprise Content Management (ECM) software seeks to capture not only structured data created by applications but also the vast amounts of unstructured data in such forms as word-processed documents, spreadsheets, presentations, drawings and other graphic images, e-mails, audio/video material, and even VoIP messages.

Even structured data created by different applications might not be compatible, and may have to be converted into a uniform structure before it’s stored in the ECM content repository.

A typical way to handle such varied data formats is to use XML-based tools, which provide facilities for describing documents and their fields.
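As a rough illustration of describing a document and its fields in XML, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element names ("document", "field"), the attribute names, and the sample invoice values are assumptions made for this example; they do not come from any particular ECM product or schema.

```python
import xml.etree.ElementTree as ET


def describe_document(doc_type, fields):
    """Build an XML description of a document and its fields.

    The "document"/"field" vocabulary is hypothetical, chosen only
    to illustrate the idea of a uniform, self-describing structure.
    """
    root = ET.Element("document", attrib={"type": doc_type})
    for name, value in fields.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = str(value)
    return root


def read_fields(root):
    """Recover the field names and values from an XML description."""
    return {f.get("name"): f.text for f in root.findall("field")}


# Sample data invented for the example.
invoice = describe_document("invoice", {"number": "INV-001", "total": "125.50"})
xml_text = ET.tostring(invoice, encoding="unicode")
print(xml_text)
```

Because the description travels with the data, content from different applications can be parsed back into the same field structure (here with read_fields) before it is stored in the repository.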

Capturing Metadata

Enterprise Content Management seeks to enable users to search, query, and analyze the content stored in its repository. Given the huge volume and variety of data stored there, making that content searchable becomes a significant problem.

Searchability can be improved by attaching metadata to each piece of content. Metadata include tags and descriptions that identify what the content is about.
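To make the idea concrete, the following sketch attaches tags and a description to each content item and then searches by tag. The ContentItem class, the find_by_tags function, and the sample repository are all hypothetical names and data invented for this example, not part of any ECM product.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """A piece of content plus its metadata (hypothetical structure)."""
    content_id: str
    description: str
    tags: set = field(default_factory=set)


def find_by_tags(items, required_tags):
    """Return the items that carry every tag in required_tags (case-insensitive)."""
    required = {t.lower() for t in required_tags}
    return [item for item in items if required <= {t.lower() for t in item.tags}]


# Sample repository invented for the example.
repository = [
    ContentItem("doc-1", "Q3 supplier invoice", {"invoice", "finance", "2023"}),
    ContentItem("doc-2", "Branch office floor plan", {"drawing", "facilities"}),
    ContentItem("doc-3", "Q3 expense report", {"finance", "report", "2023"}),
]

matches = find_by_tags(repository, ["finance", "2023"])
print([m.content_id for m in matches])
```

A search by tag does not need to open or interpret the underlying files, which is what makes metadata useful across very different content formats.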

Metadata need to serve different users searching the data with different analytical objectives.

Capture tools that can automatically capture metadata thus become high-value tools in an ECM environment.

Capturing Content Originating at Numerous Points

A large enterprise generates content in different formats from numerous locations. The locations might be different departments and functions or branch offices possibly spread all over the world.

Distributed capture technologies not only save on costs, e.g., copying and shipping costs, but also make the content accessible to users much faster. In the case of paper documents, immediate conversion to electronic formats at the point of origin streamlines processes and improves security.

Lower costs, faster access, and streamlined processes lead to improved business results. Also, in case of a natural or other kind of disaster, electronic data can be recovered through disaster-recovery procedures (provided the required procedures have been planned and implemented).

Capture Technologies

Scanning paper documents into electronic images and converting the images into machine-readable electronic documents is a key element of content capture in the ECM environment.

If the conversion is done immediately after the documents are created, relevant content becomes available on an enterprise-wide scale much faster. Content generated at some far-away branch office thus becomes immediately available to headquarters and other offices.

Conversion into machine-readable format is accomplished using technologies such as Optical Character Recognition (OCR). OCR comes in different flavors. Enhanced versions (likely to be more expensive) can work with odd-sized, double-sided, and even damaged paper documents. Automatic language detection is another high-value capability in a global enterprise.

Reliability is important for OCR because similar-looking characters (for example, the letter O and the digit 0) can be misrecognized, leaving the converted content unintelligible.
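One simple safeguard, sketched below, is to post-check fields whose expected format is known. This example assumes a field that should contain only digits and substitutes a few commonly confused look-alike characters; the substitution table and the function name repair_numeric_field are illustrative assumptions, and real OCR engines have their own error patterns and correction mechanisms.

```python
# Hypothetical table of letters that are often mistaken for digits.
CONFUSABLE_TO_DIGIT = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def repair_numeric_field(raw):
    """Try to repair OCR output for a field expected to hold only digits.

    Returns the repaired string, or None when the text still contains
    non-digit characters and should be sent for manual review.
    """
    candidate = raw.strip().translate(CONFUSABLE_TO_DIGIT)
    if candidate.isascii() and candidate.isdigit():
        return candidate
    return None


print(repair_numeric_field("1O45l"))
print(repair_numeric_field("12A4"))
```

A substitution like this is only safe when the field's format is known in advance; applied to free text it would corrupt legitimate letters, so anything that cannot be repaired is better flagged than guessed.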

Better capture tools and technologies are appearing in the market. For example, there are mail-room processors that can convert thousands of envelopes and their contents into scanned and text-recognized images with just one operator. Capabilities such as extracting documents from envelopes, detecting duplicate documents (e.g., duplicate invoices), and automated sorting and scanning are available in more expensive models.
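How commercial products detect duplicates is not described here, but the basic idea can be sketched by fingerprinting the recognized text of each document. The example below, which uses Python's standard hashlib module, assumes that duplicates differ only in letter case and whitespace; the function names and the sample documents are invented for illustration.

```python
import hashlib


def fingerprint(text):
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents):
    """Return (duplicate_id, original_id) pairs for a dict of id -> text."""
    seen = {}
    duplicates = []
    for doc_id, text in documents.items():
        key = fingerprint(text)
        if key in seen:
            duplicates.append((doc_id, seen[key]))
        else:
            seen[key] = doc_id
    return duplicates


# Sample recognized text invented for the example.
scanned = {
    "scan-a": "Invoice 42   Total 100",
    "scan-b": "invoice 42 total 100",
    "scan-c": "Invoice 43 Total 100",
}
print(find_duplicates(scanned))
```

An exact fingerprint like this misses near-duplicates, such as a rescanned invoice with a single misread character, so it should be read as a starting point rather than a complete solution.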

Increasingly, scanning facilities and post-scan image processing (and even content management) come bundled together. Such bundling can provide a high ROI through savings in labor and paper-handling costs.


In an Enterprise Content Management environment, capturing content is not a simple task. The capture functionality must deal with issues such as huge volumes, varying document formats, numerous originating points for the content, different languages, and documents that are difficult to scan or read.
