The speed with which search engines pick out the documents matching your query, from among billions of documents, might amaze you. The engines don’t do this by going through every one of these billions of documents. Instead, they look up the words you entered in an index that has already been built, and display the documents that contain these words.
The type of index that lists words and then the documents against each word is known as an inverted list index. It is ‘inverted’ because straightforward indexing would have listed the documents first, and then the words in each document. However, such forward indexing would make it immensely more difficult to list all the documents that contain your search words.
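The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the document ids, the sample words, and the function names invert and lookup are assumptions made for the example, not part of any particular search engine.

```python
from collections import defaultdict

# Hypothetical forward index: document id -> words found in that document.
forward_index = {
    "doc1": ["search", "engine", "index"],
    "doc2": ["document", "index"],
    "doc3": ["search", "query"],
}


def invert(forward):
    """Build an inverted index: word -> set of ids of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return dict(inverted)


def lookup(inverted, query_words):
    """Return the ids of documents that contain every query word."""
    postings = [inverted.get(word, set()) for word in query_words]
    return set.intersection(*postings) if postings else set()


inverted_index = invert(forward_index)
print(sorted(lookup(inverted_index, ["search", "index"])))  # ['doc1']
```

With the forward index alone, answering the same query would mean scanning the word list of every document; with the inverted index it takes one dictionary lookup per query word followed by a set intersection.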
As the search-engine example shows, the purpose of indexing is to facilitate subsequent retrieval. Without an efficient index, it would be practically impossible to find what you want in a mass of information as large as the Web.
The primary search engines typically do ‘full-text’ indexing, i.e. they go through all the words in each document and index the document against all these words (except perhaps common words like ‘the’).
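A minimal sketch of full-text indexing might look like the following. The stop-word list, the tokenizing rule, and the sample documents are assumptions chosen for illustration; real engines use their own, far more elaborate, versions of each.

```python
import re
from collections import defaultdict

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_full_text_index(documents):
    """Index each document against every word in it, skipping stop words."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return dict(index)


# Hypothetical documents keyed by a numeric id.
documents = {
    1: "The speed of search engines",
    2: "Indexing the documents in a repository",
}
full_text_index = build_full_text_index(documents)
print(sorted(full_text_index))
```

Every remaining word becomes a key in the index, which is why a document can later be retrieved by any word in its content.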
Full-text indexing is not the only kind of indexing. Documents can be indexed by the metadata (data about the content of the document) attached to each document. For example, documents can be tagged with their main topic, date of creation, author, etc. at the time they are created, and then indexed on such metadata.
Full-text index files need a great deal of space, whereas indexing on metadata can reduce the space requirements dramatically. However, metadata indexing requires the searchers to have some idea about the kind of metadata that can bring up the document they are looking for, such as the date of its creation or a unique identifier.
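To illustrate the trade-off, here is a hedged sketch of a metadata index. The Metadata record, its fields, the sample identifiers, and the helper names build_metadata_index and find are all hypothetical. The point is only that the index holds a handful of entries per document rather than one per word, and that the searcher must know a field value to query on.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Metadata:
    """Hypothetical metadata attached to a document when it is created."""
    doc_id: str
    topic: str
    author: str
    created: date


def build_metadata_index(records, fields=("topic", "author", "created")):
    """Map each (field, value) pair to the ids of documents carrying it."""
    index = defaultdict(set)
    for record in records:
        for name in fields:
            index[(name, getattr(record, name))].add(record.doc_id)
    return dict(index)


def find(index, **criteria):
    """Return the ids of documents matching every field=value criterion."""
    matches = [index.get(item, set()) for item in criteria.items()]
    return set.intersection(*matches) if matches else set()


records = [
    Metadata("INV-001", "invoices", "alice", date(2020, 1, 15)),
    Metadata("RPT-002", "market research", "bob", date(2020, 1, 15)),
    Metadata("RPT-003", "market research", "alice", date(2021, 6, 1)),
]
metadata_index = build_metadata_index(records)
print(sorted(find(metadata_index, topic="market research", author="alice")))  # ['RPT-003']
```

A search by a word that appears only in the body of a document would find nothing here, which is the limitation described above.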
Indexing is particularly important for unstructured data, such as e-mails, market research reports, correspondence, etc. Structured data, such as the sales invoices of a business, are stored in databases that have their own retrieval mechanisms.
Document management systems might be able to extract metadata automatically from documents. Such systems might also make it possible to create metadata in a standardized and structured fashion, say by means of a barcode that contains certain standard details.
Document retrieval against search queries often becomes difficult because the searcher might not know exactly how to specify the search query. In such a case, broad search queries might be used and these return numerous documents. It then becomes necessary to rank the documents in an order of apparent relevance to the query.
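One very simple way to order results, sketched below, is to count how often the query words occur in each document. This is an assumed scoring rule used purely for illustration; it is not a description of how any real search engine ranks results, and the sample documents and the function name rank are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Order document ids by how often the query words occur in them."""
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[word] for word in query_words)
        if score > 0:  # drop documents that contain none of the query words
            scores[doc_id] = score
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "a": "index of documents",
    "b": "the index lists words and the index lists documents",
    "c": "optical character recognition",
}
print(rank(documents, "index documents"))  # ['b', 'a']
```

Because the score depends only on word counts, a page that merely repeats the query words would rank highly, which is the weakness that more careful relevance measures try to address.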
Relevance-based ranking can become inaccurate if the documents are not structured in a manner that enables accurate identification of their topic. This is what happens when webmasters stuff keyword phrases into their pages. To cope with such problems, semantic indexing looks at a number of words in the document to identify its topic.
Scientific documents might be indexed using the scientific notations with their unique syntax, instead of the natural-language content of the documents. This can improve the relevance of search results.
Document management systems often work with scanned document images. Indexing programs cannot read the text in such images directly. The imaged documents must first be processed by Optical Character Recognition (OCR) programs that convert the image into machine-readable text. The accuracy of the OCR programs then determines the quality of the index.
In addition to indexing, electronic document management systems typically use well-organized directories to help speed up retrieval of desired documents. Users browse to the relevant directory and retrieve the right document instead of having to go through a list of documents brought up by search queries.
Document indexing is a powerful technique to aid subsequent retrieval of documents from repositories that contain thousands of documents. Documents might be indexed by their full-text content (so that they can be retrieved by any word in the content) or by metadata attached to the document such as a unique identifier, date of creation, or the main topic of the document.