Classify-Rule Principles
- When choosing a template document…
- Use images that are at least 300 DPI.
- Use the cleanest and least-rotated/skewed page. Unscanned digital originals (i.e. near perfect) are acceptable for building rules.
- Select a sample that is the most data-complete, as this will affect index rule quality.
- Data-complete: containing the largest amount of potential information, i.e. the greatest number of individual words or numbers that could appear in a phrase or line.
- Selecting a document with missing information (or with very little data in a field compared to other documents) will lead to inaccurate index field results.
- For color images that might be scanned in color or black-and-white, use the color version when building rules.
- Only use text that…
- …is relatively large, clear and easily readable.
- …doesn’t change (with the exception of barcodes).
- …doesn’t move around the document.
- …is distinct and doesn’t repeat somewhere else on the document. Creating rules for repeating information can have a negative effect on index-field results.
- Always start with the simplest approach that you expect to work before attempting complex configurations.
- Build rules for items with larger font sizes before those with smaller font sizes.
- Don’t use graphic data like logos or pictures.
- Include at least one item from each extremity of the page. This affects index-rule result quality.
- More rules are usually better than fewer.
- You don’t necessarily have to draw a classify rule on the label for an index field, but it can help with index-rule result quality.
- Try to avoid building classify rules for text that repeats on the same page. Doing so can have a negative effect on index-field results.
- If necessary, use regex to weed out false-positive results (e.g. “Page 2” being classified).
- Inverted text (e.g. white text on a black background) should pose no issue as long as the contrast between the two is relatively large and consistent. For example, medium-gray text on a dark-gray background is not preferred.
- Building multiple classify rules for sub-phrases within a phrase may provide better results than building a single rule for the entire phrase. These sub-phrases should follow the aforementioned rules. Use this technique only if you have determined that there is otherwise-uncontrollable shift occurring with index fields.
- An example of this would be splitting a single rule for the phrase “This page intentionally left blank.” into two rules with the sub-phrases “This page intentionally” and “left blank.”
Index-Rule Principles
- Draw a rectangle around the entire field (i.e. don’t try to split/parse the field with the rectangle); use Regex and/or Output Translation to clean up unwanted text.
- Draw the rectangle large enough that it accounts for additional data that is not on the current document, but may be on others.
- Leave only a little extra room around the field data; it’s not usually necessary to pad, except for fields using handprint recognition.
- If possible, only capture data that…
- …doesn’t move around the page (Presence on Page rules may be a better fit for these).
- …is near other classify and/or index rules.
- If the data exists in multiple places on a document, use the location that is the most fixed and distinct.
- Only use Presence on Page when necessary and when the field has a distinct pattern suitable for a regular expression.
- Handprint recognition will recognize only printed characters, not script or cursive characters.
- Rules that engage handprint recognition do not make use of our homographic-transform / perspective-correction algorithm.
Example Regular Expressions – Use only to extract contiguous text from a string.
Text | Regex | Result |
My SSN is 123-45-6789. | \d{3}-\d{2}-\d{4} | 123-45-6789 |
$1,234,567.00 | [\d,]+ | 1,234,567 |
Order ABC-123 placed successfully. | [A-Z]{3}-\d{3} | ABC-123 |
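
The extraction behavior in this table can be checked outside the product with a short script. This is only a sketch: it assumes Python's `re` module as a stand-in for the product's regex engine (which may differ in details), writes the quantifiers in the `{n}` form, and the helper name `first_match` is hypothetical.

```python
import re

# (text, pattern) pairs modeled on the table rows.
# Assumption: Python's re module stands in for the product's regex engine.
EXAMPLES = [
    ("My SSN is 123-45-6789.", r"\d{3}-\d{2}-\d{4}"),
    ("$1,234,567.00", r"[\d,]+"),
    ("Order ABC-123 placed successfully.", r"[A-Z]{3}-\d{3}"),
]


def first_match(text, pattern):
    """Return the first contiguous match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


results = [first_match(text, pattern) for text, pattern in EXAMPLES]

for (text, pattern), result in zip(EXAMPLES, results):
    print(f"{text!r} | {pattern} | {result}")
```

Each pattern returns one unbroken run of characters, which is why regex alone cannot, for example, drop the commas from the middle of the dollar amount; that kind of cleanup belongs to Output Translation.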
Example Output Translations – Use only to remove, add, replace, and/or restructure data.
Type | Text | Input Regex | Replacement | Result |
Remove | 123-45-6789 | - | (empty) | 123456789 |
Add | 1234567 | (\d)(?=(\d{3})+$) | $1, | 1,234,567 |
Replace | 863.555.1212 | \. | - | 863-555-1212 |
Restructure | John Smith | (\w+)(\s)(\w+) | $3, $1 | Smith, John |
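
The four translation types can be tried out the same way. Again a sketch under assumptions: Python's `re.sub` stands in for the product's Output Translation step, the helper name `translate` is hypothetical, and Python writes group references as `\1` where the table uses `$1`. The literal period is escaped as `\.` because an unescaped `.` matches any character.

```python
import re


def translate(text, pattern, replacement):
    """Replace every match of pattern in text with replacement."""
    return re.sub(pattern, replacement, text)


# Remove: delete every hyphen (empty replacement).
removed = translate("123-45-6789", r"-", "")

# Add: insert a comma after each digit that is followed by a whole
# number of three-digit groups up to the end of the string.
added = translate("1234567", r"(\d)(?=(\d{3})+$)", r"\1,")

# Replace: swap each literal period for a hyphen.
replaced = translate("863.555.1212", r"\.", "-")

# Restructure: reorder "First Last" into "Last, First".
restructured = translate("John Smith", r"(\w+)(\s)(\w+)", r"\3, \1")

print(removed, added, replaced, restructured, sep="\n")
```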