Classify-Rule Principles

  • When choosing a template document…
    • Use images that are at least 300 DPI.
    • Use the cleanest and least-rotated/skewed page. Unscanned digital originals (i.e. near perfect) are acceptable for building rules.
    • Select a sample that is the most data-complete, as this will affect index rule quality.
      • Data-complete: containing the largest amount of potential information, i.e. the greatest number of individual words or numbers that could exist in a phrase or line.
      • Selecting a document with missing information (or with very little data in a field compared to other documents) will lead to inaccurate index field results.
    • For documents that might be scanned in either color or black-and-white, use the color version when building rules.
  • Only use text that…
    • …is relatively large, clear and easily readable.
    • …doesn’t change (with the exception of barcodes).
    • …doesn’t move around the document.
    • …is distinct and doesn’t repeat somewhere else on the document. Creating rules for repeating information can have a negative effect on index-field results.
  • Always start with an approach that you feel should work before attempting complex configurations.
  • Build rules for items with larger font sizes before those with smaller font sizes.
  • Don’t use graphic data like logos or pictures.
  • Include at least one item from each extremity of the page. This affects index-rule result quality.
  • More rules are usually better than fewer.
  • You don’t necessarily have to draw a classify rule on the label for an index field, but it can help with index-rule result quality.
  • Try to avoid building classify rules for text that repeats on the same page. Doing so can have a negative effect on index-field results.
  • If necessary, use regex to weed out false-positive results (e.g. “Page 2” being classified).
  • Inverted text (e.g. white text on a black background) should pose no issue as long as the contrast between the two is relatively large and consistent. In other words, medium-gray text on a dark-gray background is not preferred.
  • Building multiple classify rules for sub-phrases within a phrase may provide better results than building a single rule for the entire phrase. These sub-phrases should follow the aforementioned rules. Use this technique only if you have determined that index fields are shifting in a way you cannot otherwise control.
    • An example of this would be splitting a single rule for the phrase “This page intentionally left blank.” into two rules with the sub-phrases “This page intentionally” and “left blank.”
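
The regex tip above can be illustrated outside the product. This is a minimal sketch in Python's `re` module, assuming a hypothetical classify zone drawn over a page-number footer; the pattern, the function name `is_first_page`, and the sample strings are illustrative and not taken from CapturePoint. CapturePoint's own regex flavor may differ in details.

```python
import re

# Hypothetical rule: the zone should classify only the first page's footer.
# Requiring the whole string to match keeps "Page 2" and "Page 10" out.
FIRST_PAGE = re.compile(r"Page 1")

def is_first_page(zone_text: str) -> bool:
    """Return True only when the recognized text is exactly 'Page 1'."""
    return FIRST_PAGE.fullmatch(zone_text.strip()) is not None

samples = ["Page 1", "Page 2", "Page 10", " Page 1 "]
for text in samples:
    print(repr(text), is_first_page(text))
```

Matching the whole string matters here: an unanchored search for `Page 1` would also accept "Page 10".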

Index-Rule Principles

  • Draw a rectangle around the entire field (i.e. don’t try to split/parse the field with the rectangle); use Regex and/or Output Translation to clean up unwanted text.
  • Draw the rectangle large enough that it accounts for additional data that is not on the current document, but may be on others.
    • Leave only a little extra room around the field data; it’s not usually necessary to pad, except for fields using handprint recognition.
  • If possible, only capture data that…
    • …doesn’t move around the page (Presence on Page rules may be a better fit for data that does move).
    • …is near other classify and/or index rules.
  • If the data exists in multiple places on a document, use the location that is the most fixed and distinct.
  • Only use Presence on Page when necessary and when the field has a distinct pattern suitable for a regular expression.
  • Handprint recognition recognizes only hand-printed (block) characters, not script or cursive characters.
  • Rules that engage handprint recognition do not make use of our homographic-transform / perspective-correction algorithm.
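
The "capture the whole field, then clean it up" principle can be sketched as a two-step pipeline. This is an illustration in Python's `re` module, not CapturePoint's implementation; the helper `clean_field`, the invoice-number format, and the sample text are all hypothetical.

```python
import re

def clean_field(raw_text: str, extract: str, find: str = "", replace: str = "") -> str:
    """Extract the first match of `extract`, then optionally apply a find/replace."""
    match = re.search(extract, raw_text)
    if match is None:
        return ""  # nothing in the zone looked like the expected field
    value = match.group(0)
    if find:
        value = re.sub(find, replace, value)
    return value

# Hypothetical recognized text from a rectangle drawn around the entire
# field, label and trailing whitespace included.
raw = "Invoice No: INV-2024-00123  "

# Step 1 (regex): keep only the contiguous invoice number.
# Step 2 (output translation): strip the hyphens.
invoice = clean_field(raw, r"INV-\d{4}-\d+", find=r"-", replace="")
print(invoice)
```

The rectangle stays generous and the pattern does the trimming, so a longer value on another document still falls inside the zone.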

Example Regular Expressions – Use only to extract contiguous text from a string.

Text                               | Regex             | Result
-----------------------------------|-------------------|------------
My SSN is 123-45-6789.             | \d{3}-\d{2}-\d{4} | 123-45-6789
$1,234,567.00                      | [\d,]+            | 1,234,567
Order ABC-123 placed successfully. | [A-Z]{3}-\d{3}    | ABC-123
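
To try these extractions outside the product, the following sketch runs the three sample texts through Python's `re` module with the quantifiers written out explicitly (`\d{3}`, `[A-Z]{3}`). The helper `first_match` is a hypothetical name, and Python's regex flavor is only assumed to behave like CapturePoint's for these simple patterns.

```python
import re

# (text, pattern) pairs mirroring the three examples in this section.
examples = [
    ("My SSN is 123-45-6789.", r"\d{3}-\d{2}-\d{4}"),
    ("$1,234,567.00", r"[\d,]+"),
    ("Order ABC-123 placed successfully.", r"[A-Z]{3}-\d{3}"),
]

def first_match(text: str, pattern: str):
    """Return the first contiguous match of `pattern` in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = [first_match(text, pattern) for text, pattern in examples]
for (text, pattern), result in zip(examples, results):
    print(f"{text!r} / {pattern} -> {result!r}")
```

Each pattern returns a single contiguous run of characters, which is why `[\d,]+` stops at the decimal point in the second example.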

 

Example Output Translations – Use only to remove, add, replace, and/or restructure data.

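Type        | Text Input   | Regex             | Replacement | Result
------------|--------------|-------------------|-------------|-------------
Remove      | 123-45-6789  | -                 | (empty)     | 123456789
Add         | 1234567      | (\d)(?=(\d{3})+$) | $1,         | 1,234,567
Replace     | 863.555.1212 | \.                | -           | 863-555-1212
Restructure | John Smith   | (\w+)(\s)(\w+)    | $3, $1      | Smith, John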
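
The four translation types can be reproduced with Python's `re.sub` as a rough check. Two assumptions are made here: the "Remove" example is taken to use `-` as its pattern with an empty replacement, and the literal dot in the "Replace" example is escaped as `\.` so that it does not match every character. Python writes group references as `\1` where this section writes `$1`. The helper `translate` is a hypothetical name.

```python
import re

def translate(text: str, pattern: str, replacement: str) -> str:
    """Apply one regex find/replace, as an output translation would."""
    return re.sub(pattern, replacement, text)

# Remove: delete every hyphen (pattern assumed, replacement empty).
removed = translate("123-45-6789", r"-", "")

# Add: insert a comma after any digit followed by groups of three digits.
added = translate("1234567", r"(\d)(?=(\d{3})+$)", r"\1,")

# Replace: swap literal dots for hyphens (the dot must be escaped).
replaced = translate("863.555.1212", r"\.", "-")

# Restructure: reorder "First Last" into "Last, First".
restructured = translate("John Smith", r"(\w+)(\s)(\w+)", r"\3, \1")

print(removed, added, replaced, restructured, sep="\n")
```

Leaving the dot unescaped in the "Replace" case would turn the whole phone number into hyphens, since `.` matches any character.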
