Kofax Transformation Modules - format locators and dynamic regular expressions - ...


Part 1: An introduction to format locators and regular expressions

Many of our customers use systems for automatic document classification and data extraction. These data capture systems extract metadata from electronic images (the scanned pages of documents, faxes or emails) and release the data and the document to business applications. A core part of these systems is a technique called freeform field extraction. Freeform extraction means that the search for metadata (for example an insurance number) works independently of the document layout.

This is the main principle of freeform extraction: each value we try to extract has a specific syntactic structure. As an example, the insurance number of an insurance company could have the following structure: YYYY/1234567890 (four digits for the year, a slash, then a number of at most 10 digits). Examples: 2012/45 or 2011/47123.

This insurance number may be written anywhere on a document. But it never appears without some context, because the customer or the clerk has to identify the number too. Therefore you will find words near the number such as “insurance number”, “ins.no.”, “Ins.nbr.”, “contract number”, … There is a spatial relation between the number and its describing text: the text may be written to the left, to the right, above or below the insurance number. Furthermore, the distance between text and number may be used as an attribute for the extraction.

codecentric uses ‘Kofax Transformation Modules’ (KTM) as one of its products for automatic classification and data extraction. KTM can be integrated as a module into the capture solution Kofax Capture (see Stefan Blank’s Blog).

KTM uses internal tools called ‘format locators’ to identify values. Within such a locator, you define the structure of a value (insurance number), the describing text (“insurance number”, “contract number”) and the spatial relation between value and text.

Here is a snippet of an example document with an insurance number (unfortunately a German document):

*** Remark: Versicherungsnummer = insurance number ***

A format locator for the extraction of the insurance number could be defined as follows (screenshots are from the KTM Project Builder):

A so-called regular expression describes the general structure of an insurance number: 20\d{2}/\d{1,10}

Year (four digits) / 1 to 10 digits: 2011/47123

Exactly this is described by the regular expression:

  • 20 are the first two digits of the year
  • \d{2} represents exactly two digits
  • / represents the character /
  • \d{1,10} represents a number with 1 to 10 digits
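Outside of KTM, the same pattern can be tried with any regular expression engine. The following is a minimal sketch using Python's `re` module; the sample text is made up for illustration and is not taken from the example document.

```python
import re

# The pattern from the text: "20", two digits, a slash, then 1 to 10 digits.
INSURANCE_NUMBER = re.compile(r"20\d{2}/\d{1,10}")

# Made-up sample text (assumption), mixing German and English labels.
sample_text = "Versicherungsnummer: 2011/47123, Tel. 0221/555123, see also 2012/45."

# findall returns every non-overlapping match, in order of appearance.
matches = INSURANCE_NUMBER.findall(sample_text)
print(matches)  # ['2011/47123', '2012/45']
```

The phone number 0221/555123 is not matched here because it contains no "20" followed by two digits and a slash.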

The expression 20\d{2}/\d{1,10} will find all matching strings anywhere on the document. Besides the insurance number, these could be other strings that happen to match the regular expression (phone numbers, bank codes, …). To make sure that only the insurance number is taken, the describing word(s) have to be defined within the format locator:

KTM-FL-EVAL-EN75

The line:

KTM-FL-EVAL-LINE-EN75

means, for example: the term “contract number” must be found to the west (left) of the matching number at a ‘near’ distance. If the term is found there, it scores 100 points. You can add all terms that may describe an insurance number.

By combining the regular expression with the describing terms, KTM is able to read the insurance numbers from all documents and to reject the improper matches, independent of the number’s position on the document. The winner is the match with the highest score (points).
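The idea of scoring candidates by nearby describing terms can be sketched in a few lines. This is not how KTM implements it: KTM evaluates the two-dimensional position of words on the page, while the sketch below only looks to the left within a single line of text. The function name `best_candidate`, the score values other than the 100 points mentioned above, and the distance limit are all assumptions made for illustration.

```python
import re

NUMBER = re.compile(r"20\d{2}/\d{1,10}")

# Describing terms come from the text; the 80-point value is an assumption.
KEYWORD_SCORES = {"insurance number": 100, "contract number": 100, "ins.no.": 80}
MAX_DISTANCE = 20  # assumed limit for "near", in characters


def best_candidate(line):
    """Return (score, value) for the highest-scoring match, or None."""
    lowered = line.lower()
    best = None
    for match in NUMBER.finditer(line):
        score = 0
        for keyword, points in KEYWORD_SCORES.items():
            # Search for the keyword to the left ("west") of the match.
            pos = lowered.rfind(keyword, 0, match.start())
            if pos != -1 and match.start() - (pos + len(keyword)) <= MAX_DISTANCE:
                score = max(score, points)
        # Matches without any describing term nearby are rejected.
        if score > 0 and (best is None or score > best[0]):
            best = (score, match.group())
    return best


line = "Phone 2045/998877   Contract number: 2011/47123"
print(best_candidate(line))  # (100, '2011/47123')
```

Both strings in the sample line match the regular expression, but only the second one has a describing term close by on its left, so the phone number is discarded.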

You can test this within the KTM Project Builder just by clicking the ‘Test’ button:

In real customer correspondence with an insurance company, the insurance number may be written in several different notations. Instead of 2011/47123 you may find 2011-47123, 2011 47123 or even 201147123. To match these numbers with the format locator, the regular expression is changed slightly in a real environment.

All of the above notations will be found by this regular expression:
20\d{2}.?\d{1,10} 

The dot in the middle of the expression represents any single character. The following question mark declares the preceding character as optional. With this definition KTM will find all of these:

2011/47123
2011-47123
2011 47123
201147123
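The relaxed expression can be checked against the four notations with a short Python sketch. Because the dot accepts any character, it also accepts separators that were never intended; the second pattern, `STRICTER`, is an assumption of this sketch (not part of the original project) that limits the optional separator to a slash, a hyphen or a space.

```python
import re

# The relaxed pattern from the text: any single optional character as separator.
RELAXED = re.compile(r"20\d{2}.?\d{1,10}")

# Assumed alternative: only "/", "-" or a space may separate year and number.
STRICTER = re.compile(r"20\d{2}[/\- ]?\d{1,10}")

notations = ["2011/47123", "2011-47123", "2011 47123", "201147123"]

for text in notations:
    # Both patterns accept all four notations from the text.
    print(text, bool(RELAXED.fullmatch(text)), bool(STRICTER.fullmatch(text)))

# The dot also lets unintended separators through; the character class does not.
print(bool(RELAXED.fullmatch("2011x47123")))   # True
print(bool(STRICTER.fullmatch("2011x47123")))  # False
```

Whether the looser or the tighter variant is preferable depends on the documents: OCR errors may turn a slash into an unexpected character, in which case the dot is more forgiving.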

In real customer projects the extracted insurance number is checked against the contract database. If the number exists, the number and the document (possibly with other extracted metadata) are electronically routed to the relevant clerk or business application. If the database check was not successful (or no insurance number was found), the document must be validated manually. KTM provides a validation module for this purpose, which can also be integrated into the Kofax Capture workflow.
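The routing decision described above can be sketched as follows. Everything here is hypothetical: the contract database is replaced by an in-memory set, the functions `normalize` and `route` and the two result strings are invented for this sketch, and the normalization assumes that the first four digits are always the year.

```python
import re
from typing import Optional, Set

# Accepts the notations discussed earlier and captures year and number.
NOTATION = re.compile(r"(20\d{2})[/\- ]?(\d{1,10})")


def normalize(raw: str) -> Optional[str]:
    """Bring an extracted value into the canonical YYYY/number form."""
    match = NOTATION.fullmatch(raw.strip())
    if match is None:
        return None
    return f"{match.group(1)}/{match.group(2)}"


def route(raw: Optional[str], known_contracts: Set[str]) -> str:
    """Decide what happens to a document after extraction."""
    number = normalize(raw) if raw else None
    if number is not None and number in known_contracts:
        return "route_to_clerk"
    # Nothing extracted, or the number is unknown: a person has to look at it.
    return "manual_validation"


contracts = {"2011/47123", "2012/45"}  # stand-in for the contract database
print(route("2011-47123", contracts))
print(route("2011/99999", contracts))
print(route(None, contracts))
```

The first call ends in automatic routing, the other two in manual validation.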

Not too long ago, I thought that it was possible to extract all metadata from a document with KTM by using format locators and regular expressions, as long as the document is not handwritten. Recently we had to set up a document classification/extraction project at a scan service provider who works for financial institutions. The challenge was to develop one project that works for several clients. We had to deal with document types where the described ‘static’ format locators could not deliver sufficient results. We needed some type of format locator whose regular expression could be modified at runtime (depending on client-specific data). As KTM provides a VB-compatible scripting language, and thanks to some knowledge of the KTM object model, we were able to master this challenge.

The second part of this blog series will show how you can dynamically change the regular expression of a format locator at runtime by using KTM’s scripting language.
