To Classify or Not to Classify
I was recently asked to help with a client’s KTM (Kofax Transformation Modules) project because they were not pleased with the percentage of extraction fields that were coming back valid and correct. My first question was, “Are you using subclasses?” The answer was, “No.” Subclassifying your top forms is an easy way to greatly improve your extraction results.
What I mean is that instead of trying to use a single locator to find data on all of your documents with a “one size fits all” approach, you can use subclasses to first classify the document and then tune locators specific to that form to look in a precise location for the information. For example, let’s say you need to find a “Case Number” on all of your forms. Some forms might have the words “Case Number” above the text you need to extract. Others might have the word “Number” to the left of the data. Another might not have any text around the data to key off of at all. It’s difficult to add enough rules to one locator to catch all the possible scenarios. Furthermore, there are times when adding rules to help find data on one form will actually hurt the results on another. Subclasses help by allowing you to create a specific locator that zeroes in on the information you are looking for.
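To make the idea concrete, here is a minimal Python sketch. It is not KTM code: the form names, the six-digit case number format, the regular expressions, and the `extract_case_number` helper are all hypothetical, and real KTM locators are configured in the project rather than written like this. It only illustrates why a rule tuned to an already-classified form can be simpler and safer than one rule that has to cover every layout.

```python
import re

# Hypothetical per-subclass rules (illustration only, not a KTM API).
CASE_NUMBER_RULES = {
    # FormA: the label "Case Number" sits on the line above the value.
    "FormA": re.compile(r"Case Number\s*\n\s*(\d{6})"),
    # FormB: the label "Number" sits to the left of the value.
    "FormB": re.compile(r"Number[:\s]+(\d{6})"),
}

# "One size fits all" fallback for unclassified documents:
# take the first six-digit run anywhere in the text.
GENERIC_RULE = re.compile(r"\b(\d{6})\b")


def extract_case_number(subclass, text):
    """Apply the rule for the given subclass, or the generic rule if none exists."""
    rule = CASE_NUMBER_RULES.get(subclass, GENERIC_RULE)
    match = rule.search(text)
    return match.group(1) if match else None


page = "Invoice 998877\nCase Number\n123456"
print(extract_case_number("FormA", page))         # 123456
print(extract_case_number("Unclassified", page))  # 998877 (wrong field)
```

On the same page, the generic rule grabs the invoice number because it has no way to know which six-digit value is the case number, while the rule written for the classified form picks the right one.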
How many subclasses are enough? I like to use the 80/20 rule. When you list all of your documents by volume, roughly 80% of your volume should come from about 20% of your forms. I know that there are exceptions to the rule, but this is a good place to start. I have done projects here at ImageSource where we subclassified the top 10, 20 or 50 forms. When forms are subclassified, the extraction averages go way up because you can use locators like the Advanced Zone Locator on structured forms. This locator is very helpful because once you draw a box around the data, you can set it to run its own cleanup and OCR of that zone rather than taking the original full-text OCR results. However, this is only really useful on forms that have been subclassified, since you know exactly where the data is on the page. Format Locators are also very helpful because you know how the data is structured in relation to the form, and you can create a regular expression to look for the text. This helps reduce the number of incorrect alternative results. For the rest of the forms that are not subclassified, you still need to create the miscellaneous locators, but the idea is that the majority of your documents are being subclassified and coming through with very high extraction rates.
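One way to pick the forms worth subclassifying is to rank them by volume and stop once the running total reaches the target share. The Python sketch below does that; the form names and page counts are made up for illustration, and `forms_covering` is a hypothetical helper, not part of KTM.

```python
def forms_covering(volumes, target_percent=80):
    """Return the highest-volume forms whose combined share reaches target_percent."""
    total = sum(volumes.values())
    selected = []
    running = 0
    # Walk the forms from highest to lowest volume.
    for form, count in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        # Stop as soon as the forms chosen so far cover the target share.
        if running * 100 >= target_percent * total:
            break
        selected.append(form)
        running += count
    return selected


# Made-up monthly page counts per form type.
volumes = {"FormA": 5000, "FormB": 3000, "FormC": 800, "FormD": 700, "FormE": 500}
print(forms_covering(volumes))  # ['FormA', 'FormB']
```

In this invented data set, two of the five forms account for 80% of the pages, so those two would be the first candidates for subclasses. Real volumes will rarely split this cleanly, which is why the rule is only a starting point.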
The other nice thing about KTM is that you can define locators at the parent class, and each of the subclasses will inherit a locator unless you specifically change it. An example of where this is helpful is a field that uses a database locator with a fuzzy lookup that applies to all the forms, where you don’t want to create a specific locator for every subclass. In addition, you can still use the incredible training power that KTM provides. When you use specific learning, it applies the training to the particular subclass. I have found that with KTM there are many different ways to “skin a cat,” and this is just one of the methods that can dramatically improve your extraction results.
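The inheritance behavior described above can be pictured as a lookup that checks the subclass first and falls back to the parent. The Python sketch below models that with `collections.ChainMap`; the field names and locator descriptions are hypothetical, and this is only an analogy for how KTM resolves locators, not its actual implementation.

```python
from collections import ChainMap

# Locators defined once on the parent class (hypothetical names).
parent_locators = {
    "VendorName": "database locator with fuzzy lookup",
    "CaseNumber": "generic format locator",
}

# FormA overrides only the field that needs a form-specific locator.
form_a_overrides = {"CaseNumber": "advanced zone locator for FormA"}

# Lookups check the subclass first, then fall back to the parent.
form_a = ChainMap(form_a_overrides, parent_locators)
form_b = ChainMap({}, parent_locators)  # FormB overrides nothing

print(form_a["CaseNumber"])  # advanced zone locator for FormA
print(form_a["VendorName"])  # database locator with fuzzy lookup
print(form_b["CaseNumber"])  # generic format locator
```

Only the overridden field differs between the two forms; everything else is defined once at the parent and shared.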
Brandon Konen
Systems Engineer
ImageSource, Inc.