Abbyy FlexiCapture For Transcript Processing – A More Detailed Review

Last time we took a broad look at using the Abbyy FlexiCapture product for College Transcript processing. This time I would like to start looking at some lower-level details of the product that show where FlexiLayouts end and Project Level Document Definitions begin.

Let’s start with some basic definitions. A Layout helps the Recognition Engine identify a document in a batch as belonging to a particular Document Definition. It also helps the Recognition Engine find the locations of the data to extract and place into fields that the user can then see and modify if necessary. A Document Definition determines the type of processing to perform on the document, the fields the document contains and the type of data those fields should hold.

Now for some details on the FlexiLayout Design Studio. The studio loads a sample image or document, runs OCR on it, and lets the designer pick out elements on the page such as text, separator lines, white space, or pictures. These elements are then used to locate the regions from which field data is extracted. Fields can be either single fields or table fields, and can be specified as repeatable if they occur more than once.

In the case of a Transcript, Student Name, Date of Birth and Student Identifier are usually single fields located near the top of a page. Sessions and Course information are more likely to be a table of fields that repeats down the page.

To capture this data into the correct field, a fixed identifier like Static Text must be used to limit the search for the actual data to a specific region. For Student Name, this might be a static text label to the left of the name, which then serves as an anchor. Once the anchor is found, the field region is defined relative to it: an x and y offset from the anchor location positions a box around the information to extract. The same steps are used to locate the data for all the other fields to capture from the image, including the table fields.
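To make the anchor-and-offset idea concrete, here is a minimal sketch in Python. It is not FlexiLayout code and does not use any ABBYY API; the Box type, the helper functions, the sample OCR words and the pixel values are all made up for illustration. It only shows the geometry: find the static text label, offset from it, and collect the words that fall inside the resulting box.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixels, origin at the top-left (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def find_anchor(words, label: str) -> Optional[Box]:
    """Return the box of the first OCR word whose text matches the static label."""
    for text, box in words:
        if text.strip().lower() == label.lower():
            return box
    return None


def field_region(anchor: Box, dx: int, dy: int, width: int, height: int) -> Box:
    """Place the extraction box at an (x, y) offset from the anchor's top-right corner."""
    left = anchor.right + dx
    top = anchor.top + dy
    return Box(left, top, left + width, top + height)


def words_in(words, region: Box) -> str:
    """Join, left to right, the OCR words that lie fully inside the region."""
    hits = [
        (box.left, text)
        for text, box in words
        if box.left >= region.left and box.right <= region.right
        and box.top >= region.top and box.bottom <= region.bottom
    ]
    return " ".join(text for _, text in sorted(hits))


# Made-up OCR output for the top of a transcript page.
ocr_words = [
    ("Name:", Box(40, 100, 110, 120)),
    ("Jane", Box(130, 100, 180, 120)),
    ("Doe", Box(190, 100, 235, 120)),
    ("DOB:", Box(40, 140, 95, 160)),
]

anchor = find_anchor(ocr_words, "Name:")
region = field_region(anchor, dx=10, dy=-5, width=300, height=30)
student_name = words_in(ocr_words, region)
print(student_name)
```

In the real product the designer sets up these relationships in the studio rather than in code, but the underlying idea is the same: the label is easy to find, and the data sits at a predictable distance from it.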

Once the layout identifies where the fields are located, it is transferred over and serves as the basis for a Document Definition. The fields identified in the layout are created in the Document Definition with the same names and data types as in the layout. At this point, you can use the Document Definition as is or modify it to add additional fields and/or data validation scripts. This turns out to be a very useful feature: not all transcripts contain exactly the same data, but releasing the data to a backend system or database does require consistent field names and types. So if one Document Definition for a transcript has the Student SSN, then all Document Definitions for other transcripts should have a Student SSN, even if the actual transcript image does not contain such a value.
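The consistency requirement can be pictured with a small Python sketch. The field names and the to_common_record helper are hypothetical and are not part of FlexiCapture; the point is only that every transcript, whatever its source layout, ends up with the same set of fields, with blanks where the image had no value.

```python
# Hypothetical common field set shared by every transcript Document Definition.
COMMON_FIELDS = ("StudentName", "DateOfBirth", "StudentId", "StudentSSN")


def to_common_record(extracted: dict) -> dict:
    """Return a record with every common field, blank when the transcript lacks it."""
    return {name: extracted.get(name, "") for name in COMMON_FIELDS}


# A transcript from a school that does not print the SSN.
school_a = {"StudentName": "Jane Doe", "DateOfBirth": "01/02/1990", "StudentId": "A123"}
record = to_common_record(school_a)
print(record)
```

With this shape, the backend system or database sees the same columns for every transcript and never has to special-case a school.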

In addition, data validation scripts written in either VBScript or JScript can correct the data the Recognition Engine extracts so the operator does not have to perform this work. For example, when reading the Course data for Units Attempted, the value should be 3.00 but in a lot of cases is read as 3. 00 with a space in the middle. A data validation script can be written to automatically remove the space and get the correct value. A data validation script can also help to split a field into multiple values. For instance, the Student Name is most likely defined as a single field that contains the whole name of the student, but many backend systems or databases like to have the name broken down into its parts, such as First Name, Middle Name and Last Name. A data validation script can therefore be written to split the name into its parts and assign the data to the separate fields.
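The logic of those two cleanups is simple enough to sketch. The example below is plain Python for illustration only; an actual FlexiCapture script would be written in the product's own scripting languages against its field objects, which are not shown here. The function names and output field names are hypothetical, and the name splitter assumes the name is either "First Middle Last" or "Last, First Middle".

```python
import re


def fix_units(raw: str) -> str:
    """Remove whitespace that OCR inserted inside a number, e.g. '3. 00' -> '3.00'."""
    return re.sub(r"\s+", "", raw)


def split_name(full_name: str) -> dict:
    """Split a full name into parts.

    Assumes 'First Middle Last' or 'Last, First Middle'. A single word with
    no comma is treated as the last name.
    """
    if "," in full_name:
        last, _, rest = full_name.partition(",")
        last = last.strip()
        given = rest.split()
    else:
        parts = full_name.split()
        given = parts[:-1]
        last = parts[-1] if parts else ""
    first = given[0] if given else ""
    middle = " ".join(given[1:])
    return {"FirstName": first, "MiddleName": middle, "LastName": last}


print(fix_units("3. 00"))
print(split_name("Doe, Jane Ann"))
```

Real names are messier than this (suffixes, multi-word last names), so a production script would need more rules, but even a simple version saves the operator a lot of retyping.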

Again, we have only just started to scratch the surface of the capabilities and features of the Abbyy FlexiCapture product. To really get a feel for this product requires a week-long training course, which is well beyond the scope of this blog. If you have any interest in how this product can help you, then by all means contact an ImageSource Sales Representative through our web site at www.imagesourceinc.com.

 

chrishillenburg