Training Tesseract

Tesseract is retrainable. Documentation on the training process is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. A shell script implementing the training process is available in the appendix.

This covers only the purely technical part of the process: it describes how to compile training data into the Tesseract format, not how to develop training data that yields optimal recognition results.

The basic requirements for training a font/language combination are:

1) A page image (usually black and white) paired with a text file in box format. Each line of the box file gives a character and the bounding box of one occurrence of that character in the page image. A box file contains lines like:

D 124 2906 150 2954 0
e 153 2908 171 2950 0
P 187 2900 237 2959 0
a 240 2908 263 2947 0
e 263 2909 285 2947 0
l 285 2909 299 2958 0
b 299 2911 327 2958 0
r 327 2901 345 2951 0
u 348 2910 378 2950 0

2) Word lists: a list of frequent words and a broader list that aims at more comprehensive coverage of the language.

3) Some small configuration files.
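To make the box format under 1) concrete, the following Python sketch parses box lines into records. It assumes the field order given in the Tesseract 3 training documentation (character, left, bottom, right, top, page number, with the coordinate origin at the bottom-left corner of the image). The names Box, parse_box_line and parse_box_text are our own and are not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (assumed field order, see the text above)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so that the character field stays intact
    # even if it consists of more than one code point.
    parts = line.strip().rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError("not a box line: %r" % line)
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    # Blank lines are skipped; every other line must be a valid box line.
    return [parse_box_line(l) for l in text.splitlines() if l.strip()]


SAMPLE = """D 124 2906 150 2954 0
e 153 2908 171 2950 0
P 187 2900 237 2959 0
"""

boxes = parse_box_text(SAMPLE)
for b in boxes:
    # Width and height follow directly from the corner coordinates.
    print(b.char, b.right - b.left, b.top - b.bottom)
```

Records of this kind are a convenient starting point for sanity checks on a box file before it is passed to the training utilities.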

From these inputs, the command line utilities supplied with the basic Tesseract distribution can build a new trained data bundle. An example script for the process is given in the appendix.

This does not yet give us guidelines on how to proceed when we want to train for a specific collection. Ideally, one might hope that a set of images together with a set of ground truth transcriptions might be enough to train the engine. In practice, this is not so easy.

First, since there is no practical tool available to align the images with a plain text ground truth transcription, we need to enhance our ground truth with character bounding boxes or even bounding polygons. Moreover, there are some restrictions: character bounding boxes should not overlap, and there should be only one font per training image/box file pair.
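The non-overlap restriction can be checked mechanically before training. The sketch below is a minimal illustration in Python, not part of the Tesseract tooling. It assumes boxes are given as (character, left, bottom, right, top) tuples and that boxes which merely share an edge, as neighbouring characters in a box file often do, do not count as overlapping. The helper names overlaps and find_overlaps are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

# (character, left, bottom, right, top); layout assumed for this sketch
Box = Tuple[str, int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    # Two boxes overlap if they share interior area on both axes.
    # Strict comparisons mean that touching edges are not reported.
    return a[1] < b[3] and b[1] < a[3] and a[2] < b[4] and b[2] < a[4]


def find_overlaps(boxes: List[Box]) -> List[Tuple[Box, Box]]:
    # Quadratic pairwise check; adequate for a single page of boxes.
    return [(a, b) for a, b in combinations(boxes, 2) if overlaps(a, b)]


page_boxes = [
    ("a", 240, 2908, 263, 2947),
    ("e", 263, 2909, 285, 2947),  # touches "a" at x = 263, no overlap
    ("l", 285, 2909, 299, 2958),
]

for a, b in find_overlaps(page_boxes):
    print("overlap:", a, b)
```

Whether touching boxes should be tolerated is a choice of this sketch; a stricter check would use non-strict comparisons.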

More importantly, while one might expect damaged instances of a glyph shape to be informative for the character classification process as well, this appears not to be the case. Tesseract assumes that its training material represents prototypical shapes rather than possibly noisy instances.

Several solutions have been developed to bridge the gap between ground truth data and a Tesseract trained data bundle.

User interfaces have been developed to create box files, or to manually correct automatically created ones, for instance jTessBoxEditor (http://vietocr.sourceforge.net/training.html) or the web-based Cutouts (http://wlt.synat.pcss.pl/cutouts).

We also mention two approaches based on the PAGE XML ground truth format (http://www.primaresearch.org/tools.php). This format allows user-friendly development of ground truth material with the option of specifying coordinates for text regions, lines, words and individual glyphs with the Aletheia tool (ibidem), which can be used freely for non-commercial purposes.

  • The Poznan Supercomputing and Networking Center (PSNC) has developed two tools that automatically generate training data from an image and a PAGE XML ground truth file with glyph coordinate information. The first tool cuts the glyphs out of the image, creating individual glyph images. After this stage, noisy character images may be removed. The second tool recombines the glyphs into a “cleaner” input image which can be used in the Tesseract training process, and also generates the required box file. The use of these tools is documented in the file IC-Tesseracttrainingworkflow-200913-0919-9296.pdf, included in the training package.
  • In the EMOP project, the tool Franken+ has been developed. Given a binarised image and the corresponding XML file generated with Aletheia, Franken+ extracts an individual TIFF image for each letter marked in Aletheia, giving the user the opportunity to hand-pick the best instances of each letter (thus producing a “font” consisting only of hand-picked images). Using this font, Franken+ can create synthetic TIFF images of text “printed” in this font, with corresponding BOX files. Franken+ then uses these synthetic images and BOX files to automate the Tesseract font training process for documents printed in the relevant historic font, and allows the user to test the resulting font.
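Both approaches share a recombination step: selected glyph images are pasted into a synthetic page, and the matching box file is written at the same time. The Python sketch below illustrates that idea only. It is not the PSNC or Franken+ implementation, it represents glyphs as small 0/1 pixel grids instead of real image files, and it assumes the box coordinate convention of the Tesseract 3 training documentation (origin at the bottom-left corner). The function compose_line and its parameters gap and margin are hypothetical.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # rows from top to bottom, 1 = ink, 0 = background


def compose_line(text: str, glyphs: Dict[str, Bitmap],
                 gap: int = 2, margin: int = 1) -> Tuple[Bitmap, List[str]]:
    """Paste one glyph bitmap per character of text onto a blank line image
    and return the image together with the corresponding box file lines."""
    if not text:
        raise ValueError("text must not be empty")
    height = max(len(glyphs[c]) for c in text) + 2 * margin
    width = (sum(len(glyphs[c][0]) for c in text)
             + gap * (len(text) - 1) + 2 * margin)
    page = [[0] * width for _ in range(height)]
    box_lines = []
    x = margin
    for c in text:
        g = glyphs[c]
        gh, gw = len(g), len(g[0])
        # Align the bottom of every glyph on a common line above the margin.
        top_row = height - margin - gh
        for r in range(gh):
            for col in range(gw):
                page[top_row + r][x + col] = g[r][col]
        # Box coordinates are measured from the bottom-left corner.
        box_lines.append("%s %d %d %d %d 0" % (c, x, margin, x + gw, margin + gh))
        x += gw + gap
    return page, box_lines


toy_glyphs = {
    "a": [[1, 1], [1, 1]],
    "b": [[1], [1], [1]],
}
image, lines = compose_line("ab", toy_glyphs)
print("\n".join(lines))
```

Because the gap between glyphs is positive, the generated boxes never overlap, and because every glyph comes from one hand-picked set, each synthetic page contains a single “font”, in line with the restrictions mentioned earlier.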

For a comparison of the trainability of FineReader and Tesseract, see for instance the case study at http://lib.psnc.pl/dlibra/doccontent?id=358, which we include with the current SUCCEED training materials for Tesseract.