Tesseract Training Script

Building a Tesseract trained data bundle

This bash script assumes the presence, in the current directory, of
  1. A file ‘files.lst’ containing the base names of the images and box files (for instance, the list contains the line “image1” when image1.png and image1.box are the respective image and box files)
  2. For each line of this file containing a string s, an image file s$EXTENSION and a box file s.box (note that $EXTENSION already includes the leading dot)
  3. A word list named words.list
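
The prerequisites above can be checked before a training run starts. The following is a minimal sketch, not part of the original script: the function name check_training_inputs and its three arguments (list file, image extension including the dot, word list) are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): verify that the
# inputs described above exist before training is started.
# Usage: check_training_inputs LIST_FILE EXTENSION WORDLIST
check_training_inputs() {
    local list=$1 ext=$2 words=$3 missing=0 name f
    if [ ! -f "$list" ]; then
        echo "missing list file: $list" >&2
        return 1
    fi
    if [ ! -f "$words" ]; then
        echo "missing word list: $words" >&2
        missing=1
    fi
    # Read one base name per line; also accept a last line without newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue    # skip blank lines
        for f in "$name$ext" "$name.box"; do
            if [ ! -f "$f" ]; then
                echo "missing file: $f" >&2
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

With the settings used below, a call such as check_training_inputs files.lst .png words.list || exit 1 would stop the run as soon as an image, box file or the word list is missing, reporting every missing file on standard error.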


# settings for variables
FONTNAME=antiqua
# paths with Tesseract's binaries and shared data
TESS_BIN=/home/jesse/opt/bin
export TESSDATA_PREFIX=/home/jesse/opt/share/
TESS_SHARE=/home/jesse/opt/share/tessdata/
# document specific settings
LANGNAME=emp # Early Modern Polish
NAME=$LANGNAME.$FONTNAME.exp
EXTENSION=.png
echo "combined 0 0 0 0 1" >font_properties &&
$TESS_BIN/unicharset_extractor *.box &&
for x in $(cat files.lst)
do
   echo "tesseract training $x$EXTENSION"
   $TESS_BIN/tesseract $x$EXTENSION $NAME$x nobatch box.train.stderr
done
rm -f combined.tr
cat *.tr >combined.tr
$TESS_BIN/mftraining -F font_properties -U unicharset -O $LANGNAME.unicharset combined.tr && $TESS_BIN/cntraining combined.tr || exit
mv pffmtable $LANGNAME.pffmtable
mv normproto $LANGNAME.normproto
mv inttemp $LANGNAME.inttemp
$TESS_BIN/wordlist2dawg words.list $LANGNAME.word-dawg $LANGNAME.unicharset
$TESS_BIN/combine_tessdata $LANGNAME.
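
The last command writes the bundle $LANGNAME.traineddata (here emp.traineddata) to the current directory. To use it, the file has to be placed in a tessdata directory that Tesseract searches. The following is a minimal sketch of that step; the function name install_traineddata is hypothetical, and the example call in the comments assumes the variables defined in the script above and an image named image1.png.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): copy the bundle
# LANG.traineddata from the current directory into a tessdata directory.
# Usage: install_traineddata LANG TESSDATA_DIR
install_traineddata() {
    local lang=$1 dest=$2
    local bundle=$lang.traineddata
    # Refuse to install a missing or empty bundle.
    if [ ! -s "$bundle" ]; then
        echo "no bundle found: $bundle" >&2
        return 1
    fi
    mkdir -p "$dest" && cp "$bundle" "$dest/"
}

# Example call (commented out so that sourcing this snippet has no side
# effects); -l selects the language bundle to recognise with:
#   install_traineddata "$LANGNAME" "$TESS_SHARE" &&
#   "$TESS_BIN/tesseract" image1.png image1 -l "$LANGNAME"
```

Copying rather than moving keeps the freshly built bundle in the working directory, so a failed recognition test can be compared against a rebuilt version.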