Cutouts and page-generator (Tesseract OCR customization)

Author: Adam Dudczak (PSNC)

Tesseract (https://code.google.com/p/tesseract-ocr/) is a well-known open-source OCR application. Among other things, it features layout analysis and training capabilities. Because Tesseract is a command-line tool, it is easy to integrate into a larger digitisation workflow. This document describes how to create a custom recognition profile for a specific kind of document using a web application called Cutouts (http://wlt.synat.pcss.pl/cutouts) and the command-line tools called page-generator (https://github.com/psnc-dl/page-generator).