Output

Output formats

Two full-text output formats are available: plain text and hOCR. hOCR is an open standard that defines a data format for representing OCR output. It aims to embed layout, recognition confidence, style and other information in the recognized text itself, and does so by encoding this data in standard HTML. Neither format is entirely suitable for deployment in digital libraries, where XML-based solutions are typically preferred. Conversion of hOCR to ALTO, or direct ALTO output, is an obvious desideratum, but no such utility seems to be available.
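To illustrate how hOCR carries layout and confidence data inside ordinary HTML, the following minimal sketch reads word-level bounding boxes and confidences back out of an hOCR fragment. It assumes that words are marked up as span elements with the class ocrx_word and that their title attribute holds bbox and x_wconf properties, as hOCR output commonly does. The sample markup and the names HocrWordParser, parse_title and extract_words are invented for illustration and are not part of any tool.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE = """<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 120 20 300 50; x_wconf 87'>world</span>
</span>"""


def parse_title(title):
    """Split an hOCR title attribute into its bbox and x_wconf properties."""
    props = {"bbox": None, "conf": None}
    for field in title.split(";"):
        parts = field.split()
        if not parts:
            continue
        if parts[0] == "bbox" and len(parts) == 5:
            # bbox is given as: left top right bottom (in pixels)
            props["bbox"] = tuple(int(p) for p in parts[1:])
        elif parts[0] == "x_wconf" and len(parts) == 2:
            props["conf"] = float(parts[1])
    return props


class HocrWordParser(HTMLParser):
    """Collects text, bounding box and confidence for each ocrx_word span.

    Simplification: assumes word spans do not contain nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._current = parse_title(attrs.get("title") or "")
            self._current["text"] = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    """Return a list of dicts with keys text, bbox and conf."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

A converter to an XML format such as ALTO would need the same information (text, coordinates, confidence), so an extraction step of this kind could serve as its first stage.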

Another output format, relevant to the training process, is the box format, which gives a bounding box for each recognized character.
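As a rough sketch of what such a file contains, the snippet below parses box data into records. It assumes the Tesseract-style layout of one character per line in the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. That layout is an assumption and is not spelled out above. The sample data and the names CharBox, parse_box_line and parse_box_text are hypothetical.

```python
from typing import NamedTuple

# Hypothetical box data, written by hand for this example.
SAMPLE_BOXES = "T 36 92 53 115 0\nh 55 92 66 116 0\ne 68 92 79 108 0\n"


class CharBox(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of the assumed 'symbol left bottom right top page' form."""
    # Split from the right so the five numeric fields are always separated,
    # whatever the symbol on the left looks like.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"malformed box line: {line!r}")
    return CharBox(symbol, *(int(n) for n in numbers))


def parse_box_text(text):
    """Parse all non-empty lines of a box file's contents."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for box in parse_box_text(SAMPLE_BOXES):
        width = box.right - box.left
        height = box.top - box.bottom
        print(box.symbol, width, height)
```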