Output

Output formats

The supported output formats are: RTF, DOCX, XLS/XLSX, PPTX, PDF, PDF/A, HTML, TXT/CSV, FineReader (ABBYY) XML, ALTO XML, FB2 (FictionBook, an ebook format), EPUB (an ebook format), and ODT (OpenDocument Text, the OpenOffice document format).

In digitization workflows, XML-based output is usually preferred. It allows flexible handling of the recognized text and, in particular, stores the location of text regions, lines, words, and (possibly) characters in the source image. This makes it possible, for instance, to highlight search terms in the original image in a retrieval application.

XML export is not possible in the two desktop editions of FineReader or in the online OCR client. The Recognition Server, the cloud API, and the Engine SDK do support it. Two XML output schemas are supported:

1. ABBYY XML: an XML format defined by the company (cf., for instance, www.abbyy-developers.com/en:tech:features:xml). This format allows the highest degree of control over the output, with options to store detailed glyph properties and alternative recognitions of words and characters.

2. ALTO XML: a widespread standard for optical character recognition results, cf. www.loc.gov/standards/alto/. ALTO export is possible in recent releases of the SDK and the extended Recognition Server, cf. www.abbyy-developers.eu/en:tech:features:alto. It is currently not possible to export glyph coordinates in this format.
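To illustrate how word locations stored in ALTO can support search-term highlighting, here is a minimal Python sketch. It assumes the standard ALTO `String` element with its `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes. The sample document and the function names `local_name` and `find_word_boxes` are hypothetical and made up for illustration; they are not FineReader output or part of any ABBYY API.

```python
import xml.etree.ElementTree as ET

# Small hand-written ALTO fragment for illustration (not real OCR output).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1" HPOS="100" VPOS="200" WIDTH="600" HEIGHT="40">
            <String CONTENT="Output" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP/>
            <String CONTENT="formats" HPOS="300" VPOS="200" WIDTH="200" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace, which differs between ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def find_word_boxes(alto_xml, term):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for every word equal to `term`."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        # Case-insensitive exact match on the recognized word.
        if elem.get("CONTENT", "").lower() == term.lower():
            boxes.append(tuple(
                float(elem.get(key, 0))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            ))
    return boxes


print(find_word_boxes(SAMPLE_ALTO, "formats"))
```

The returned boxes are expressed in whatever unit the ALTO file declares in its `MeasurementUnit` element, so a viewer may need to convert them to image pixels before drawing a highlight.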

If alternative formats are required for your digitization workflow, there are two main options:

1. Conversion of ALTO or ABBYY XML to your desired format. Cf., for instance, able.myspecies.info/abbyy-xml-tei-xml.

2. Implementation of an SDK application that exports the recognition result directly. One example is the PAGE XML exporter developed in the IMPACT project.
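The first option, converting an XML export to another format, can be sketched in a few lines of Python. The example below turns ALTO into plain text with one output line per `TextLine`. It is only a sketch under simplifying assumptions: it ignores hyphenation (`HYP`) elements and reading-order issues, the sample input is hand-written, and the function names `local_name` and `alto_to_text` are hypothetical. A real converter to a richer target format would follow the same traversal pattern.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment for illustration (not real OCR output).
ALTO_INPUT = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="Output"/>
            <SP/>
            <String CONTENT="formats"/>
          </TextLine>
          <TextLine ID="L2">
            <String CONTENT="are"/>
            <SP/>
            <String CONTENT="listed"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace, which differs between ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Convert ALTO to plain text, one output line per TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Collect the recognized words of this line in document order.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


print(alto_to_text(ALTO_INPUT))
```

Matching on local element names rather than a fixed namespace keeps the sketch usable across ALTO schema versions, at the cost of not validating the input.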