An Advanced Document Layout and Text Ground-Truthing System for Production Environments

Aletheia is a comprehensive tool for the semi-automated production of ground truth and the annotation of document images at page level (Unicode text, layout, metadata, reading order, layers, border, print space, etc.).

Abstract


Large-scale digitisation has led to a number of new possibilities with regard to adaptive and learning-based methods in the field of Document Image Analysis and OCR. For ground truth production of large corpora, however, there is still a gap in terms of productivity. Ground truth is crucial not only for training and evaluation at the development stage of tools but also for quality assurance in production workflows for digital libraries. This paper describes Aletheia, an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. It aids the user with a number of automated and semi-automated tools, some of which were developed and improved based on feedback from major libraries across Europe and from their digitisation service providers, which use the tool in a production environment. Novel features include, among others, support for top-down ground truthing with sophisticated split and shrink tools, as well as bottom-up ground truthing that supports the aggregation of lower-level elements into more complex structures. Special features have been developed to support working with the complexities of historical documents. The integrated rules and guidelines validator, in combination with powerful correction tools, enables efficient production of highly accurate ground truth.

Aletheia is a system that offers:

  • System Features
      • Mature XML schema which is part of the PAGE format framework
      • Targets production environments (large-scale digitisation)
  • Built-in Image Operations
      • Binarisation
      • Noise Removal
  • Ground Truth Production
      • Border and Print Space
      • Layout Regions
      • Modification of Layout Regions (merge, split, edit outline)
      • Region Attributes
      • Text Content (Unicode with virtual keyboard for special characters)
      • Reading Order
      • Layers
      • Text Lines, Words and Glyphs, each with text content
  • Validation against Ground Truthing Rules and Guidelines
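
To give a feel for what ground truth in the PAGE format looks like to downstream code, the following minimal Python sketch reads layout regions, their outlines, their Unicode text and the reading order from a small PAGE-style XML document. The sample document, its file name, the region ids and the helper names (`parse_points`, `regions_in_reading_order`) are illustrative assumptions, not output produced by Aletheia. The namespace and element names are assumed to follow the 2013-07-15 version of the PAGE content schema; other schema versions may differ.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the 2013-07-15 version of the PAGE content schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, hand-written sample; not real Aletheia output.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1" type="paragraph">
      <Coords points="100,400 1900,400 1900,900 100,900"/>
      <TextEquiv><Unicode>Body text of the page.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="heading">
      <Coords points="100,100 1900,100 1900,300 100,300"/>
      <TextEquiv><Unicode>A Heading</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def regions_in_reading_order(xml_text):
    """Return the text regions of a page, sorted by the stored reading order."""
    root = ET.fromstring(xml_text)
    page = root.find("pc:Page", NS)

    # Index every text region by its id.
    regions = {}
    for region in page.findall("pc:TextRegion", NS):
        coords = region.find("pc:Coords", NS)
        text = region.find("pc:TextEquiv/pc:Unicode", NS)
        regions[region.get("id")] = {
            "id": region.get("id"),
            "type": region.get("type"),
            "outline": parse_points(coords.get("points")),
            "text": text.text if text is not None else "",
        }

    # Follow the indexed region references of the ordered group.
    refs = page.findall(
        "pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS
    )
    refs.sort(key=lambda ref: int(ref.get("index")))
    return [regions[ref.get("regionRef")] for ref in refs]


if __name__ == "__main__":
    for region in regions_in_reading_order(SAMPLE):
        print(region["id"], region["type"], region["text"])
```

In this sample the heading region is stored after the paragraph in the file, but the reading order places it first, which is why reading order is kept as separate information rather than inferred from element position.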

Availability

For information on availability and licensing, please contact PRImA Research.