GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

Impact Centre of Competence

Description
GT4HistOCR contains ground truth for research in Optical Character Recognition (OCR) applied to historical prints in German Fraktur and Early Modern Latin. The ground truth consists of pairs: an image of a single printed line as it appears on the book page (*.png) and its diplomatic transcription (*.gt.txt), a UTF-8 string that preserves the character forms (glyphs) as far as the Unicode standard allows. These pairs of line images and transcriptions can be used directly to train recognition models with, e.g., the open source OCR engines OCRopy or Tesseract. A total of 313,173 ground truth lines are provided.
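As a rough illustration of the pair layout described above, the following Python sketch walks a directory tree and yields each line image together with its transcription. It assumes that a transcription named NAME.gt.txt sits in the same directory as an image named NAME.png; the actual file naming inside the archive may differ (for example, extra suffixes before .png), so the matching rule should be checked against the downloaded data. The function name iter_line_pairs is hypothetical and not part of the dataset or of any OCR engine.

```python
from pathlib import Path
from typing import Iterator, Tuple

GT_SUFFIX = ".gt.txt"  # transcription suffix named in the description


def iter_line_pairs(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, transcription) for every line that has both files.

    Assumption: NAME.gt.txt is paired with NAME.png in the same directory.
    """
    for gt_path in sorted(root.rglob("*" + GT_SUFFIX)):
        stem = gt_path.name[: -len(GT_SUFFIX)]
        image_path = gt_path.with_name(stem + ".png")
        if not image_path.is_file():
            # No image under the assumed naming scheme; skip this line.
            continue
        # Transcriptions are UTF-8; drop only the trailing newline so that
        # historical glyphs and inner spacing are left untouched.
        text = gt_path.read_text(encoding="utf-8").rstrip("\n")
        yield image_path, text
```

Calling list(iter_line_pairs(Path("path/to/unpacked/corpus"))) on an unpacked copy would give the pairs in sorted order; counting them is a quick way to see how many lines the assumed naming rule actually matches.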
Dataset content type
Ground truth
Images
Dataset scope
Text recognition
OCR
Language
German (Fraktur), Latin (Early Modern)
Size
313,173 ground truth lines
Dataset License
CC - Attribution or equivalent
Dataset owner
not available
Dataset distributor
Springmann, Uwe; Reul, Christian; Dipper, Stefanie; Baiter, Johannes
Link
https://zenodo.org/record/1344132#.XBdmGPZKg_U
Contact
not available