Dataset of ICDAR 2019 Competition on Post-OCR Text Correction

Impact Centre of Competence

  • Description: The corpus comprises 22M OCRed characters together with the corresponding Gold Standard (GS). The documents come from several digital collections held by, among others, the National Library of France (BnF) and the British Library (BL). The GS comes both from BnF's internal projects and from external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.
  • Scope: OCR post-correction
  • License: Various licenses
  • Content type: Ground truth and OCRed text
  • Size: 22M characters
  • Language:
  • Owner: Various owners
  • Link: https://zenodo.org/record/3515403