The Impact Centre of Competence dataset contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment.
A carefully selected subset of these images has been reproduced with accompanying “ground truth”. In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This record can be compared with the output of an OCR engine to assess the engine’s accuracy and to judge how significant any deviation from ground truth is in that instance.
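As an illustration of comparing OCR output with ground truth, the sketch below computes a character error rate: the edit distance between the two strings divided by the length of the ground truth. This is one common way to quantify deviation, not necessarily the measure used by the Impact Centre of Competence; the function names `levenshtein` and `character_error_rate` are hypothetical and chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalised by the ground-truth length (illustrative metric)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)
```

For example, `character_error_rate("ground truth", "gr0und truth")` counts one substituted character out of twelve. A single rate like this does not capture how important a particular error is, so in practice it would likely be complemented by word-level or task-specific measures.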
The ground truth provided by the Impact Centre of Competence is stored and exchanged via XML instances in the Page Analysis and Ground-truth Elements (PAGE) format, which was developed by the University of Salford and is maintained at http://schema.primaresearch.org/PAGE. A paper explaining the development of PAGE was delivered by the University of Salford at ICPR 2010 and is available here.
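A minimal sketch of reading text from a PAGE instance with Python's standard library follows. It assumes the usual PAGE layout in which each `TextRegion` element carries a `TextEquiv` child holding a `Unicode` element with the transcription. The embedded sample document, its namespace version and the helper names are assumptions made for illustration; check the schema version used by the files you actually download. Tags are matched by local name so that the sketch does not depend on a particular schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real PAGE files carry more metadata.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def region_texts(page_xml: str) -> list[tuple[str, str]]:
    """Return (region id, region-level text) for every TextRegion element."""
    root = ET.fromstring(page_xml)
    results = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        text = ""
        # Only look at direct children, so line- or word-level
        # TextEquiv elements nested deeper are ignored here.
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        results.append((elem.get("id", ""), text))
    return results
```

Calling `region_texts(SAMPLE)` yields one pair per text region. The extracted strings could then be compared against an OCR engine's output for the same image.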
The Impact dataset is mainly distributed under an Attribution, Non-Commercial, Share-Alike license, but please check each dataset for details of its licensing scheme.
A copy of this dataset with further browsing features can also be found at https://www.primaresearch.org/datasets